ATTRIB 2023 (@ NeurIPS)

1st Workshop on Attributing Model Behavior at Scale

Friday, December 15th, 2023

New Orleans Convention Center (Rooms 271-273)

Accepted papers: OpenReview

Contact info: attrib-neurips23 [at] googlegroups [dot] com

What makes ML models tick? How do we attribute model behavior to the training data, algorithm, architecture, or scale used in training?

Recently developed algorithmic innovations and large-scale datasets have given rise to machine learning models with impressive capabilities. However, there is much left to understand about how these different factors combine to produce observed behaviors. For example, we still do not fully understand how the composition of training datasets influences downstream model capabilities, how to attribute model capabilities to subcomponents inside the model, or which algorithmic choices really drive performance.

A common theme underlying all these challenges is model behavior attribution. That is, the need to tie model behavior back to factors in the machine learning pipeline—such as the choice of training dataset or particular training algorithm—that we can control or reason about. This workshop aims to bring together researchers and practitioners with the goal of advancing our understanding of model behavior attribution.

Call for Papers

Submissions open August 1st!
We are soliciting papers along two tracks: a main track and an idea track. Along both tracks, we welcome submissions pertaining to any aspect of model behavior attribution.

Submission Instructions

  1. Format submissions as follows:
    • 3-6 pages (main track) or 2-4 pages (idea track)
    • NeurIPS 2023 paper formatting (download here)
    • Appendix included in the same PDF as the main body
    • No Appendix page limit
  2. When ready, submit to OpenReview (note our workshop is non-archival)
  3. (Optional) Camera-ready: upload to OpenReview using style files .sty .tex

Important Dates

August 1: Submission portal opens

October 2 (AOE): *Extended* Deadline for both idea track and main track papers

October 25: Decision notifications

November 18 (AOE): Camera-ready deadline

December 15: Workshop!


Conference Schedule
Welcome and Opening Remarks
Understanding ChatGPT’s behavior drift over time (James Zou)
Abstract: GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time, particularly in their ability to follow instructions and in their chain-of-thought behaviors. We further investigate safety training as a factor that can contribute to behavior drift.
What does scale give us: Why we are building a ladder to the moon (Sara Hooker)
Abstract: A talk about what we know about the role of scale in conferring valuable generalization properties. I will present some background, some of our work on understanding the role of scale (in both data and model size), and some thoughts on how we can move away from the painfully inefficient formula of simply scaling capacity.
Coffee break and posters
Contributed talks
Unifying Corroborative and Contributive Attributions in Large Language Models (Teddi Worledge)
Abstract: As businesses, products, and services spring up around large language models, the trustworthiness of these models hinges on the verifiability of their outputs. However, methods for explaining language model outputs largely fall into two distinct fields of study which both use the term "attribution" to refer to entirely separate techniques: citation generation and training data attribution. In many modern applications, such as legal document generation and medical question answering, both types of attributions are important. In this work, we argue for and present a unified framework of large language model attributions. We show how existing methods of different types of attribution fall under the unified framework. We also use the framework to discuss real-world use cases where one or both types of attributions are required. We believe that this unified framework will guide the use-case-driven development of systems that leverage both types of attribution, as well as the standardization of their evaluation.
Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization (Elan Rosenfeld)
Abstract: We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics and demonstrates how a small number of training points can have an unusually large effect on a network's optimization trajectory and predictions. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong "opposing signals": consistent, large-magnitude features which dominate the network output and occur in both groups with similar frequency. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We complement these experiments with a theoretical analysis of a two-layer linear network on a simple model of opposing signals. Our finding enables new qualitative predictions of behavior during and after training, which we confirm experimentally. It also provides a new lens through which to study how specific data influence the learned parameters.
Successor Heads: Recurring, Interpretable Attention Heads In The Wild (Rhys Gould)
Abstract: In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into 'Tuesday'. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models, and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs ranging from as few as 31 million parameters to at least 12 billion parameters, including GPT-2, Pythia, and Llama-2. We find a set of 'mod 10' features that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.
Attributing Learned Concepts in Neural Networks to Training Data (Nicholas Konz)
Abstract: By now there is substantial evidence that deep learning models learn certain human-interpretable features as part of their internal representations of data. As having the right (or wrong) concepts is critical to trustworthy machine learning systems, it is natural to ask which inputs from the model's original training set were most important for learning a concept at a given layer. To answer this, we combine data attribution methods with methods for probing the concepts learned by a model. We train network and probe ensembles for two concept datasets across a range of network layers, using the recently developed TRAK method for large-scale data attribution. We find some evidence for convergence: removing the 10,000 top-attributing images for a concept and retraining the model changes neither the location of the concept in the network nor the probing sparsity of the concept. This suggests that rather than being highly dependent on a few specific examples, the features that inform the development of a concept are spread in a more diffuse manner across its exemplars, implying robustness in concept formation.
Poster session #1
What Neural Networks Memorize and Why (Vitaly Feldman)
Abstract: Deep learning algorithms tend to fit the entire training dataset (nearly) perfectly including mislabeled examples and outliers. In addition, in extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). We provide a simple conceptual explanation and a theoretical model demonstrating that memorization of labels is necessary for achieving close-to-optimal generalization error when learning from long-tailed data distributions. We also describe natural prediction problems for which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when most of that information is ultimately irrelevant to the task at hand. Finally, we demonstrate the utility of memorization and support our explanation empirically. These results rely on a new technique for efficiently estimating memorization and influence of training data points.
Evaluation Beyond Task Performance (Milad Nasr)
Abstract: As we increasingly release and productionize machine learning models, we focus primarily on their performance on a suite of downstream benchmarking tasks. However, improved performance on these benchmarks does not equate to universal improvement. In this talk, we discuss evaluations that live on a wholly separate axis. In particular, we show that as models get larger, more memorized training examples appear in their outputs. These issues are not random artifacts that can be solved by scaling models, nor can they easily be prevented in production models.
Coffee break
Poster session #2
Understanding LLMs via their Generative Successes and Shortcomings (Swabha Swayamdipta)
Abstract: Generative capabilities of large language models have grown beyond the wildest imagination of the broader AI research community, leading many to speculate whether these successes may be attributed to the training data or to other factors concerning the model itself. At the same time, however, LLMs continue to exhibit many shortcomings, which might contain important clues for understanding their behavior as well as attribution. I will present some work from my group which has revealed unique successes and shortcomings in the generative capabilities of LLMs on knowledge-oriented tasks, tasks with human and social utility, and tasks that reveal more than surface-level understanding of language. I will end with a brief discussion of the implications for attribution in the peculiar domain that natural language occupies.
Closing remarks & Poster session #3



Program Committee

Cem Anil, Alexander Atanasov, Juhan Bae, Anna Bair, Samyadeep Basu, Atoosa Chegini, Catherine Chen, Benjamin Cohen-Wang, Jean-Stanislas Denain, Lucas Dixon, Lisa Dunlap, Eve Fleisig, Saurabh Garg, Kristian Georgiev, Nate Gruver, Arushi Gupta, Evan Hernandez, Aspen Hopkins, Saachi Jain, Alaa Khaddaj, Keller Jordan, Polina Kirichenko, Dongyue Li, Weixin Liang, Sanae Lotfi, Benjamin Newman, Tai Nguyen, Nikhil Prakash, Nikunj Saunshi, Rylan Schaeffer, Melanie Sclar, Zineng Tang, Joshua Vendrow, Evan Vogelbaum, Haotian Ye, Mert Yuksekgonul, Lily Zhang