ATTRIB 2024 Workshop

Call for Papers

Submissions open August 1st!

We are soliciting papers along two tracks:

Main track papers: 3-6 page submissions on attributing model behaviors (see below for example topics).
Idea track papers: 2-4 page submissions, on a specific topic or on the field of attribution as a whole. Vision papers, unifying ideas, documentation of failed experiments are all welcome. Papers in this track will be held to the same standard as the main track, but can be opinionated (so long as opinions are clearly demarcated from facts) and need not have experiments (although lack of experimentation should be justified).

Along these tracks, we welcome submissions pertaining to any aspect of model behavior attribution. For example:

Data: Models are trained on large-scale datasets collected from disparate (and often arbitrarily chosen) sources. How can we understand how the composition training data affects model behavior? This includes:
- Data attribution and selection: How can we (efficiently) attribute model outputs back to specific training examples? How can we select data to optimize downstream performance/capabilities?
- Data leakage/contamination: How can we monitor and fix data leakage at internet scale? How do data feedback loops (e.g., training on LLM-generated outputs) influence model biases?
Trained models: Large models remain black boxes—how do we attribute a model's behavior to its subcomponents? Directions include:
- Mechanistic interpretability: How do individual neurons combine to yield model predictions?
- Concept-based interpretability: Can we attribute predictions to human-identifiable concepts? Can we attribute these concepts or other biases to subnetworks inside a DNN?
Learning algorithms: Designing a ML model involves dozens of choices, ranging from choice of model architecture, optimizer, to learning algorithm. How do these choices influence model behavior? For example, exploring issues such as:
- Understanding algorithmic choices: How do algorithmic choices affect model capabilities? What parts of model behavior can we attribute to specific algorithmic choices?
- Scalings laws/emergence: What emergent capabilities (if any) can we actually attribute to scale alone?

Schedule

Please note the starting time (not the same as Whova!)

Conference Schedule

9:00am-9:20am

Welcome and Opening Remarks

9:30am-10:00am

Invited Talk: Surbhi Goel

10:00am-10:30am

Invited Talk: Sanmi Koyejo

10:30am-11:05am

Contributed talks

On Linear Representations and Pretraining Data Frequency in Language Models

Authors: Jack Merullo, Sarah Wiegreffe, Yanai Elazar
Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data and downstream task behavior, we look at the effect on LM representations. Previous work has discovered that, in language models, some concepts are encoded as ``linear representations'' argued to be highly interpretable and useful for controllable generation. We study the connection between differences in pretraining data frequency and differences in trained models' linear representations of factual recall relations. We find evidence that the two are directly linked, with the formation of linear representations strongly connected to pretraining term frequencies. First, we establish that the presence of linear representations for subject-relation-object-formatted facts is highly correlated with both subject-object co-occurrence frequency and in-context learning accuracy. This is the case across all phases of pretraining, i.e., it is not affected by the model's underlying capability. In OLMo 7B and GPT-J (6B), we find that a linear representation forms predictably when the subjects and objects within a relation co-occur at least 1--2k times. Thus, it appears linear representations form as a result of consistent repeated occurrences, not due to lengthy pretraining time. In the OLMo 1B model, formation of these features only occurs after 4.4k occurrences. Finally, we train a regression model on measurements of linear representation robustness that can predict how often a term was seen in pretraining with low error, which generalizes to GPT-J without additional training, providing a new unsupervised method for exploring how possible data sources of closed-source models. We conclude that the presence/absence of linear representations contain a weak but significant signal that reflects an imprint of the pretraining corpus across LMs.

When Attention Sink Emerges in Language Models: An Empirical View

Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
Abstract: Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters.

Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations

Authors: Manuel Quintero, William T. Stephenson, Advik Shreekumar, Tamara Broderick
Abstract: In science and social science, we often wish to explain why an outcome is different in two populations. For instance, if a jobs program benefits members of one city more than another, is that due to differences in program participants (particular covariates) or the local labor markets (outcomes given covariates)? The Kitagawa-Oaxaca-Blinder (KOB) decomposition is a standard tool in econometrics that explains the difference in the mean outcome across two populations. However, the KOB decomposition assumes a linear relationship between covariates and outcomes, while the true relationship may be meaningfully nonlinear. Modern machine learning boasts a variety of nonlinear functional decompositions for the relationship between outcomes and covariates in one population. It seems natural to extend the KOB decomposition using these functional decompositions. We observe that a successful extension should not attribute the differences to covariates — or, respectively, outcomes given covariates — if those are the same in the two populations. Unfortunately, we demonstrate that, even in simple examples, two common decompositions — the functional ANOVA and Accumulated Local Effects — can attribute differences to outcomes given covariates, even when they are identical in two populations. We provide and partially prove a conjecture that this misattribution arises in any additive decomposition that depends on the distribution of covariates.

U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models

Authors: Tung-Yu Wu, Melody Lo
Abstract: Large language models (LLMs) have been shown to exhibit emergent abilities in some downstream tasks, where performance seems to stagnate at first and then improve sharply and unpredictably with scale beyond a threshold. By dividing questions in the datasets according to difficulty level by average performance, we observe U-shaped scaling for hard questions, and inverted-U scaling followed by steady improvement for easy questions. Moreover, the emergence threshold roughly coincides with the point at which performance on easy questions reverts from inverse scaling to standard scaling. Capitalizing on the observable though opposing scaling trend on easy and hard questions, we propose a simple yet effective pipeline, called Slice-and-Sandwich, to predict both the emergence threshold and model performance beyond the threshold.

11:05am-11:50am

Panel

11:50am-1:00pm

Lunch

1:00pm-2:00pm

Poster session #1

2:00pm-2:30pm

Invited Talk: Baharan Mirzasoleiman

2:30pm-3:00pm

Invited Talk: Robert Geirhos

3:00pm-3:30pm

Coffee break