ATTRIB 2024 (@ NeurIPS)

2nd Workshop on Attributing Model Behavior at Scale

Saturday December 14, 2024

Vancouver Convention Center (Meeting 205 - 207)

Contact info: attribworkshop [at] gmail [dot] com

Submissions: OpenReview

What makes ML models tick? How do we attribute model behavior to the training data, algorithm, architecture, or scale used in training?


Recently developed algorithmic innovations and large-scale datasets have given rise to machine learning models with impressive capabilities. However, there is still much to understand about how these different factors combine to give rise to observed behaviors. For example, we do not yet fully understand how the composition of training datasets influences downstream model capabilities, how to attribute model capabilities to subcomponents inside the model, and which algorithmic choices really drive performance.

A common theme underlying all these challenges is model behavior attribution: the need to tie model behavior back to factors in the machine learning pipeline that we can control or reason about, such as the choice of training dataset or the particular training algorithm. This workshop aims to bring together researchers and practitioners with the goal of advancing our understanding of model behavior attribution.

Call for Papers

Submissions open August 1st!
We are soliciting papers along two tracks, a main track and an idea track. Along both tracks, we welcome submissions pertaining to any aspect of model behavior attribution.

Submission Instructions

  1. Format submissions as follows:
    • 3-6 pages (main track) or 2-4 pages (idea track)
    • NeurIPS 2024 paper formatting (use the official NeurIPS 2024 style files)
    • Appendix included in the same PDF as the main body
    • No Appendix page limit
  2. When ready, submit to OpenReview (note our workshop is non-archival; published papers are fine to submit provided they meet the formatting requirements above)

Important Dates


August 1: Submission portal opens

September 25 (AOE): Deadline for both idea and main track papers

October 4 (AOE): Final deadline for papers (if at least one author agrees to serve as an emergency reviewer)

October 10: Decision notifications

December 14: Workshop!

Schedule

Please note the starting time (not the same as Whova!)

Conference Schedule
9:00am-9:20am    Welcome and Opening Remarks
9:30am-10:00am   Invited Talk: Surbhi Goel
10:00am-10:30am  Invited Talk: Sanmi Koyejo
10:30am-11:05am  Contributed talks
On Linear Representations and Pretraining Data Frequency in Language Models
Authors: Jack Merullo, Sarah Wiegreffe, Yanai Elazar
Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data and downstream task behavior, we look at the effect on LM representations. Previous work has discovered that, in language models, some concepts are encoded as "linear representations" argued to be highly interpretable and useful for controllable generation. We study the connection between differences in pretraining data frequency and differences in trained models' linear representations of factual recall relations. We find evidence that the two are directly linked, with the formation of linear representations strongly connected to pretraining term frequencies. First, we establish that the presence of linear representations for subject-relation-object-formatted facts is highly correlated with both subject-object co-occurrence frequency and in-context learning accuracy. This is the case across all phases of pretraining, i.e., it is not affected by the model's underlying capability. In OLMo 7B and GPT-J (6B), we find that a linear representation forms predictably when the subjects and objects within a relation co-occur at least 1-2k times. Thus, it appears linear representations form as a result of consistent repeated occurrences, not due to lengthy pretraining time. In the OLMo 1B model, formation of these features only occurs after 4.4k occurrences. Finally, we train a regression model on measurements of linear representation robustness that can predict how often a term was seen in pretraining with low error, which generalizes to GPT-J without additional training, providing a new unsupervised method for exploring the possible data sources of closed-source models. We conclude that the presence/absence of linear representations contains a weak but significant signal that reflects an imprint of the pretraining corpus across LMs.
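The frequency-threshold finding above lends itself to a small illustration. Below is a minimal sketch, not the authors' pipeline: the document iterable, fact list, and naive string matching are illustrative assumptions, and only the 1-2k (and 4.4k) thresholds come from the abstract.

```python
from collections import Counter

def cooccurrence_counts(documents, facts):
    """Count, for each (subject, object) pair, how many documents mention
    both strings. Naive string matching; the paper's counting is more
    careful (tokenization, windows, deduplication)."""
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        for subj, _relation, obj in facts:
            if subj.lower() in text and obj.lower() in text:
                counts[(subj, obj)] += 1
    return counts

def likely_linear_representation(counts, threshold=2000):
    """Flag pairs that clear the ~1-2k co-occurrence threshold the abstract
    reports for OLMo 7B / GPT-J (OLMo 1B needs roughly 4.4k)."""
    return {pair: n >= threshold for pair, n in counts.items()}

# Hypothetical toy inputs.
docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
facts = [("France", "capital", "Paris"), ("Germany", "capital", "Berlin")]
print(likely_linear_representation(cooccurrence_counts(docs, facts)))
```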
When Attention Sink Emerges in Language Models: An Empirical View
Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
Abstract: Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters.
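As a rough illustration of the quantity being measured, here is a minimal sketch assuming a Hugging Face transformers causal LM (GPT-2 is only a stand-in for the models the paper studies): it reports, per layer, the average attention probability placed on the first token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for the LMs studied in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("Attention sinks are easy to measure.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one (batch, heads, query, key) tensor per layer.
for layer, attn in enumerate(out.attentions):
    # Average, over heads and query positions, the probability mass placed on
    # key position 0, the token that typically becomes the attention sink.
    sink_mass = attn[..., 0].mean().item()
    print(f"layer {layer:2d}: mean attention on first token = {sink_mass:.3f}")
```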
Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations
Authors: Manuel Quintero, William T. Stephenson, Advik Shreekumar, Tamara Broderick
Abstract: In science and social science, we often wish to explain why an outcome is different in two populations. For instance, if a jobs program benefits members of one city more than another, is that due to differences in program participants (particular covariates) or the local labor markets (outcomes given covariates)? The Kitagawa-Oaxaca-Blinder (KOB) decomposition is a standard tool in econometrics that explains the difference in the mean outcome across two populations. However, the KOB decomposition assumes a linear relationship between covariates and outcomes, while the true relationship may be meaningfully nonlinear. Modern machine learning boasts a variety of nonlinear functional decompositions for the relationship between outcomes and covariates in one population. It seems natural to extend the KOB decomposition using these functional decompositions. We observe that a successful extension should not attribute the differences to covariates — or, respectively, outcomes given covariates — if those are the same in the two populations. Unfortunately, we demonstrate that, even in simple examples, two common decompositions — the functional ANOVA and Accumulated Local Effects — can attribute differences to outcomes given covariates, even when they are identical in two populations. We provide and partially prove a conjecture that this misattribution arises in any additive decomposition that depends on the distribution of covariates.
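For context on the baseline the abstract extends, here is a minimal sketch of the standard two-fold Kitagawa-Oaxaca-Blinder decomposition under a linear model; the toy data at the bottom is made up. In the toy example both populations share the same covariate distribution, so the whole gap should land in the outcomes-given-covariates term, which is exactly the sanity check the abstract argues common nonlinear decompositions can fail.

```python
import numpy as np

def kob_decomposition(X_a, y_a, X_b, y_b):
    """Two-fold KOB: mean(y_a) - mean(y_b) = covariates part + coefficients part,
    using population B's coefficients as the reference."""
    Xa = np.column_stack([np.ones(len(X_a)), X_a])  # intercept so mean(y) = mean(X) @ beta
    Xb = np.column_stack([np.ones(len(X_b)), X_b])
    beta_a, *_ = np.linalg.lstsq(Xa, y_a, rcond=None)
    beta_b, *_ = np.linalg.lstsq(Xb, y_b, rcond=None)
    xbar_a, xbar_b = Xa.mean(axis=0), Xb.mean(axis=0)

    covariates_part = (xbar_a - xbar_b) @ beta_b    # attributable to differences in covariates
    coefficients_part = xbar_a @ (beta_a - beta_b)  # attributable to outcomes given covariates
    assert np.isclose(y_a.mean() - y_b.mean(), covariates_part + coefficients_part)
    return covariates_part, coefficients_part

# Made-up toy data: identical covariates, different outcome relationships.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y_a = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=1000)
y_b = 0.5 + X @ np.array([1.0, -1.0]) + rng.normal(scale=0.1, size=1000)
print(kob_decomposition(X, y_a, X, y_b))  # covariates part ~ 0, as it should be
```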
U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models
Authors: Tung-Yu Wu, Melody Lo
Abstract: Large language models (LLMs) have been shown to exhibit emergent abilities in some downstream tasks, where performance seems to stagnate at first and then improve sharply and unpredictably with scale beyond a threshold. By dividing questions in the datasets according to difficulty level by average performance, we observe U-shaped scaling for hard questions, and inverted-U scaling followed by steady improvement for easy questions. Moreover, the emergence threshold roughly coincides with the point at which performance on easy questions reverts from inverse scaling to standard scaling. Capitalizing on the observable though opposing scaling trend on easy and hard questions, we propose a simple yet effective pipeline, called Slice-and-Sandwich, to predict both the emergence threshold and model performance beyond the threshold.
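Below is a minimal sketch of the kind of difficulty-sliced scaling analysis described above (this is not the authors' Slice-and-Sandwich pipeline; the accuracy matrix and model scales are random placeholders standing in for real benchmark results).

```python
import numpy as np

# Hypothetical placeholders: acc[i, j] = accuracy of the i-th model on question j,
# scales[i] = that model's parameter count. Real benchmark results would go here.
rng = np.random.default_rng(0)
scales = np.array([1e8, 3e8, 1e9, 3e9, 1e10, 3e10])
acc = rng.uniform(0.0, 1.0, size=(len(scales), 200))

difficulty = acc.mean(axis=0)               # per-question average accuracy across scales
easy = difficulty >= np.median(difficulty)  # median split into easy / hard slices

for name, mask in [("easy", easy), ("hard", ~easy)]:
    group_acc = acc[:, mask].mean(axis=1)   # slice accuracy at each model scale
    trend = np.polyfit(np.log10(scales), group_acc, deg=2)  # quadratic trend in log-scale
    print(f"{name}: quadratic trend coefficients {np.round(trend, 3)}")
```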
11:05am-11:50am  Panel
11:50am-1:00pm   Lunch
1:00pm-2:00pm    Poster session #1
2:00pm-2:30pm    Invited Talk: Baharan Mirzasoleiman
2:30pm-3:00pm    Invited Talk: Robert Geirhos
3:00pm-3:30pm    Coffee break
3:00pm-4:30pm    Poster session #2
4:30pm-5:00pm    Invited Talk: Seong Joon Oh
5:00pm-5:15pm    Closing remarks

Speakers

Organizers