# Sparse Attention Post-Training for Mechanistic Interpretability
**Authors**: Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf
## Abstract
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4\%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100× fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
Machine Learning, ICML
## 1 Introduction
Scaling has driven major advances in artificial intelligence, with ever-larger models trained on internet-scale datasets achieving remarkable capabilities across domains. Large language models (LLMs) now underpin applications from text generation to question answering, yet their increasing complexity renders their internal mechanisms largely opaque (Bommasani et al., 2021). Methods of mechanistic interpretability have been developed to address this gap by reverse-engineering neural networks to uncover how internal components implement specific computations and behaviors. Recent advances in this area have successfully identified interpretable circuits, features, and algorithms within LLMs (Nanda et al., 2023; Olsson et al., 2022), showing that large complex models can, in part, be understood mechanistically, opening avenues for improving transparency, reliability, and alignment (Bereska and Gavves, 2024).
<details>
<summary>x1.png Details</summary>

### Visual Description
Two stacked panels compare a dense "Base Model" and a "Sparse Model", each a 4-layer network (layers labelled "Layer 0" to "Layer 3") processing the input sequence `3 6 + 2 8 = ? ? ? ? ?` and producing the output `3 6 + 2 8 = 0 0 0 6 4`. In the base model, a dense web of light blue connections spans all layers, with several darker, more prominent pathways distributed throughout. A curved arrow labelled "Sparsity-Regularised Finetuning" points from the base model panel to the sparse model panel. In the sparse model, nearly all connections have been pruned: a small fan of dark blue edges runs from the input digits directly into Layer 0, plus one isolated edge between Layer 1 and Layer 2 on the far right. Both models produce the identical correct output, illustrating that the sparse model retains the functional capability of the dense base model while exposing a minimal, localised computational circuit.
</details>
Figure 1: Visualised attention patterns for a 4-layer toy model trained on a simple 2-digit addition task. The main idea of this work is to induce sparse attention between tokens via a post-training procedure that optimizes for attention sparsity while maintaining model performance. In this example, while both models are able to correctly predict the sum, the sparse model solves the problem with a naturally interpretable circuit. Details of this toy setup and more examples are provided in Appendix A.
However, interpretability is bottlenecked by the model itself: even with sophisticated reverse-engineering techniques that can faithfully reveal internal algorithms, the underlying computations implemented by large models can still remain highly complex and uninterpretable. Circuits for seemingly simple tasks may span hundreds of interacting attention heads and MLPs with densely intertwined contributions across layers (Conmy et al., 2023), and features can influence each other along combinatorially many attention-mediated paths, complicating attention attribution (Kamath et al., 2025). To exemplify this, Figure 1 (top) illustrates the attention patterns of a small, single-head transformer trained on a simple two-digit addition task. Here, the model has learned to solve the task in a highly diffused manner, where information about each token is dispersed across all token locations, rendering the interpretation of the underlying algorithm extremely difficult even in this simple case.
The crux of the problem is that models are not incentivised to employ simple algorithms during training. In this work, we advocate for directly embedding interpretability constraints into model design in a way that induces simple circuits while preserving performance. We focus our analysis on attention mechanisms and investigate sparsity regularisation on attention patterns, originally proposed in (Lei et al., 2025), as an inductive bias. To demonstrate how sparse attention patterns can give rise to interpretable circuits, we return to the two-digit addition example: Figure 1 (bottom) shows the attention patterns induced by penalising attention edges during training. Here, the sparsity inductive bias forces the model to solve the problem with much smaller, intrinsically interpretable computation circuits.
In this work, we investigate using this sparsity regularisation scheme as a post-training strategy for pre-trained LLMs. We propose a practical method for fine-tuning existing models without re-running pretraining, offering a flexible way to induce sparse attention patterns and enhance interpretability. We show, on models of up to 7B parameters, that our proposed procedure preserves the performance of the base models on pretraining data while reducing the effective attention map to less than $0.5\%$ of its edges. To evaluate our central hypothesis that sparse attention facilitates interpretability, we consider two complementary settings. First, we study circuit discovery, where the objective is to identify the minimal set of components responsible for task performance (Conmy et al., 2023). We find that sparsified models yield substantially simpler computational graphs: the resulting circuits explain model behaviour using up to four times fewer attention heads and up to two orders of magnitude fewer edges. Second, using cross-layer transcoders (Ameisen et al., 2025), we analyse attribution graphs, which capture feature-level interactions across layers. In this setting, sparse attention mitigates the attention attribution problem by making it possible to identify which attention heads give rise to a given edge, owing to the reduced number of components mediating each connection. We argue that this clarity enables a tighter integration of feature-based and circuit-based perspectives, allowing feature interactions to be understood through explicit, tractable circuits. Taken together, these results position attention sparsity as an effective and practical inductive tool for surfacing the minimal functional backbone underlying model behaviour.
## 2 Related Work
### 2.1 Sparse Attention
As self-attention is a key component of the ubiquitous Transformer architecture, a large number of variants of attention mechanisms have been explored in the literature. Related to our approach are sparse attention methods, which are primarily designed to alleviate the quadratic scaling of vanilla self-attention. These methods typically rely on masks based on fixed local and strided patterns (Child et al., 2019) or sliding-window and global attention patterns (Beltagy et al., 2020; Zaheer et al., 2020) to constrain the receptive field of each token. While these approaches are successful in reducing the computational complexity of self-attention, they require hand-defined heuristics that do not reflect the internal computations learned by the model.
Beyond these fixed-pattern sparse attention methods, Top-$k$ attention, which enforces sparsity by dynamically selecting the $k$ most relevant keys per query based on their attention scores, has also been explored (Gupta et al., 2021; DeepSeek-AI, 2025). While Top-$k$ attention enables learnable sparse attention, the necessity to specify $k$ limits its scope for interpretability for two reasons. First, selecting the optimal $k$ is difficult, and setting $k$ too low can degrade model performance. Second, and more fundamentally, Top-$k$ attention does not allow the model to choose different $k$ for different attention heads based on the context. We argue that this flexibility is crucial for maintaining model performance.
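For concreteness, Top-$k$ attention amounts to keeping only the $k$ largest scores per query and renormalising. The following is a minimal NumPy sketch of this masking step, illustrative only (it is not the method used in this work, and omits multi-head and causal structure):

```python
import numpy as np

def topk_attention(scores: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest attention scores per query row, softmax over those."""
    masked = np.full_like(scores, -np.inf)
    # indices of the k largest entries in each row
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    np.put_along_axis(masked, idx, np.take_along_axis(scores, idx, axis=-1), axis=-1)
    # softmax over the surviving entries; exp(-inf) = 0 kills the rest
    masked -= masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)
```

Note that $k$ is a single global hyperparameter here, which is exactly the limitation discussed above: every query in every head is forced to keep the same number of edges.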
More recently, gated attention mechanisms (Qiu et al., 2025) provide a scalable and performant framework for inducing sparse attention. In particular, Lei et al. (2025) introduce a sparsity regularisation scheme for world modelling that reveals sparse token dependencies. We adopt this method and examine its role as an inductive bias for interpretability.
### 2.2 Circuit Discovery
Mechanistic interpretability seeks to uncover how internal components of LLMs implement specific computations. Ablation studies assess performance drops from removing components (Nanda et al., 2023), activation patching measures the effect of substituting activations (Zhang and Nanda, 2023), and attribution patching scales this approach via local linearisation (Syed et al., 2024). Together, these approaches allow researchers to isolate sub-circuits, minimal sets of attention heads and MLPs that are causally responsible for a given behavior or task (Conmy et al., 2023). Attention itself plays a dual role: it both routes information and exposes interpretable relational structure, making it a key substrate for mechanistic study. Our work builds on this foundation by leveraging sparsity to simplify these circuits, amplifying the interpretability of attention-mediated computation while preserving model performance.
### 2.3 Attribution Graph
Mechanistic interpretability has gradually shifted from an emphasis on explicit circuit discovery towards the analysis of internal representations and features. Recent work on attribution graphs and circuit tracing seeks to reunify these perspectives by approximating MLP outputs as sparse linear combinations of features and computing causal effects along linear paths between them (Dunefsky et al., 2024; Ameisen et al., 2025; Lindsey et al., 2025b). This framework enables the construction of feature-level circuits spanning the computation from input embeddings to final token predictions. Within attribution graphs, edges correspond to direct linear causal relationships between features. However, these relationships are mediated by attention heads that transmit information across token positions. Identifying which attention heads give rise to a particular edge, and understanding why they do so, is essential, as this mechanism forms a fundamental component of the computational graph (Kamath et al., 2025). A key limitation of current attribution-based approaches is that individual causal edges are modulated by dozens of attention components. We show that this leads to feature-to-feature influences that are overly complex, rendering explanations in terms of other features in the graph both computationally expensive and conceptually challenging.
## 3 Method
Our main hypothesis is that post-training existing LLMs to encourage sparse attention patterns leads to the emergence of more interpretable circuits. In order to instantiate this idea, we require a post-training pipeline that satisfies three main desiderata:
1. To induce sparse message passing between tokens, we need an attention mechanism that can "zero out" attention edges, which in turn enables effective $L_{0}$-regularisation on the attention weights. This is in contrast to the standard softmax attention mechanism, where naive regularisation would result in small but non-zero attention weights that still allow information flow between tokens.
2. The model architecture needs to be compatible with the original LLM such that the pre-trained LLM weights can be directly loaded at initialisation.
3. The post-training procedure needs to ensure that the post-trained models do not lose prediction performance compared to their fully-connected counterparts.
To this end, we leverage the Sparse Transformer architecture in the SPARTAN framework proposed in (Lei et al., 2025), which uses sparsity-regularised hard attention instead of the standard softmax attention. In the following subsections, we describe the Sparse Transformer architecture and the optimisation setup, highlighting how this approach satisfies the above desiderata.
### 3.1 Sparse Attention Layer
Given a set of token embeddings, the Sparse Transformer layer computes the key, query, and value embeddings, $\{k_{i},q_{i},v_{i}\}$ , via linear projections, analogous to the standard Transformer. Based on the embeddings, we sample a binary gating matrix from a learnable distribution parameterised by the keys and queries,
$$
A_{ij}\sim\mathrm{Bern}(\sigma(q_{i}^{T}k_{j})), \tag{1}
$$
where $\mathrm{Bern}(\cdot)$ is the Bernoulli distribution and $\sigma(\cdot)$ is the logistic sigmoid function. This sampling step can be made differentiable via the Gumbel Softmax trick (Jang et al., 2017). This binary matrix acts as a mask that controls the information flow across tokens. Next, the message passing step is carried out in the same way as standard softmax attention, with the exception that we mask out the value embeddings using the sampled binary mask,
$$
\mathrm{SparseAttn}(Q,K,V)=\bigg[A\odot\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})\bigg]V, \tag{2}
$$
where $d_{k}$ is the dimension of the key embeddings and $\odot$ denotes element-wise multiplication. During training, we regularise the expected number of edges between tokens based on the distribution over the gating matrix. Concretely, the expected number of edges for each layer can be calculated as
$$
\mathbb{E}\big[|A|\big]=\sum_{i,j}\sigma(q^{T}_{i}k_{j}). \tag{3}
$$
Note that during the forward pass, each entry of $A$ is a hard binary sample that zeros out attention edges, which serves as an effective $L_{0}$ regularisation. Moreover, since the functional form of the sparse attention layer after the hard sampling step is the same as standard softmax attention, pre-trained model weights can be directly used without alterations. Strictly speaking, the stochastically sampled $A$ does perturb the computation at initialisation, since some gates may start closed. This can be mitigated by adding a positive bias term inside the sigmoid function to ensure all gates are open at initialisation. Experimentally, we found this to be unnecessary, as the models quickly recover their original performance within a small number of gradient steps.
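The forward pass of Equations (1)-(3) can be sketched as follows in NumPy. This is an illustrative single-head sketch only: it samples the hard gates directly and omits the Gumbel-Softmax relaxation needed for gradient flow, as well as multi-head, causal, and batching structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_attention(q, k, v):
    """Single-head forward pass of Eqs. (1)-(2): sample binary gates
    A ~ Bern(sigma(q k^T)), then mask the usual softmax attention with A."""
    d_k = q.shape[-1]
    p_open = sigmoid(q @ k.T)                              # gate probabilities
    A = (rng.random(p_open.shape) < p_open).astype(float)  # hard 0/1 samples
    scores = q @ k.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)           # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    w = A * w                                              # zero out closed edges
    expected_edges = p_open.sum()                          # Eq. (3), the L0 surrogate
    return w @ v, w, expected_edges
```

In training, `expected_edges` is the differentiable quantity that gets regularised, while the hard samples `A` guarantee that closed edges pass no information at all.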
### 3.2 Constrained Optimisation
In order to ensure that the models do not lose prediction performance during the post-training procedure, as per desideratum 3, we follow the approach proposed in (Lei et al., 2025), which employs the GECO algorithm (Rezende and Viola, 2018). Originally developed in the context of regularising VAEs, the GECO algorithm places a constraint on the performance of the model and uses a Lagrangian multiplier to automatically find the right strength of regularisation during training. Concretely, we formulate the learning process as the following optimisation problem,
$$
\min_{\theta}\sum_{l}\mathbb{E}\big[|A_{l}|\big]\qquad \text{s.t.}\quad CE\leq\tau, \tag{4}
$$
where $A_{l}$ denotes the gating matrix at layer $l$, $CE$ is the standard next-token prediction cross-entropy loss, $\tau$ is the required target loss, and $\theta$ denotes the model parameters. In practice, we set this target as the loss of the pre-trained baseline models. We solve this optimisation problem via Lagrangian relaxation, yielding the following max-min objective,
$$
\max_{\lambda>0}\min_{\theta}\bigg[\sum_{l}\mathbb{E}\big[|A_{l}|\big]+\lambda(CE-\tau)\bigg]. \tag{5}
$$
This can be solved by taking gradient steps on $\theta$ and $\lambda$ alternately. During training, updating $\lambda$ automatically balances the strength of the sparsity regularisation: when $CE$ is lower than the threshold, $\lambda$ decreases, and hence more weight is given to the sparsity regularisation term. This effectively acts as an adaptive schedule which continues to increase the strength of the regularisation until the model performance degrades. Here, the value of $\tau$ is selected as a hyperparameter to ensure that the sparse model's performance remains within a certain tolerance of the original base model. In practice, the choice of $\tau$ controls a trade-off between sparsity and performance: picking a tight $\tau$ can lead to a slower training process, whereas a higher tolerance can substantially speed up training at the cost of potentially harming model performance. In Appendix C, we provide further discussion on this optimisation process and its training dynamics.
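The alternating primal-descent/dual-ascent updates for Equation (5) can be illustrated on a toy problem, with $f(\theta)=\theta$ standing in for the expected edge count and $(\theta-2)^2$ for the cross-entropy; this is our own illustrative sketch, and the actual GECO schedule additionally smooths the constraint with a moving average:

```python
import numpy as np

# Toy instance of Eq. (5): minimise an "edge count" proxy f(theta) = theta
# subject to a "cross-entropy" proxy ce(theta) = (theta - 2)^2 <= tau.
tau = 1.0
theta, lam, lr = 0.0, 0.0, 0.01
for _ in range(20000):
    ce = (theta - 2.0) ** 2
    # gradient of the Lagrangian f + lam * (ce - tau) w.r.t. theta
    grad_theta = 1.0 + lam * 2.0 * (theta - 2.0)
    theta -= lr * grad_theta                  # primal descent step
    lam = max(0.0, lam + lr * (ce - tau))     # dual ascent step, lambda >= 0
# Analytic solution: theta* = 2 - sqrt(tau) = 1, lambda* = 1 / (2 sqrt(tau)) = 0.5
```

The dual variable rises whenever the constraint is violated (`ce > tau`) and decays once it is satisfied, which is precisely the adaptive-schedule behaviour described above.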
### 3.3 Practical Considerations
One of the main strengths of our proposed method is that, architecturally, the only difference between a sparse Transformer and a normal one lies in how the dot-product attention is computed. As such, most practical training techniques for optimising Transformers can be readily adapted to our setting. In our experiments, we find the following techniques helpful for improving computational efficiency and training stability.
#### LoRA finetuning (Hu et al., 2022)
Low rank finetuning techniques can significantly reduce the computational requirements for training large models. In our experiments, we verify on a 7B parameter model that LoRA finetuning is sufficiently expressive for inducing sparse attention patterns.
#### FlashAttention (Dao, 2023)
FlashAttention has become a standard method for reducing the memory footprint of dot-product attention mechanisms. In Appendix B, we discuss how the sampled sparse attention can be implemented in an analogous manner.
#### Distillation (Gu et al., 2024)
Empirically, we find that adding an auxiliary distillation loss based on the KL divergence between the base model and the sparse model improves training stability and ensures that the behaviour of the model remains unchanged during post-training.
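One plausible form of such a distillation term (the exact formulation is not spelled out here, so this is an assumption) is a per-position KL divergence between the base model's and the sparse model's next-token distributions, computed from logits:

```python
import numpy as np

def kl_distill_loss(base_logits: np.ndarray, sparse_logits: np.ndarray) -> float:
    """KL(base || sparse) averaged over positions, from raw logits.
    Hypothetical sketch of the auxiliary distillation term."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)      # numerical stability
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = log_softmax(base_logits)     # teacher: frozen base model
    log_q = log_softmax(sparse_logits)   # student: sparse model
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())
```

Treating the base model as a fixed teacher keeps the sparse model's output distribution anchored to the original behaviour even while the gates are being pruned.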
<details>
<summary>x2.png Details</summary>

### Visual Description
Grouped bar chart titled "Benchmark Comparison": accuracy (y-axis, 0.0 to 1.0) of OLMo-7B (teal) versus Sparse OLMo-7B (mauve) on four benchmarks (x-axis). Approximate scores (base / sparse): TruthfulQA 0.24 / 0.24, PIQA 0.80 / 0.79, OpenBookQA 0.37 / 0.35, ARC-Easy 0.59 / 0.57. Across all four tasks the sparse model trails the base model by at most a few points, with the relative benchmark difficulty (PIQA easiest, TruthfulQA hardest) identical for both models.
</details>
Figure 2: Comparison of model performance between the base OLMo model and the sparsified model evaluated on the various benchmarks. Across all tasks, the performance of the sparse model remains comparable with the base model despite using substantially fewer attention edges.
## 4 Experiments
To evaluate the effectiveness of our post-training pipeline, we finetune pre-trained LLMs and compare their prediction performance and interpretability before and after applying sparsity regularisation. We perform full finetuning on a GPT-2 base model (Radford et al., 2019) (124M parameters) on the OpenWebText dataset (Gokaslan and Cohen, 2019). To investigate the generality and scalability of our method, we perform LoRA finetuning on the larger OLMo-7B model (Groeneveld et al., 2024) on the Dolma dataset (Soldaini et al., 2024), which is the dataset on which the base model was trained. The GPT-2 model and the OLMo model are trained on sequences of length 64 and 512, respectively. In the following subsections, we first present a quantitative evaluation of model performance and sparsity after sparse post-training. We then conduct two interpretability studies, using activation patching and attribution graphs, to demonstrate that our method enables the discovery of substantially smaller circuits.
### 4.1 Model Performance and Sparsity
We begin by evaluating both performance retention and the degree of sparsity achieved by post-training. We set cross-entropy targets of 3.50 for GPT-2 (base model: 3.48) and 2.29 for OLMo (base model: 2.24). After training, the mean cross-entropy loss for both models remains within $\pm 0.01$ of the target, indicating that the dual optimisation scheme effectively enforces a tight performance constraint. To quantify the sparsity achieved by the models, we evaluate them on the validation split of their respective datasets and compute the mean number of non-zero attention edges per attention head. We find that the sparsified GPT-2 model activates, on average, only 0.22% of its attention edges, while the sparsified OLMo model activates 0.44%, indicating substantial sparsification in both cases. Table 1 provides a summary of the results. To further verify that this drastic reduction in message passing between tokens does not substantially alter model behaviour, we evaluate the sparsified OLMo model on a subset of the benchmarks used to assess the original model. As shown in Figure 2, the sparse model largely retains the performance of the base model across a diverse set of tasks. In sum, our results demonstrate that sparse post-training is effective in consolidating information flow into a small number of edges while maintaining a commensurate level of performance.
| Model | Base CE | Target CE ($\tau$) | Final CE | Active edges |
| --- | --- | --- | --- | --- |
| GPT-2 | 3.48 | 3.50 | 3.501 | 0.22% |
| OLMo | 2.24 | 2.29 | 2.287 | 0.44% |
Table 1: Performance and sparsity of post-trained models. Final cross-entropy losses closely match the specified targets, while attention sparsity is substantially increased.
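The sparsity figures above can be measured by pooling the sampled binary gate matrices over heads and layers and counting the fraction of open edges. A hypothetical measurement helper (simplified: a faithful count for a causal model would restrict to the lower-triangular positions):

```python
import numpy as np

def edge_sparsity(gates: list) -> float:
    """Fraction of open (non-zero) attention edges, pooled over a list of
    sampled per-head binary gate matrices. Illustrative helper only."""
    open_edges = sum(int(np.count_nonzero(A)) for A in gates)
    total = sum(A.size for A in gates)
    return open_edges / total
```

Averaging over a validation set of sequences (rather than a single sample of the stochastic gates) gives the percentages reported in Table 1.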
### 4.2 Circuit Discovery with Activation Patching
<details>
<summary>x3.png Details</summary>

### Visual Description
## Attention Pattern Visualization: GPT-2 vs. Sparse GPT-2
### Overview
The image displays a comparative visualization of attention patterns from two transformer models: a standard "GPT2" model and a "Sparse GPT2" model. The visualization consists of two distinct panels, each containing a grid of small, square subplots. Each subplot represents the attention pattern of a specific attention head within a specific layer of the model. The patterns are depicted as heatmaps where dark blue dots indicate high attention weights between token positions, and white space indicates low or zero attention.
### Components/Axes
* **Panel Titles:** The top panel is labeled **"GPT2"**. The bottom panel is labeled **"Sparse GPT2"**.
* **Subplot Labels:** Each subplot has a unique identifier in its top-left corner, following the format `L[Layer Number]H[Head Number]`. For example, `L0H0` denotes Layer 0, Head 0.
* **Subplot Content:** Each subplot is a square matrix. The x-axis and y-axis of these matrices represent token positions in a sequence (e.g., from the first token to the last). The color intensity at coordinate (i, j) represents the attention weight from token `i` to token `j`.
* **Color Scale:** A monochromatic blue scale is used. Dark blue signifies high attention weight, while white signifies low or zero attention weight. No explicit color bar legend is provided.
### Detailed Analysis
#### **GPT2 Panel (Top)**
This panel contains a large grid of attention head visualizations, arranged in 7 rows (9 subplots in each of the first six rows and 7 in the last, matching the head labels listed below).
**Complete List of Head Labels (Row-wise, Left to Right):**
* **Row 1:** L0H0, L0H2, L5H5, L0H7, L0H1, L0H10, L0H3, L8H5, L1H3
* **Row 2:** L7H6, L3H10, L4H8, L2H11, L5H8, L6H7, L11H0, L6H10, L7H3
* **Row 3:** L3H0, L5H9, L7H1, L2H10, L7H2, L8H10, L7H10, L0H5, L0H9
* **Row 4:** L6H6, L7H11, L2H9, L1H4, L6H11, L3H8, L5H3, L7H7, L1H10
* **Row 5:** L9H5, L6H9, L4H9, L1H2, L11H10, L4H3, L6H1, L5H6, L11H4
* **Row 6:** L3H11, L6H8, L4H7, L0H6, L3H1, L5H7, L10H2, L1H0, L5H1
* **Row 7:** L3H6, L6H3, L11H11, L5H2, L3H2, L2H3, L0H11
**Visual Trend & Pattern Distribution:**
The attention patterns in the standard GPT-2 model are highly diverse:
1. **Diagonal Patterns:** Many heads (e.g., L5H5, L0H1, L0H3, L7H2, L5H6) show a strong, clean diagonal line. This indicates a "local" or "previous-token" attention pattern, where each token attends primarily to itself or the immediately preceding token.
2. **Vertical/Horizontal Lines:** Some heads (e.g., L0H7, L6H7, L3H1) show vertical or horizontal lines. A vertical line means a specific token is attended to by all other tokens (a "global" or "summary" token). A horizontal line means a specific token attends to all other tokens.
3. **Scattered/Diffuse Patterns:** A significant number of heads (e.g., L0H0, L7H6, L3H10, L9H5) exhibit scattered, diffuse attention across the matrix, suggesting more complex, non-local relationships.
4. **Blocky/Clustered Patterns:** Some heads (e.g., L4H8, L2H11, L1H4) show attention concentrated in blocks or clusters, indicating attention within phrases or syntactic units.
#### **Sparse GPT2 Panel (Bottom)**
This panel contains a single row of 9 attention head visualizations.
**Complete List of Head Labels (Left to Right):**
L0H5, L5H1, L4H11, L6H8, L5H5, L1H0, L6H9, L3H4, L5H6
**Visual Trend & Pattern Distribution:**
The attention patterns in the Sparse GPT-2 model are strikingly uniform and distinct from the standard model:
1. **Dominant Diagonal:** **Every single head** (L0H5, L5H1, L4H11, L6H8, L5H5, L1H0, L6H9, L3H4, L5H6) displays a very clean, sharp diagonal line. This is the defining characteristic of this panel.
2. **Sparsity:** The off-diagonal areas are almost entirely white, indicating near-zero attention weights. This visual sparsity is the direct result of the "Sparse" modification, which likely prunes or masks attention connections to enforce this diagonal, local pattern.
3. **Consistency:** There is almost no variation in pattern type across the sampled heads. The sparsity constraint appears to have homogenized the attention behavior towards a strict local focus.
### Key Observations
1. **Pattern Diversity vs. Uniformity:** The most striking contrast is between the high diversity of attention patterns in standard GPT-2 and the extreme uniformity in Sparse GPT-2.
2. **Sparsity Enforcement:** The Sparse GPT-2 visualization provides clear visual evidence of a sparsity mechanism at work, successfully limiting attention to a local, diagonal band.
3. **Layer/Head Specificity:** In the standard GPT-2 panel, patterns are not strictly organized by layer. For example, Layer 0 contains both diagonal (L0H1, L0H3) and diffuse (L0H0) heads. This suggests functional specialization occurs at the head level, not uniformly across a layer.
4. **Potential Redundancy:** The Sparse GPT-2 panel shows multiple heads (e.g., L5H5 and L5H6) with nearly identical diagonal patterns, suggesting potential redundancy in the sparse model's attention heads.
### Interpretation
This visualization is a powerful diagnostic tool for understanding the internal mechanics of transformer models.
* **What the data demonstrates:** It provides direct visual proof of how a "sparse" attention modification alters model behavior. The standard GPT-2 model utilizes a rich repertoire of attention strategies (local, global, clustered, diffuse) to build its representations. In contrast, the Sparse GPT-2 model is constrained to a much simpler, local-only strategy.
* **Relationship between elements:** The two panels serve as a controlled comparison. By showing heads from similar layers (e.g., L5H5 appears in both), the image isolates the effect of the sparsity constraint. The diversity in the top panel is the "baseline" behavior, while the uniformity in the bottom panel is the "constrained" outcome.
* **Implications and Anomalies:**
* **Efficiency vs. Capability:** The sparse model's patterns suggest a potential gain in computational efficiency (fewer non-zero operations) but raise questions about its ability to model long-range dependencies, which are crucial for tasks like coreference resolution or document-level understanding.
* **Interpretability:** The sparse patterns are far easier to interpret (each token focuses only on its immediate context), which could be beneficial for model transparency and debugging.
* **Investigative Question:** A key follow-up question would be: Does this enforced locality hurt the model's performance on downstream tasks that require non-local reasoning? The visualization alone cannot answer this, but it frames the critical hypothesis to test. The homogeneity in the sparse model might also indicate an over-regularization, where useful, diverse attention patterns have been inadvertently suppressed.
</details>
Figure 3: Attention patterns of the heads required to explain 90% of model behaviour on a copy task. The sparse model requires substantially fewer attention heads. Moreover, the selected heads exhibit the characteristic "induction head" pattern: each token attends to a previous token at a fixed relative offset, effectively copying information forward through the sequence, a pattern well known to implement the copy mechanism in transformer models. Equivalent plots for OLMo can be found in Appendix D.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Charts: Comparative Efficiency of Dense vs. Sparse Transformer Models
### Overview
The image displays a set of four line charts arranged horizontally, comparing the performance of dense and sparse versions of two language models (GPT-2 and OLMo-7B) across four different evaluation tasks. The charts plot the "Explained Effect" (y-axis) against the "Number of Heads Kept" (x-axis), demonstrating how model performance scales with the number of attention heads retained after pruning. Each chart includes an annotation indicating a speedup factor.
### Components/Axes
* **Chart Titles (Top, Left to Right):** "Greater Than", "IOI", "Docstring", "IOI Long".
* **Y-Axis (All Charts):** Label: "Explained Effect". Scale: 0.0 to 1.0, with major ticks at 0.0, 0.5, and 1.0.
* **X-Axis (All Charts):** Label: "Number of Heads Kept".
* Charts 1 & 2 ("Greater Than", "IOI"): Scale 0 to 100, with major ticks at 0, 50, 100.
* Charts 3 & 4 ("Docstring", "IOI Long"): Scale 0 to 1000, with major ticks at 0, 250, 500, 750, 1000.
* **Legends (Bottom-Right of each chart):**
* Charts 1 & 2: Blue line = "GPT-2", Orange line = "Sparse GPT-2".
* Charts 3 & 4: Green line = "OLMo-7B", Pink line = "Sparse OLMo-7B".
* **Annotations (Within each chart):** A horizontal dashed black line connects a point on the sparse model's curve to a point on the dense model's curve at the same y-value (Explained Effect ≈ 0.95). A text label above this line indicates the speedup factor: "4.5x", "2.2x", "2.2x", and "1.4x" respectively.
### Detailed Analysis
**Chart 1: Greater Than**
* **Trend:** Both lines show a sigmoidal increase. The orange "Sparse GPT-2" line rises much more steeply from the origin than the blue "GPT-2" line.
* **Data Points (Approximate):** The Sparse GPT-2 curve reaches an Explained Effect of ~0.95 with approximately 22 heads kept. The standard GPT-2 curve reaches the same effect with approximately 100 heads kept.
* **Annotation:** The dashed line and label "4.5x" indicate that Sparse GPT-2 achieves the target performance with roughly 4.5 times fewer heads than standard GPT-2.
**Chart 2: IOI**
* **Trend:** Similar sigmoidal pattern. The orange "Sparse GPT-2" line again has a steeper initial ascent.
* **Data Points (Approximate):** Sparse GPT-2 reaches Explained Effect ~0.95 with ~45 heads. Standard GPT-2 reaches it with ~100 heads.
* **Annotation:** The "2.2x" label indicates a 2.2x efficiency gain for the sparse model on this task.
**Chart 3: Docstring**
* **Trend:** Both lines increase, but with more noise/fluctuation than the GPT-2 charts. The pink "Sparse OLMo-7B" line rises more sharply initially.
* **Data Points (Approximate):** Sparse OLMo-7B reaches Explained Effect ~0.95 with ~250 heads. Standard OLMo-7B reaches it with ~550 heads.
* **Annotation:** The "2.2x" label indicates a 2.2x efficiency gain.
**Chart 4: IOI Long**
* **Trend:** Both lines show a sharp, almost step-like increase. The pink "Sparse OLMo-7B" line's ascent is significantly earlier and steeper.
* **Data Points (Approximate):** Sparse OLMo-7B reaches Explained Effect ~0.95 with ~400 heads. Standard OLMo-7B reaches it with ~560 heads.
* **Annotation:** The "1.4x" label indicates a 1.4x efficiency gain, the smallest among the four tasks.
### Key Observations
1. **Consistent Superiority of Sparse Models:** In all four tasks, the sparse model variant (orange/pink) achieves a high "Explained Effect" with fewer attention heads than its dense counterpart (blue/green).
2. **Task-Dependent Speedup:** The efficiency gain (speedup factor) varies significantly by task, ranging from 1.4x to 4.5x. The "Greater Than" task shows the most dramatic benefit from sparsification.
3. **Performance Ceiling:** All models eventually approach an Explained Effect of 1.0, indicating that with enough heads, both dense and sparse models can fully explain the effect. The key difference is the *rate* of convergence.
4. **Model Architecture Difference:** The OLMo-7B charts (3 & 4) operate on a scale of hundreds of heads (x-axis up to 1000), while the GPT-2 charts (1 & 2) operate on a scale of tens of heads (x-axis up to 100), reflecting the different sizes/architectures of the base models.
### Interpretation
This set of charts provides strong empirical evidence for the **efficiency of sparse attention mechanisms** in transformer language models. The "Explained Effect" likely measures how well a subset of attention heads can replicate the full model's behavior on specific linguistic tasks (e.g., "Greater Than" for numerical comparison, "IOI" for indirect object identification).
The core finding is that **pruning (sparsifying) the model retains the most critical computational pathways**. The sparse models reach high performance with a fraction of the active parameters (heads), suggesting that dense models contain significant redundancy. The varying speedup factors (1.4x to 4.5x) imply that the "importance" or "dispersal" of critical information across heads is task-dependent. Some tasks ("Greater Than") may rely on a very small, core set of heads that are preserved in the sparse model, while others ("IOI Long") may have their critical functions more distributed, leading to a smaller efficiency gain upon pruning.
From a practical standpoint, this data supports the use of sparse models for **reducing computational cost and memory footprint** during inference without a proportional loss in performance on targeted tasks. The charts serve as a validation of the sparsification technique used, showing it successfully identifies and preserves the most salient model components.
</details>
Figure 4: Logit attribution keeping only the top-$k$ attention heads. The dotted line marks the number of attention heads needed to explain 90% of the logit difference. Sparse models yield 1.4$\times$ to 4.5$\times$ smaller circuits. Shaded areas show standard error across 20 prompts.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Comparative Analysis of Sparse vs. Dense Model Efficiency
### Overview
The image displays a 1x4 grid of line charts comparing the performance of standard ("dense") neural network models against their "sparse" counterparts. Each chart plots the "Explained Effect" (y-axis) against the "Number of Edges Kept" (x-axis, logarithmic scale). The central finding is that sparse models achieve the same level of explained effect with significantly fewer parameters (edges), with the efficiency gain (multiplier) varying by task.
### Components/Axes
* **Chart Type:** 4 separate line charts arranged horizontally.
* **X-Axis (All Charts):** "Number of Edges Kept". Scale is logarithmic (base 10).
* Charts 1 & 2 ("Greater Than", "IOI"): Range from 10⁰ (1) to 10⁴ (10,000).
* Charts 3 & 4 ("Docstring", "IOI Long"): Range from 10¹ (10) to 10⁵ (100,000).
* **Y-Axis (All Charts):** "Explained Effect". Linear scale from 0.0 to 1.0.
* **Legends:** Located in the bottom-right corner of each subplot.
* Chart 1: Blue line = "GPT-2", Orange line = "Sparse GPT-2"
* Chart 2: Blue line = "GPT-2", Orange line = "Sparse GPT-2"
* Chart 3: Green line = "OLMo-7B", Pink line = "Sparse OLMo-7B"
* Chart 4: Green line = "OLMo-7B", Pink line = "Sparse OLMo-7B"
* **Annotations:** Each chart contains a dashed horizontal line near the top (y ≈ 0.95–0.98) with a multiplier value (e.g., "97.0x") indicating the relative efficiency of the sparse model.
### Detailed Analysis
**Chart 1: "Greater Than"**
* **Trend Verification:** The orange "Sparse GPT-2" curve rises steeply from the left, reaching near-maximum explained effect (~0.95) at approximately 10² edges. The blue "GPT-2" curve rises much more gradually, reaching the same effect level at nearly 10⁴ edges.
* **Key Data Point/Annotation:** "97.0x". This indicates the sparse model requires roughly 97 times fewer edges to achieve the same high explained effect as the dense model on this task.
* **Spatial Grounding:** The dashed line and "97.0x" label are positioned in the upper center of the plot area.
**Chart 2: "IOI"**
* **Trend Verification:** Similar pattern to Chart 1. The orange "Sparse GPT-2" curve is significantly to the left of the blue "GPT-2" curve, indicating superior parameter efficiency.
* **Key Data Point/Annotation:** "42.8x". The efficiency gain is substantial but less extreme than for the "Greater Than" task.
* **Spatial Grounding:** Annotation is centered in the upper plot area.
**Chart 3: "Docstring"**
* **Trend Verification:** The pink "Sparse OLMo-7B" curve is to the left of the green "OLMo-7B" curve, but the gap between them is narrower than in the first two charts. Both curves have a sigmoidal shape.
* **Key Data Point/Annotation:** "8.6x". The sparse model is still more efficient, but the advantage is an order of magnitude smaller than for the GPT-2 models on the previous tasks.
* **Spatial Grounding:** Annotation is centered in the upper plot area.
**Chart 4: "IOI Long"**
* **Trend Verification:** The pink "Sparse OLMo-7B" curve is again to the left of the green "OLMo-7B" curve. The curves are closer together than in Chart 3.
* **Key Data Point/Annotation:** "5.4x". This represents the smallest efficiency multiplier among the four charts.
* **Spatial Grounding:** Annotation is centered in the upper plot area.
### Key Observations
1. **Universal Efficiency Gain:** In all four tasks, the sparse model variant achieves any given level of "Explained Effect" with fewer edges than its dense counterpart.
2. **Diminishing Multiplier:** The efficiency multiplier decreases dramatically across the charts: 97.0x → 42.8x → 8.6x → 5.4x.
3. **Task/Model Dependency:** The magnitude of the sparsification benefit is highly dependent on both the model architecture (GPT-2 vs. OLMo-7B) and the specific task ("Greater Than" vs. "IOI" vs. "Docstring" vs. "IOI Long").
4. **Curve Shape:** All curves are sigmoidal (S-shaped), indicating a phase transition where explanatory power rapidly increases after a certain threshold of edges is retained.
### Interpretation
This data provides strong empirical evidence for the **pruning hypothesis** in neural networks: that a significant portion of a model's parameters (edges) are redundant for specific tasks. The "Explained Effect" likely measures how well a subnetwork (defined by the kept edges) can replicate the full model's behavior or performance.
The key insight is that **the degree of redundancy is not constant**. The massive 97x multiplier for "Greater Than" suggests this is a relatively simple, localized computation that can be encoded in a very small sub-circuit of GPT-2. In contrast, the "IOI Long" task (likely a more complex, long-range dependency problem) shows less redundancy (5.4x), implying its solution is more distributed across the network's parameters.
The transition from GPT-2 to the larger OLMo-7B model, and from simpler to more complex tasks, shows a clear trend: **as task complexity and/or model scale increases, the relative efficiency gain from sparsification decreases, though it remains positive.** This has profound implications for model compression and efficient AI, suggesting that while pruning is universally beneficial, its most dramatic savings are found in simpler cognitive tasks or within smaller models. The consistent sigmoidal curves further suggest there exists a critical minimal subnetwork size required to solve a task, below which performance collapses.
</details>
Figure 5: Logit attribution per sentence keeping only the top-$k$ attention edges. Sparse models yield 5.4$\times$ to 97$\times$ smaller circuits. Shaded areas show standard error across 20 prompts.
We begin by outlining the experimental procedure used for circuit discovery. Activation patching (Nanda et al., 2023) is a widely used technique for identifying task-specific circuits in transformer models. In a typical setup, the model is evaluated on pairs of prompts: a clean prompt, for which the model predicts a correct target token, and a corrupted prompt that shares the overall structure of the clean prompt but is modified to induce an incorrect prediction. The goal is to find the set of model components responsible for the model's preference for the correct answer over the wrong one, as measured by the logit difference between the corresponding tokens. In activation patching, individual model components, such as attention heads and individual edges, can be "switched off" by patching activations at specific positions. Circuit discovery amounts to finding a set of components whose replacement causes the model's prediction to shift from the correct to the corrupted answer.
Since searching over every possible subset of model components is infeasible due to the exponential number of potential subsets, we adopt a common heuristic to rank each model component. Specifically, for each individual component, we compute an importance score by replacing the activations of the component with the corrupted activations and measuring its effect on the logit difference. In our experiments, we use this ranking to select the top- $k$ components and intervene on the model by freezing all remaining components, with the goal of identifying the minimal set that accounts for at least 90% of the modelâs preference for the correct prediction. Note that these importance scores can be computed at two levels: (i) a single-sentence level, using a single pair of correct and corrupted inputs, and (ii) a global level, obtained by averaging scores across many task variants. In our experiments, we report the results using single-sentence scores. In Appendix D, we also provide results using the global scores, which are largely consistent with our main results. There are also two standard approaches for freezing component activations: setting the activation to zero or replacing it with a mean activation value (Conmy et al., 2023). We evaluate both variants for each model and report results for the patching strategy that yields the smallest circuits.
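The ranking heuristic above can be sketched in a few lines. The sketch below assumes the per-component importance scores (drops in logit difference when that component is patched with corrupted activations) have already been measured; the function name and all numbers are purely illustrative, not taken from our experiments.

```python
import numpy as np

def rank_components(importance_scores, clean_logit_diff, threshold=0.9):
    """Rank components by importance and return the smallest top-k set whose
    summed effect explains at least `threshold` of the clean logit difference.
    (Illustrative: real scores come from per-component patching runs.)"""
    order = np.argsort(importance_scores)[::-1]   # most important first
    cum = np.cumsum(importance_scores[order])     # cumulative explained effect
    k = int(np.searchsorted(cum, threshold * clean_logit_diff)) + 1
    return order[:k]

# Toy example: 6 components, clean logit difference of 4.0.
scores = np.array([0.1, 2.0, 0.2, 1.5, 0.05, 0.3])
kept = rank_components(scores, clean_logit_diff=4.0)
```

The greedy top-$k$ selection is a heuristic: it does not account for interactions between components, which is why the resulting set is verified by intervening on the model with all remaining components frozen.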
We first focus on the copy task with the following prompt: "AJEFCKLMOPQRSTVWZS, AJEFCKLMOPQRSTVWZ", where the model has to copy the letter S to the next token position. This task is well studied and is widely believed to be implemented by emergent induction heads (Elhage et al., 2021), which propagate token information forward in the sequence. Figure 3 illustrates the attention patterns of the set of attention heads that explains this prompt for the sparse and base GPT-2 models. See Appendix D for analogous results for the OLMo models. The sparse model admits a substantially smaller set of attention heads (9 heads) than its fully connected counterpart (61 heads). Moreover, the identified heads in the sparse model exhibit cleaner induction head patterns, with each token attending to a single prior position at a fixed relative offset. These results illustrate how sparsification facilitates interpretability under simple ranking-based methods and support our hypothesis that sparse post-training yields models that are more amenable to mechanistic interpretability techniques.
To further verify our hypothesis, we repeat the experiment on classical circuit discovery tasks. For GPT-2, we evaluate variants of the Indirect Object Identification (IOI) task, in which the model copies a person's name from the start of a sentence, and the Greater Than task, in which the model predicts a number that is larger than a previously mentioned number. To further assess the scalability of our approach, we investigate more challenging and longer-horizon tasks for OLMo, including a longer-context IOI task and a Docstring task in which the model must predict an argument name in a docstring based on the implemented function. Details of each task can be found in Appendix E. Figures 4 and 5 show the fraction of model behaviour explained as a function of the number of retained model components (attention heads and attention edges, respectively). Across all tasks and models, the sparse models consistently produce significantly smaller circuits, as measured by the number of model components needed to explain 90% of model prediction. This further corroborates our claim that sparse models lead to simpler and more interpretable internal circuits.
### 4.3 Attribution Graphs
Next, we present a more fine-grained, feature-level investigation of whether sparsity in attention leads to interpretable circuits in practice, using cross-layer transcoders (CLTs). Since training CLTs on OLMo-7B is computationally prohibitive (at the time of writing, the largest open-source CLT is for Gemma-2B), we focus our analysis on the GPT-2 models. For the rest of the section, we perform our analysis on CLTs trained on the sparse and base GPT-2 models with an expansion factor of $32$; both achieve above $80\%$ replacement score as measured with Circuit Tracer (Hanna et al., 2025). See Appendices F and G for details on training and visualisation.
We study the problem of attention attribution, which seeks to understand how edges between features are mediated. The key challenge here is that any given edge can be affected by a large number of model components, making mediation circuits difficult to analyse both computationally and conceptually: computationally, exhaustive enumeration is costly; conceptually, the resulting circuits are often large and uninterpretable. In this experiment, we demonstrate that sparse attention patterns induced via post-training substantially alleviate these challenges, as the vast majority of attention components have zero effect on the computation.
As in (Ameisen et al., 2025), we define the total attribution score between feature $n$ at layer $\ell$ and position $k$ , and feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$ as
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}=f_{k,n}^{\ell}\;J_{\ell,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}. \tag{6}
$$
Here, $f_{k,n}^{\ell}$ denotes the decoder vector corresponding to feature $n$ at layer $\ell$ and position $k$ , and $g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}$ is the corresponding encoder vector for feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$ . The term $J_{\ell,k}^{\ell^{\prime},k^{\prime}}$ is the Jacobian from the MLP output at $(\ell,k)$ to the MLP input at $(\ell^{\prime},k^{\prime})$ . This Jacobian is computed during a forward pass in which all nonlinearities are frozen using stop-gradient operations. Under this linearisation, the attribution score represents the sum over all linear paths from the source feature to the target feature.
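Because all nonlinearities are frozen, the attribution in Eq. (6) is a single bilinear form and decomposes linearly over paths. A minimal numerical sketch with random, purely hypothetical vectors illustrates this additivity (the dimension and path count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # residual-stream width (illustrative)

f = rng.normal(size=d)         # decoder vector of source feature (l, k, n)
g = rng.normal(size=d)         # encoder vector of target feature (l', k', n')

# Hypothetical per-path Jacobian terms; the full frozen Jacobian is their sum.
J_paths = [rng.normal(size=(d, d)) for _ in range(3)]
J = sum(J_paths)

a = f @ J @ g                            # total attribution score, Eq. (6)
a_paths = [f @ Jp @ g for Jp in J_paths] # per-path attributions
```

By linearity, `a` equals the sum of the per-path attributions, matching the interpretation of Eq. (6) as a sum over all linear paths from source to target feature.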
To analyse how this total effect between two features is mediated by each model component, we define the component-specific attribution by subtracting the contribution of all paths that do not pass through the component:
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}(h)=f_{k,n}^{\ell}\;J_{\ell,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}-f_{k,n}^{\ell}\;\bigl[J_{\ell,k}^{\ell^{\prime},k^{\prime}}\bigr]_{h}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}.
$$
Here, $\bigl[J_{\ell,k}^{\ell^{\prime},k^{\prime}}\bigr]_{h}$ denotes a modified Jacobian computed under the same linearisation as above, but with the specific attention component $h$ additionally frozen via stop-gradient. As such, these component-specific scores quantify how much each model component contributes to a particular edge between features.
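A toy numerical sketch of this subtraction, under the simplifying (hypothetical) assumption that the Jacobian decomposes additively into per-component path matrices. In the actual method, $[J]_h$ comes from a second linearised pass with component $h$ frozen via stop-gradient, not from an explicit decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
f, g = rng.normal(size=d), rng.normal(size=d)

# Hypothetical: full Jacobian as a sum of per-component path matrices.
J_h = {h: rng.normal(size=(d, d)) for h in range(4)}
J = sum(J_h.values())

def component_attribution(h):
    """a(h) = f J g - f [J]_h g, where [J]_h keeps only paths avoiding h."""
    J_frozen = J - J_h[h]      # paths that do NOT pass through component h
    return f @ J @ g - f @ J_frozen @ g

scores = {h: component_attribution(h) for h in J_h}
```

In this linear toy case the component scores sum exactly to the total attribution; in general, freezing a component also removes its interactions with later paths, so the decomposition need not be additive.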
Empirically, we evaluate the method on ten pruned attribution graphs, computed on the IOI, Greater Than, completion, and category tasks. As in our previous circuit discovery experiment, we compute attribution scores at the level of attention heads as well as individual key–query pairs. In practice, attention sparsity yields substantial computational savings: because inactive key–query pairs are known a priori to have exactly zero attribution score, attribution need only be computed for a small subset of components. This reduces the computation time per attribution graph from several hours to several minutes.
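The a-priori-zero shortcut amounts to evaluating the expensive per-component score only on the active key–query pairs. A minimal sketch with a hypothetical activity mask and a dummy scoring function (the sparsity level is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs = 10_000
active = rng.random(n_pairs) < 0.004   # hypothetical mask: ~0.4% of pairs active

def attribution(i):
    """Stand-in for the (expensive) component-specific attribution pass."""
    return 1.0

# Inactive pairs are exactly zero by construction, so we skip them entirely.
scores = np.zeros(n_pairs)
idx = np.flatnonzero(active)
scores[idx] = [attribution(i) for i in idx]
```

The wall-clock saving scales with the fraction of inactive pairs, which is why reducing attention connectivity to a fraction of a percent turns an hours-long computation into minutes.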
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Charts: Edges and Heads Cumulative Mass Comparison
### Overview
The image displays two side-by-side line charts comparing the "Mean Cumulative Mass" of "Sparse" versus "Non Sparse" methods across two different domains: "Edges" (left chart) and "Heads" (right chart). Both charts plot cumulative mass against a sorted index, demonstrating that the Sparse method achieves higher cumulative mass with fewer elements.
### Components/Axes
**Common Elements:**
* **Y-Axis (Both Charts):** Labeled "Mean Cumulative Mass". The scale ranges from 0.50 to 1.00, with major tick marks at 0.50, 0.75, and 1.00.
* **Legend (Located in bottom-right of the "Heads" chart):**
* **Blue Line:** "Non Sparse"
* **Orange Line:** "Sparse"
**Left Chart: "Edges"**
* **Title:** "Edges" (top-left).
* **X-Axis:** Labeled "Sorted Index (log scale)". It is a logarithmic scale with major tick marks at 10Ⱐ(1), 10š (10), 10² (100), and 10³ (1000).
* **Annotation:** A dashed black horizontal line connects the two curves near the top. The text "16.1x" is placed above this line, indicating a multiplicative factor.
**Right Chart: "Heads"**
* **Title:** "Heads" (top-left).
* **X-Axis:** Labeled "Sorted Index". It is a linear scale with major tick marks at 25, 50, 75, 100, and 125.
* **Annotation:** A dashed black horizontal line connects the two curves near the top. The text "3.4x" is placed above this line, indicating a multiplicative factor.
### Detailed Analysis
**Trend Verification & Data Points:**
* **General Trend (Both Charts):** The "Sparse" (orange) line rises more steeply and plateaus at a cumulative mass of 1.00 much earlier (at a lower sorted index) than the "Non Sparse" (blue) line. This indicates the Sparse method concentrates mass in fewer top-ranked elements.
* **"Edges" Chart (Log Scale):**
* The Sparse line starts at approximately 0.45 at index 1 (10⁰) and reaches near 1.00 by index ~100 (10²).
* The Non Sparse line starts near 0 at index 1 and rises more gradually, reaching near 1.00 by index ~1000 (10³).
* The "16.1x" annotation suggests that to achieve a specific high cumulative mass (visually estimated at ~0.95), the Non Sparse method requires approximately 16.1 times more sorted elements than the Sparse method.
* **"Heads" Chart (Linear Scale):**
* The Sparse line starts at approximately 0.50 at index 0 and reaches near 1.00 by index ~25.
* The Non Sparse line starts near 0 at index 0 and rises more gradually, reaching near 1.00 by index ~125.
* The "3.4x" annotation suggests that to achieve a specific high cumulative mass (visually estimated at ~0.95), the Non Sparse method requires approximately 3.4 times more sorted elements than the Sparse method.
### Key Observations
1. **Consistent Superiority of Sparse:** In both domains (Edges and Heads), the Sparse method demonstrates superior efficiency, accumulating mass significantly faster.
2. **Magnitude of Difference:** The efficiency gain is more dramatic for "Edges" (16.1x) than for "Heads" (3.4x), as indicated by the annotations.
3. **Scale Context:** The "Edges" chart uses a logarithmic x-axis, compressing a wide range of indices (1 to 1000+), while the "Heads" chart uses a linear axis over a narrower range (0 to ~140). This difference in scale is crucial for interpreting the absolute index values.
4. **Convergence:** Both methods in both charts eventually converge to a cumulative mass of 1.00, meaning all mass is accounted for when considering all elements.
### Interpretation
These charts likely visualize the concept of **sparsity** in a technical context, such as neural network pruning, graph theory, or attention mechanisms. The "Mean Cumulative Mass" represents the proportion of a total quantity (e.g., weight magnitude, signal strength, importance score) captured by the top-ranked elements.
* **What the Data Suggests:** The Sparse method is highly effective at identifying and concentrating the most significant elements. A small subset of "Sparse" edges or heads contains the vast majority of the "mass" or importance. The Non Sparse distribution is more diffuse, requiring many more elements to capture the same amount of information.
* **Relationship Between Elements:** The "Edges" and "Heads" likely refer to different components of a system (e.g., edges in a graph neural network, attention heads in a transformer). The analysis shows sparsity is beneficial in both, but the effect is more pronounced for edges.
* **Notable Anomalies/Outliers:** There are no apparent outliers in the smooth curves. The key anomaly is the stark difference in the rate of accumulation between the two methods, which is the central point of the visualization.
* **Practical Implication:** The multipliers (16.1x, 3.4x) quantify potential efficiency gains. For example, if one can achieve 95% of the performance using only the top 10% of elements identified by the Sparse method, this translates to significant computational savings or model compression. The charts provide empirical evidence for the effectiveness of a sparsity-inducing technique.
</details>
Figure 6: Mean cumulative distribution of the component scores that mediate an attribution-graph edge. Components are key–query pairs within a head (left) and full attention heads (right).
In terms of circuit size, Figure 6 shows the mean cumulative distribution of component attribution scores for each edge in the attribution graph. We find that, to reach a cumulative attribution threshold of $90\%$, the sparse model on average requires $16.1\times$ fewer key–query pairs and $3.4\times$ fewer attention heads than the dense GPT-2 model, supporting our hypothesis that sparse attention patterns lead to simpler mediation circuits.
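The circuit-size numbers above are threshold counts read off cumulative curves like those in Figure 6. A small sketch shows how such a count is computed from a score vector; the two example profiles are illustrative stand-ins for a concentrated (sparse-like) and a diffuse (dense-like) distribution, not the paper's data:

```python
import numpy as np

def components_to_threshold(scores, threshold=0.9):
    """Number of top components whose cumulative (normalised) attribution
    mass reaches `threshold`."""
    mass = np.sort(np.abs(scores))[::-1]      # largest scores first
    cum = np.cumsum(mass) / mass.sum()        # normalised cumulative mass
    return int(np.searchsorted(cum, threshold)) + 1

# Illustrative profiles: concentrated vs. uniformly spread attribution.
sparse_like = np.array([10.0, 5.0, 1.0, 0.5, 0.3, 0.2])
dense_like = np.ones(100)

k_sparse = components_to_threshold(sparse_like)  # few components carry the mass
k_dense = components_to_threshold(dense_like)    # mass spread over many
```

The ratio of the two counts at a fixed threshold is exactly the kind of multiplier (e.g., $16.1\times$ for key–query pairs) annotated in Figure 6.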
Figure 7: Sketch of the attribution graph for the sentence "The opposite of 'large' is". The cluster of features associated with large at token position 5 maps directly to the final next-token prediction logit small. We show the attention patterns of all key–query pairs required to account for $80\%$ of the cumulative attribution score. In the sparse-attention setting, this corresponds to five attention heads, compared to more than forty heads in the dense-attention case. In the sparse model, these heads read from token position 5 and write directly to the last-token residual stream at token position 8. These heads thus compute in parallel and provide a clear picture of the internal computation.
Next, we present a qualitative case study to showcase the benefits of sparse attention patterns. For a given key–query pair, we compute the causal effect from all other features in the attribution graph to both the key and the query vectors. Figure 7 illustrates this analysis for the prompt "The opposite of 'large' is". The resulting attribution graph decomposes into four coherent clusters of features: features related to opposite, features related to large, features activating on bracketed tokens, and the final next-token logit corresponding to small (see Appendix H for examples of features and their visualization).
Here, the features in the large cluster are directly connected to the small logit. The key question is then to understand how this connection from the large cluster to the small logit comes about. To this end, we analyse their mediation structure. We find that $80\%$ of the cumulative attribution score of the edges connecting the large cluster to the small logit is mediated by the same five late-layer attention key–query pairs. These attention components map features from token position $5$ directly into the final-layer residual stream at position $8$, and thus operate in parallel.
For these five key–query pairs, we then compute the causal influence of all other features in the graph on their key and query vectors. The query vectors are primarily modulated by features associated with bracketed tokens in the last token position, while the key vectors are driven by strongly active features in both the opposite and large clusters, as shown in Figure 8. These results are in agreement with recent work on attention attribution and the "opposite of" attribution graph (Kamath et al., 2025). In stark contrast, Figure 7 (left) shows that a similar (and more computationally expensive) analysis on the dense model produces a much more complicated circuit. This case study illustrates the potential of sparse attention in the context of attribution graphs, as it enables a unified view of features and circuits. By jointly analyzing feature activations, attention components, and their mediating roles, we obtain a more faithful picture of the computational graph underlying the model's input–output behavior.
## 5 Conclusion
Achieving interpretability requires innovations in both interpretation techniques and model design. We investigate how large models can be trained to be intrinsically interpretable. We present a flexible post-training procedure that sparsifies transformer attention while preserving the original pretraining loss. By minimally adapting the architecture, we apply a sparsity penalty under a constrained-loss objective, allowing the pre-trained model to reorganise its connectivity into a much more selective and structured pattern.
$\rightarrow$ Key: 1. large (pos 5); 2. large (pos 5); 3. quantities (pos 5); 4. comparison (pos 3); 5. opposite (pos 3)
$\rightarrow$ Query: 1. bracket (pos 8); 2. bracket (pos 8); 3. bracket (pos 8); 4. bracket (pos 8); 5. bracket (pos 8)
Figure 8: Minimal description of the top-5 features activating the query and the key vectors of attention head L8-H6 from Figure 7.
Mechanistically, this induced sparsity gives rise to substantially simpler circuits: task-relevant computation concentrates into a small number of attention heads and edges. Across a range of tasks and analyses, we show that sparsity improves interpretability at the circuit level by reducing the number of components involved in specific behaviours. In circuit discovery experiments, most of the model's behaviour can be explained by circuits that are orders of magnitude smaller than in dense models; in attribution graph analyses, the reduced number of mediating components renders attention attribution tractable. Together, these results position sparse post-training of attention as a practical and effective tool for enhancing the mechanistic interpretability of pre-trained models.
#### Limitations and Future Work.
One limitation of the present investigation is that, while we deliberately focus on sparsity as a post-training intervention, it remains an open question whether injecting a sparsity bias directly during training would yield qualitatively different or simpler circuit structures. Also, a comprehensive exploration of the performance trade-offs for larger models and for tasks that require very dense or long-range attention patterns would be beneficial, even if beyond the computational means currently at our disposal. Moreover, while our study is primarily restricted to sparsifying attention patterns, the underlying principle of leveraging sparsity to promote interpretability naturally extends to other components of the transformer architecture. As such, combining the proposed method with complementary approaches for training intrinsically interpretable models, such as Sparse Mixture-of-Experts (Yang et al., 2025), sparsifying model weights (Gao et al., 2024), or limiting superposition, offers a promising direction for future work. Another exciting avenue for future work is to apply the sparsity regularisation framework developed here within alternative post-training paradigms, such as reinforcement learning (Ouyang et al., 2022; Zhou et al., 2024) or supervised fine-tuning (Pareja et al., 2025).
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
## Acknowledgment
F. D. acknowledges support through a fellowship from the Hector Fellow Academy. A. L. is supported by an EPSRC Programme Grant (EP/V000748/1). I. P. holds concurrent appointments as a Professor of Applied AI at the University of Oxford and as an Amazon Scholar. This paper describes work performed at the University of Oxford and is not associated with Amazon.
## References
- E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. Ben Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson (2025) Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread. External Links: Link Cited by: Appendix F, Appendix G, Appendix G, §1, §2.3, §4.3.
- I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §2.1.
- L. Bereska and E. Gavves (2024) Mechanistic interpretability for ai safetyâa review. arXiv preprint arXiv:2404.14082. Cited by: §1.
- R. Bommasani et al. (2021) On the opportunities and risks of foundation models. arXiv. External Links: Link Cited by: §1.
- R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §2.1.
- T. Conerly, H. Cunningham, A. Templeton, J. Lindsey, B. Hosmer, and A. Jermyn (2025) Circuits updates – January 2025. Note: Transformer Circuits Thread External Links: Link Cited by: Appendix F.
- A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso (2023) Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems 36, pp. 16318–16352. Cited by: §E.2, §E.3, §1, §1, §2.2, §4.2.
- T. Dao (2023) Flashattention-2: faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691. Cited by: Appendix B, §3.3.
- DeepSeek-AI (2025) DeepSeek-v3.2: pushing the frontier of open large language models. External Links: 2512.02556, Link Cited by: §2.1.
- J. Dunefsky, P. Chlenski, and N. Nanda (2024) Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems 37, pp. 24375–24410. Cited by: §2.3.
- N. Elhage, N. Nanda, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. Note: https://transformer-circuits.pub/2021/framework/index.html Cited by: §E.1, §E.2, §4.2.
- L. Gao, A. Rajaram, J. Coxon, S. V. Govande, B. Baker, and D. Mossing (2024) Weight-sparse transformers have interpretable circuits. Technical report OpenAI. External Links: Link Cited by: §5.
- A. Gokaslan and V. Cohen (2019) OpenWebText corpus. Note: http://Skylion007.github.io/OpenWebTextCorpus Cited by: §4.
- D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, N. Subramani, M. Wortsman, P. Dasigi, N. Lambert, K. Richardson, J. Dodge, K. Lo, L. Soldaini, N. A. Smith, and H. Hajishirzi (2024) OLMo: accelerating the science of language models. Preprint. Cited by: §4.
- Y. Gu, L. Dong, F. Wei, and M. Huang (2024) MiniLLM: knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, External Links: Link Cited by: §3.3.
- A. Gupta, G. Dar, S. Goodman, D. Ciprut, and J. Berant (2021) Memory-efficient transformers via top-$k$ attention. arXiv preprint arXiv:2106.06899. Cited by: §2.1.
- M. Hanna, M. Piotrowski, J. Lindsey, and E. Ameisen (2025) Circuit-tracer. Note: https://github.com/safety-research/circuit-tracer The first two authors contributed equally and are listed alphabetically. Cited by: Appendix G, §4.3.
- S. Heimersheim and J. Janiak (2023) A circuit for python docstrings in a 4-layer attention-only transformer. In Alignment Forum, Cited by: §E.3.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) Lora: low-rank adaptation of large language models.. ICLR 1 (2), pp. 3. Cited by: §3.3.
- E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, External Links: Link Cited by: §3.1.
- H. Kamath, E. Ameisen, I. Kauvar, R. Luger, W. Gurnee, A. Pearce, S. Zimmerman, J. Batson, T. Conerly, C. Olah, and J. Lindsey (2025) Tracing attention computation through feature interactions. Transformer Circuits Thread. External Links: Link Cited by: §1, §2.3, §4.3.
- A. Lei, B. SchÜlkopf, and I. Posner (2025) SPARTAN: a sparse transformer world model attending to what matters. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §2.1, §3.2, §3.
- J. Lindsey, E. Ameisen, N. Nanda, S. Shabalin, M. Piotrowski, T. McGrath, M. Hanna, O. Lewis, C. Tigges, J. Merullo, C. Watts, G. Paulo, J. Batson, L. Gorton, E. Simon, M. Loeffler, C. McDougall, and J. Lin (2025a) The circuits research landscape: results and perspectives. Neuronpedia. External Links: Link Cited by: Appendix F.
- J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, N. L. Turner, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. B. Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson (2025b) On the biology of a large language model. Transformer Circuits Thread. External Links: Link Cited by: §2.3.
- N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt (2023) Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217. Cited by: §E.1, §1, §2.2, §4.2.
- C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. (2022) In-context learning and induction heads. arXiv preprint arXiv:2209.11895. Cited by: §1.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744. Cited by: §5.
- A. Pareja, N. S. Nayak, H. Wang, K. Killamsetty, S. Sudalairaj, W. Zhao, S. Han, A. Bhandwaldar, G. Xu, K. Xu, L. Han, L. Inglis, and A. Srivastava (2025) Unveiling the secret recipe: a guide for supervised fine-tuning small LLMs. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §5.
- Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huang, D. Liu, J. Zhou, and J. Lin (2025) Gated attention for large language models: non-linearity, sparsity, and attention-sink-free. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §2.1.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §4.
- D. J. Rezende and F. Viola (2018) Taming vaes. External Links: 1810.00597, Link Cited by: §3.2.
- L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar, L. Lucy, X. Lyu, N. Lambert, I. Magnusson, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, A. Ravichander, K. Richardson, Z. Shen, E. Strubell, N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer, N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld, J. Dodge, and K. Lo (2024) Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint. Cited by: §4.
- A. Syed, C. Rager, and A. Conmy (2024) Attribution patching outperforms automated circuit discovery. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 407–416. Cited by: §2.2.
- X. Yang, C. Venhoff, A. Khakzar, C. S. de Witt, P. K. Dokania, A. Bibi, and P. Torr (2025) Mixture of experts made intrinsically interpretable. arXiv preprint arXiv:2503.07639. Cited by: §5.
- M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. (2020) Big bird: transformers for longer sequences. Advances in Neural Information Processing Systems 33, pp. 17283–17297. Cited by: §2.1.
- F. Zhang and N. Nanda (2023) Towards best practices of activation patching in language models: metrics and methods. arXiv preprint arXiv:2309.16042. Cited by: §2.2.
- Y. Zhou, A. Zanette, J. Pan, S. Levine, and A. Kumar (2024) ArCHer: training language model agents via hierarchical multi-turn rl. In ICML, External Links: Link Cited by: §5.
## Appendix A Two-Digit Addition Study
Figure 9: Simple example showing the attention patterns (shown in blue) of sparse and non-sparse transformers trained on a two-digit addition task. Both models correctly predict the sum, but their attention patterns differ markedly: the non-sparse model solves the task with highly dispersed information flow, while the sparse model uses a highly interpretable attention pattern: in Layer 0, the model first attends to the corresponding digits to be added; then, in Layer 1, it attends to the carry bit only if it is needed (see middle and right columns, where the model has to carry once and twice, respectively).
In the introduction, we used a two-digit addition task to demonstrate how sparse attention patterns can lead to intrinsically interpretable circuits. The results presented there were gathered in the small-scale toy experiment described below. We train 4-layer, single-head Transformer models on a two-digit addition task, where the input is a sequence of digits and the model is trained to predict the sum. The task uses 13 tokens in total: the ten digits and the three symbols '+', '=' and '?'.
Within this setting, we train two models: a standard transformer model and a sparse transformer with a fixed sparsity regularisation strength. Figure 9 shows several examples of the learned attention patterns. In these examples, we can clearly see that the pressure of sparsity leads to the emergence of human-recognisable algorithmic patterns: in the first layer, each digit in the answer attends to the corresponding digits in the input, while the second layer computes the carry bit when necessary. By enforcing selective information flow through sparse message-passing, the sparse model is able to learn crisp and localised mechanisms that are immediately amenable to interpretation.
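The toy setup can be reproduced with a simple data generator. The sketch below is our reading of the format in Figure 9 (answers zero-padded to five digits and predicted at the '?' positions); the names and widths are illustrative, not the exact experiment code:

```python
import random

# 13-token vocabulary: the ten digits plus '+', '=' and '?'.
VOCAB = [str(d) for d in range(10)] + ["+", "=", "?"]
TOK = {t: i for i, t in enumerate(VOCAB)}

ANSWER_WIDTH = 5  # answer slots, zero-padded as in Figure 9

def make_example(a=None, b=None, rng=random):
    """Build (input token ids, target token ids) for one two-digit
    addition; the '?' placeholders mark the predicted positions."""
    if a is None:
        a, b = rng.randint(10, 99), rng.randint(10, 99)
    prompt = (list(f"{a:02d}") + ["+"] + list(f"{b:02d}") + ["="]
              + ["?"] * ANSWER_WIDTH)
    answer = str(a + b).zfill(ANSWER_WIDTH)
    return [TOK[t] for t in prompt], [TOK[d] for d in answer]

x, y = make_example(53, 21)
# x decodes to: 5 3 + 2 1 = ? ? ? ? ?   and y to: 0 0 0 7 4
```

The carried examples from Figure 9 (36+28 and 43+59) are generated the same way, which makes it easy to probe the learned carry mechanism on controlled inputs.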
## Appendix B Sparse Attention Implementation
For the experiments, we implemented efficient GPU kernels for the sparse attention layers using the Helion domain-specific language (https://helionlang.com/). We refer to this implementation as Splash Attention (Sparse Flash Attention). Our implementation follows the same core algorithmic structure as FlashAttention-2 (Dao, 2023), including online softmax computation and tiling. Note that the sparse attention variant (Eq. 2) differs from standard attention only by a pointwise multiplication with the adjacency matrix, which can easily be integrated into FlashAttention by computing $A_{ij}$ on the fly. We additionally fuse the Gumbel-softmax computation, the straight-through gradient, and the computation of the expected number of edges (required for the penalty) into a single optimized kernel; the implementation will be released together with the experiment code. Figure 10 compares our Splash Attention implementation against a naive baseline based on PyTorch-native operations.
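As a naive reference for what the fused kernel computes, the adjacency gating can be sketched in plain numpy. This materialises the full attention matrix (exactly what Splash Attention avoids), and the softmax-then-gate order is our reading of Eq. 2, not a definitive restatement of it:

```python
import numpy as np

def sparse_attention(Q, K, V, A):
    """Reference implementation: causal attention whose probabilities
    are gated elementwise by a binary adjacency matrix A (cf. Eq. 2)."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)      # causal mask
    P = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)              # standard softmax
    return (A * P) @ V                              # pointwise gating

rng = np.random.default_rng(0)
T, d = 6, 8
Q, K, V = rng.standard_normal((3, T, d))
dense_out = sparse_attention(Q, K, V, np.ones((T, T)))  # A = all ones
```

With $A$ equal to all ones the gating is inactive and the layer reduces to standard causal attention; zeroing an entry $A_{ij}$ removes that key–query edge from the computation, which is what the fused kernel exploits by skipping the corresponding tiles.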
<details>
<summary>figures/splash_attention.png Details</summary>

(Figure 10 consists of two line charts, "Execution Time Comparison" (ms) and "Memory Peak Comparison" (GB), plotted against sequence length from 32 to 2048. Splash Attention stays near-flat in both metrics, while the naive baseline diverges sharply beyond sequence length ~512, reaching roughly 45 ms and 10 GB at length 2048.)
### Interpretation
The data strongly suggests that **Splash Attention is a highly optimized implementation** designed to overcome the fundamental scalability limitations of the standard ("Naive") attention mechanism in transformers.
* **What the data demonstrates:** The charts provide empirical evidence that Splash Attention successfully decouples computational and memory costs from sequence length in a way that Naive Attention does not. The flat memory curve is particularly significant, as it implies the method likely uses a fixed-memory or streaming algorithm, avoiding the need to materialize large intermediate matrices (like the full attention score matrix).
* **Relationship between elements:** The side-by-side presentation directly correlates the two key performance bottlenecks in deep learning: compute time and memory capacity. It shows that for Naive Attention, these bottlenecks are linked and compound at scale. Splash Attention breaks this link, maintaining low cost in both dimensions.
* **Implications:** This has significant practical implications. Splash Attention would allow processing much longer sequences (e.g., for high-resolution images, long documents, or genomic data) on the same hardware, or processing the same sequences on substantially smaller hardware. Naive Attention becomes practically unusable for sequences beyond ~1024 tokens due to this memory wall, so the charts provide a clear technical justification for adopting the Splash implementation whenever long sequences are required.
</details>
Figure 10: Performance comparison between our implementation (Splash) and a naive PyTorch baseline.
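The flat memory curve in Figure 10 is consistent with a streaming, online-softmax evaluation that never materialises the full $T \times T$ score matrix. As an illustration of that idea only (a minimal pure-Python sketch under our own naming, not the actual Splash kernel), the two strategies can be contrasted as:

```python
import math

def naive_attention(q, k, v):
    """Materialises the full T x T score matrix: O(T^2) peak memory."""
    T, d_v = len(q), len(v[0])
    scores = [[sum(a * b for a, b in zip(q[t], k[s])) for s in range(T)]
              for t in range(T)]
    out = []
    for t in range(T):
        m = max(scores[t])                        # max-subtraction for stability
        w = [math.exp(x - m) for x in scores[t]]
        z = sum(w)
        out.append([sum(w[s] * v[s][d] for s in range(T)) / z for d in range(d_v)])
    return out

def streaming_attention(q, k, v):
    """Online softmax: one query row at a time, O(T) working memory."""
    out = []
    for qt in q:
        m, z, acc = -math.inf, 0.0, [0.0] * len(v[0])
        for ks, vs in zip(k, v):
            s = sum(a * b for a, b in zip(qt, ks))
            m_new = max(m, s)
            # Rescale running statistics when the running max changes.
            scale = math.exp(m - m_new) if m != -math.inf else 0.0
            p = math.exp(s - m_new)
            z = z * scale + p
            acc = [a * scale + p * vd for a, vd in zip(acc, vs)]
            m = m_new
        out.append([a / z for a in acc])
    return out
```

Both functions compute the same softmax-weighted averages; only the peak memory differs, which is the gap the right-hand panel of Figure 10 measures.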
## Appendix C Training Details
### C.1 Hyperparameters and Compute Resources
| Hyperparameter | OLMo | GPT-2 |
| --- | --- | --- |
| Base Model | allenai/OLMo-7B-hf | gpt2 |
| Context window | 512 | 64 |
| Dataset | dolma-v1 | OpenWebText |
| Batch size | 16 | 256 |
| Gradient accumulation steps | 4 | 4 |
| Total steps | 400,000 | 1,200,000 |
| Learning rate | $1\times 10^{-5}$ | $1\times 10^{-5}$ |
| Minimum learning rate | $1\times 10^{-6}$ | $1\times 10^{-6}$ |
| Optimizer | Adam | Adam |
| Weight decay | 0.1 | 0.1 |
| Scheduler | Cosine (1 cycle) | Cosine (1 cycle) |
| Warmup steps | 1,000 | 1,000 |
| Finetuning strategy | LoRA | Full |
| LoRA rank ( $r$ ) | 400 | - |
| LoRA scaling ( $\alpha$ ) | 800 | - |
| LoRA dropout | 0 | - |
| LoRA target modules | q,k,v,o,fc_in,fc_out | - |
| Dual Optimisation LR | 0.01 | 0.1 |
| Target cross-entropy | 2.29 | 3.5 |
Table 2: Key hyperparameters used for sparse post-training of OLMo-7B and GPT-2.
We provide the key hyperparameters for our experiments in Table 2. All training runs are performed on NVIDIA H100 GPUs: the GPT-2 model is trained on a single GPU, while the OLMo model is trained on a node of 8 GPUs. The total training time for both models is roughly 14 days. The main sparse attention code will be made available as a wrapper around the Transformers library; the implementation code as well as the model weights will also be released.
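For concreteness, the learning-rate schedule implied by Table 2 (cosine, one cycle, 1,000 warmup steps, decaying from the peak to the minimum learning rate) can be sketched as below; the linear warmup shape is an assumption on our part, as the table does not specify it:

```python
import math

def lr_at_step(step, total_steps=400_000, warmup_steps=1_000,
               peak_lr=1e-5, min_lr=1e-6):
    """Single-cycle cosine decay with warmup (OLMo settings from Table 2)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warmup (assumed)
    # One cosine half-cycle from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The GPT-2 run uses the same shape with `total_steps=1_200_000`.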
### C.2 Training Dynamics
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Charts: Training Dynamics of a Machine Learning Model
### Overview
The image displays three horizontally aligned line charts that track different metrics over the course of model training. The charts share a common x-axis ("Training Steps") but monitor distinct y-axis variables: Sparsity, Regularization Strength, and Validation Cross Entropy. The data suggests a training process with a significant event or intervention occurring around 100,000 steps.
### Components/Axes
**Common X-Axis (All Charts):**
* **Label:** `Training Steps`
* **Scale:** Linear, from 0 to 400,000.
* **Major Tick Marks:** 0, 100000, 200000, 300000, 400000.
**Left Chart:**
* **Y-Axis Label:** `Sparsity`
* **Y-Axis Scale:** Logarithmic (base 10). Major ticks at `10^-2` (0.01) and `10^-1` (0.1).
* **Data Series:** A single blue line.
**Middle Chart:**
* **Y-Axis Label:** `Regularization Strength`
* **Y-Axis Scale:** Linear, from 0 to 3000.
* **Major Tick Marks:** 0, 500, 1000, 1500, 2000, 2500, 3000.
* **Data Series:** A single blue line.
**Right Chart:**
* **Y-Axis Label:** `Validation Cross Entropy`
* **Y-Axis Scale:** Linear, from 2.0 to 3.0.
* **Major Tick Marks:** 2.0, 2.2, 2.4, 2.6, 2.8, 3.0.
* **Data Series:** A single blue line.
* **Additional Element:** A horizontal dashed black line at approximately y = 2.3.
### Detailed Analysis
**1. Sparsity (Left Chart):**
* **Trend:** The line begins at a high sparsity value (≈0.2-0.3) and remains stable until approximately 50,000 steps. It then undergoes a steep, near-vertical drop to a minimum near `10^-2` (0.01) around 75,000 steps. Following this, there is a sharp rebound, forming a peak of ≈0.03-0.04 at 100,000 steps. After this peak, the sparsity decays exponentially, stabilizing at a low, noisy baseline between `10^-2` and `2x10^-2` from 200,000 steps onward.
* **Key Points (Approximate):**
* Start (0 steps): ~0.25
* Minimum (~75k steps): ~0.01
* Local Peak (100k steps): ~0.035
* End (400k steps): ~0.015 (with noise)
**2. Regularization Strength (Middle Chart):**
* **Trend:** The line starts near zero and remains low until ≈50,000 steps. It then rises sharply, forming a first major peak of ≈2000 at 100,000 steps. Immediately after this peak, the value plummets back to near zero. From ≈125,000 steps, it begins a second, more sustained ascent, characterized by high-frequency noise/fluctuations. This second rise continues throughout the training, reaching its maximum value of ≈3100-3200 by 400,000 steps.
* **Key Points (Approximate):**
* First Peak (100k steps): ~2000
* Trough (~110k steps): ~50
* Value at 200k steps: ~2500
* End (400k steps): ~3150
**3. Validation Cross Entropy (Right Chart):**
* **Trend:** The line is remarkably stable, hovering just below the dashed reference line (≈2.3) for almost the entire training duration. The most prominent feature is an extreme, narrow spike where the cross-entropy shoots up to the maximum y-axis value of 3.0 precisely at 100,000 steps. It immediately returns to its baseline level. A very slight, temporary dip is visible just before the spike.
* **Key Points (Approximate):**
* Baseline (most of training): ~2.28 - 2.30
* Spike Peak (100k steps): 3.0
* Dashed Reference Line: ~2.3
### Key Observations
1. **Synchronized Event at 100,000 Steps:** All three metrics exhibit a dramatic, simultaneous change at exactly 100,000 training steps. Sparsity peaks, Regularization Strength peaks and then crashes, and Validation Cross Entropy spikes to its maximum.
2. **Two-Phase Regularization:** The Regularization Strength plot shows two distinct phases: an initial sharp increase and reset, followed by a noisier, sustained increase.
3. **Stable Validation Performance:** Despite the dramatic internal changes (sparsity, regularization), the model's validation loss (cross entropy) remains largely unaffected, except for the single anomalous spike. The dashed line suggests a target or baseline performance level that is consistently met.
4. **Sparsity Dynamics:** The model's sparsity is highly dynamic early in training, undergoing a collapse and recovery before settling into a low, stable regime.
### Interpretation
This set of charts likely visualizes the training dynamics of a neural network employing a **dynamic or adaptive regularization technique**, possibly related to pruning or sparsity induction (e.g., a method that adjusts regularization strength based on gradient statistics or model weights).
* **The 100k-Step Event:** The synchronized anomaly strongly suggests a planned intervention or a phase transition in the training algorithm. This could be:
* A scheduled change in the learning rate or optimization hyperparameters.
* The activation or deactivation of a specific regularization component.
* A "reset" or "re-evaluation" point in an adaptive algorithm, where the regularization strength is recalculated, causing a temporary disruption (the cross-entropy spike) before recovery.
* **Relationship Between Metrics:** The data implies a causal relationship. The initial rise in Regularization Strength (50k-100k steps) correlates with the collapse in Sparsity. The event at 100k steps resets the regularization, allowing sparsity to recover slightly. The subsequent sustained increase in regularization does not further reduce sparsity, suggesting the model has reached a stable sparse state. The validation loss remains robust, indicating these internal adjustments do not harm generalization (outside the transient spike).
* **Purpose of the Dashed Line:** The horizontal dashed line in the Validation Cross Entropy chart serves as a performance benchmark. The model's ability to stay at or below this line for nearly all steps (except the spike) demonstrates successful training relative to that target.
**In summary, the image documents a training run where an adaptive regularization mechanism is actively modulating model sparsity. A critical algorithmic event at 100,000 steps causes a temporary loss spike but is followed by recovery and continued stable training, ultimately maintaining validation performance at a desired benchmark level.**
</details>
Figure 11: The training curves for post-training OLMo-7B, tracking the model sparsity (left), regularisation strength (middle), and the cross-entropy loss (right). The black dotted line on the cross-entropy plot indicates the pre-defined threshold, $\tau$ .
A key feature of our post-training framework is that the strength of the sparsity regularisation is controlled automatically via a constrained optimisation scheme. By pre-specifying an acceptable cross-entropy target, $\tau$, the training procedure can be written as the max-min objective:
$$
\max_{\lambda>0}\min_{\theta}\bigg[\sum_{l}\mathbb{E}\big[|A_{l}|\big]+\lambda\,(\mathrm{CE}-\tau)\bigg], \tag{7}
$$
which can be optimised by taking alternating gradient steps in the model weight space and in the $\lambda$ space. The resulting dynamics are such that the sparsity regularisation strength increases when the model cross-entropy is below the target and decreases when it exceeds the threshold. Figure 11 shows the training curves for the OLMo-7B model. Here, we observe that the strength of the sparsity regularisation keeps increasing slowly while the model cross-entropy is held at the desired level. Note that during a loss spike (at around 100K steps), the sparsity regularisation automatically decreases to let the model recover.
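The alternating update on $\lambda$ can be sketched as follows. The signs match the dynamics described in this section (the regularisation strength rises while the cross-entropy is below $\tau$ and falls when it exceeds $\tau$); the step size, the function name, and the clipping at zero are illustrative assumptions rather than the exact implementation:

```python
def update_dual(lmbda, ce, tau, eta=0.01):
    """One dual step on the sparsity regularisation strength lambda.

    Raises the sparsity pressure while the cross-entropy `ce` is below the
    target `tau`, and relaxes it when `ce` overshoots (e.g. during a loss
    spike), letting the model recover. Clipping keeps lambda >= 0.
    """
    return max(0.0, lmbda + eta * (tau - ce))
```

In training, this step alternates with a gradient step on the model weights under the combined loss; Figure 11 shows the resulting trajectory, including the automatic decrease of the regularisation strength during the loss spike near 100K steps.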
## Appendix D Extra Experiments for Circuit Discovery
In this section, we provide additional results for the activation patching circuit discovery experiment presented in the main text.
Figure 12 shows the attention patterns of the heads required to explain 90% of model behaviour on a copy task. To fully test the longer context window afforded by OLMo, we use a longer prompt than the one used for GPT-2 in the main text. The result is consistent with the GPT-2 experiment: the sparsified model facilitates the discovery of smaller circuits of induction heads that implement the copy task.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Diagram: Grids of Small Attention-Pattern Plots
### Overview
The image shows two side-by-side panels, each containing a grid of small square plots on a white background. The left panel is larger and denser; the right panel is smaller. Each small plot carries an alphanumeric identifier in its top-left corner (of the form "L04xx", consistent with layer-and-head identifiers), but the plotted content itself is too faint to read at this resolution.
### Components
* **Panel Layout:** Two bordered rectangular panels with rounded corners. The panel titles are not cleanly legible in this rendering; given the figure caption, they most plausibly identify the base and sparse OLMo-7B models.
* **Grids:** The left grid contains on the order of 100-200 thumbnails (roughly 12 columns); the right grid is roughly 10 by 10. The apparent label sequence does not match the visual cell count in the left panel, suggesting either paired sub-plots per labelled item or repeated labels.
* **Plot Content:** Faint light-blue traces are visible in each cell, but no axes, scales, or data values can be reliably extracted, and there is no legend.
### Key Observations
1. **Small-Multiples Organization:** The image is a composite of two "small multiples" displays, a technique used to compare many similar charts side-by-side.
2. **Labeling System:** A clear, sequential labeling system ("L04xx") indexes the individual plots.
3. **Visual Data Inaccessibility:** The critical data (the trends and patterns within each small plot) is lost to the image's resolution and compression; the thumbnails serve as an index rather than as a source of quantitative information.
### Interpretation
Read together with the figure caption, the panels show per-head attention patterns on the long copy task, one thumbnail per attention head, with the sparse model requiring substantially fewer active heads than the base model. Because the embedded plots are illegible at this resolution, the image conveys the scale and organisation of the comparison rather than quantitative values.
</details>
Figure 12: Attention patterns of the heads required to explain 90% of model behaviour on a longer copy task. Similar to the GPT-2 results in Figure 3, the sparse model requires substantially fewer attention heads.
Figures 13 and 14 show the fraction of explained model preference as a function of the number of model components kept un-ablated. The difference between these plots and Figures 4 and 5 lies in how individual model components are ranked. Here, the ranking is performed at the task level, meaning that the importance score for each component is pooled across different instances of the same task. Overall, the results are consistent with those presented in the main paper: the ranking strategy consistently discovers smaller circuits in sparse models. The only exception is the Greater Than task for GPT-2, where the number of attention heads required by the sparse model is larger than that of the base model. We hypothesise that this is because the sparse model chooses different circuits to implement different instances of the same task, rendering the task-level importance score less suitable for circuit discovery in this case. Finally, in Figure 15, we provide a qualitative visualisation of the edges required to complete the IOI task.
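A schematic of the task-level ranking used here (importance scores pooled across instances of a task before ranking) might look as follows. Treating the kept components' effects as additive is a simplification of the actual patching-based evaluation, and all function names are illustrative:

```python
def task_level_ranking(instance_scores):
    """Pool per-instance importance scores (mean over task instances),
    then rank components from most to least important."""
    n_components = len(instance_scores[0])
    pooled = [sum(inst[i] for inst in instance_scores) / len(instance_scores)
              for i in range(n_components)]
    return sorted(range(n_components), key=lambda i: -pooled[i])

def explained_effect_curve(ranking, effects):
    """Cumulative fraction of total effect as the top-k components are kept,
    assuming (for illustration only) that per-component effects add up."""
    total = sum(effects)
    curve, running = [], 0.0
    for i in ranking:
        running += effects[i]
        curve.append(running / total)
    return curve
```

The curves in Figures 13 and 14 are read off such cumulative fractions, with the x-axis counting the kept heads (Figure 13) or edges (Figure 14).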
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Charts: Model Performance Comparison (Explained Effect vs. Number of Heads Kept)
### Overview
The image displays four separate line charts arranged horizontally, comparing the performance of standard and sparse versions of two language models (GPT-2 and OLMo-7B) across four different evaluation tasks. Each chart plots the "Explained Effect" (y-axis) against the "Number of Heads Kept" (x-axis), showing how model performance changes as more attention heads are retained. The charts include annotations indicating the performance multiplier (e.g., "2.0x") of the sparse model relative to the standard model at a specific point.
### Components/Axes
* **Common Y-Axis (All Charts):** Label: "Explained Effect". Scale: Linear, ranging from 0.0 to 1.0, with major ticks at 0.0, 0.5, and 1.0.
* **X-Axis (Varies by Chart):** Label: "Number of Heads Kept". Scale: Linear. The range differs:
* "Greater Than" & "IOI": 0 to ~120 (ticks at 0, 50, 100).
* "Docstring" & "IOI Long": 0 to 1000 (ticks at 0, 250, 500, 750, 1000).
* **Legends:**
* **Left Two Charts ("Greater Than", "IOI"):** Located in the bottom-right corner of each plot area.
* Blue Line: "GPT-2"
* Orange Line: "Sparse GPT-2"
* **Right Two Charts ("Docstring", "IOI Long"):** Located in the bottom-right corner of each plot area.
* Green Line: "OLMo-7B"
* Pink Line: "Sparse OLMo-7B"
* **Annotations:** Each chart contains a horizontal dashed black line with a text label indicating a multiplier (e.g., "0.6x"). This line connects points on the two curves to highlight the relative performance difference.
### Detailed Analysis
**1. Chart: "Greater Than"**
* **Trend Verification:** Both lines show a steep, sigmoidal increase from near 0.0, plateauing near 1.0.
* **Data Points & Annotation:**
* The dashed line is positioned at approximately y=0.85.
* It connects the two curves at this level, with the blue line (GPT-2) reaching it at a smaller number of heads than the orange line (Sparse GPT-2).
* Annotation: "0.6x". Unlike the other tasks, this multiplier is below 1: at this performance level the Sparse GPT-2 model requires roughly 1/0.6 ≈ 1.7 times as many heads as the standard GPT-2 model.
* Both models converge to an Explained Effect of ~0.95-1.0 by x=100.
**2. Chart: "IOI"**
* **Trend Verification:** Both lines increase from near 0.0. The orange line (Sparse GPT-2) maintains a significant lead over the blue line (GPT-2) throughout the ascent before both plateau.
* **Data Points & Annotation:**
* The dashed line is positioned at approximately y=0.9.
* It connects the blue line (GPT-2) at x ≈ 100 to the orange line (Sparse GPT-2) at x ≈ 50.
* Annotation: "2.0x". This indicates that the Sparse GPT-2 model achieves the same explained effect (~0.9) using only about 50% (1/2.0x) of the heads required by the standard GPT-2 model.
* Sparse GPT-2 reaches near-maximum performance (~1.0) by x=75, while GPT-2 reaches it by x=120.
**3. Chart: "Docstring"**
* **Trend Verification:** Both lines show a very sharp, almost step-like increase from near 0.0 to 1.0. The pink line (Sparse OLMo-7B) is slightly to the left (more efficient) of the green line (OLMo-7B).
* **Data Points & Annotation:**
* The dashed line is positioned at approximately y=0.95.
* It connects the green line (OLMo-7B) at x ≈ 350 to the pink line (Sparse OLMo-7B) at x ≈ 270.
* Annotation: "1.3x". This indicates that the Sparse OLMo-7B model achieves the same high explained effect (~0.95) using about 77% (1/1.3x) of the heads required by the standard OLMo-7B model.
* Both models reach a plateau of 1.0 by x=500.
**4. Chart: "IOI Long"**
* **Trend Verification:** Similar to "Docstring", both lines exhibit a sharp, step-like increase. The pink line (Sparse OLMo-7B) again leads the green line (OLMo-7B).
* **Data Points & Annotation:**
* The dashed line is positioned at approximately y=0.9.
* It connects the green line (OLMo-7B) at x ≈ 400 to the pink line (Sparse OLMo-7B) at x ≈ 250.
* Annotation: "1.6x". This indicates that the Sparse OLMo-7B model achieves the same explained effect (~0.9) using about 63% (1/1.6x) of the heads required by the standard OLMo-7B model.
* Both models converge to 1.0 by x=750.
### Key Observations
1. **Consistent Superiority of Sparse Models:** In all four tasks, the sparse model variant (orange or pink line) achieves any given level of "Explained Effect" with fewer attention heads than its standard counterpart (blue or green line). The curves for sparse models are always to the left/above.
2. **Task-Dependent Efficiency Gains:** The magnitude of the efficiency gain (the multiplier) varies significantly by task. The gain is most pronounced on the "IOI" task for GPT-2 (2.0x) and least pronounced on the "Greater Than" task (0.6x, which is actually a slowdown, meaning Sparse GPT-2 needed *more* heads at that specific performance point).
3. **Different X-Axis Scales:** The GPT-2 models (left charts) are evaluated on keeping up to ~120 heads, while the OLMo-7B models (right charts) are evaluated on a scale up to 1000 heads, suggesting OLMo-7B has a much larger total number of attention heads.
4. **Performance Plateau:** All models eventually reach an Explained Effect of 1.0 (or very near it), indicating that with enough heads, full performance is recovered. The sparse models reach this plateau earlier.
### Interpretation
These charts demonstrate the effectiveness of a "sparsification" technique applied to language models. The technique aims to identify and retain only the most important attention heads, allowing for model compression or more efficient inference without sacrificing performance on specific tasks.
* **What the data suggests:** The sparse models are more "head-efficient." They achieve the same task performance (Explained Effect) with a smaller subset of attention heads. This implies the sparsification method successfully identifies a core set of functionally important heads.
* **How elements relate:** The "Explained Effect" metric likely measures how well the model's behavior on a task can be predicted or reconstructed using only the kept heads. The x-axis shows the cost (number of heads). The curves show the trade-off: keeping more heads improves performance, but sparse models offer a better trade-off.
* **Notable Anomalies:** The "Greater Than" task shows a 0.6x multiplier, meaning the sparse model was *less* efficient at that specific operating point (~85% performance). This could indicate that for this particular, possibly simpler task, the sparsification heuristic was suboptimal, or that the standard model's heads are more uniformly useful. However, the sparse model still reaches full performance eventually.
* **Broader Implication:** The varying multipliers (0.6x to 2.0x) highlight that the benefit of sparsification is highly task-dependent. A model compressed for one task (like IOI) may see dramatic efficiency gains, while gains on another task (like Greater Than) may be minimal or even negative at certain performance thresholds. This underscores the importance of task-aware model compression.
</details>
Figure 13: Logit attribution per sentence, keeping only the top-k attention heads based on a global ranking score. The dotted line annotates the number of attention heads needed to explain 90% of the logit difference. With the exception of the Greater Than task for GPT-2, the sparse models admit smaller circuits.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Charts: Model Sparsity Efficiency Comparison
### Overview
The image displays four separate line charts arranged horizontally, comparing the performance of standard versus sparse versions of two language models (GPT-2 and OLMo-7B) across four different tasks or benchmarks. Each chart plots "Explained Effect" against the "Number of Edges Kept" on a logarithmic scale, demonstrating how model performance (explained effect) improves as more edges (likely representing model parameters or connections) are retained. The sparse model variants consistently achieve higher explained effect with fewer edges.
### Components/Axes
* **Titles (Top of each chart, left to right):** "Greater Than", "IOI", "Docstring", "IOI Long".
* **Y-Axis (All charts):** Label: "Explained Effect". Scale: Linear, from 0.0 to 1.0, with major ticks at 0.0, 0.5, and 1.0.
* **X-Axis (All charts):** Label: "Number of Edges Kept (log scale)". Scale: Logarithmic.
* Charts 1 & 2 ("Greater Than", "IOI"): Range from 10⁰ to 10⁴.
* Charts 3 & 4 ("Docstring", "IOI Long"): Range from 10⁰ to 10⁵.
* **Legends:**
* Charts 1 & 2: Located in the bottom-right corner. Contains two entries: a blue line labeled "GPT-2" and an orange line labeled "Sparse GPT-2".
* Charts 3 & 4: Located in the bottom-right corner. Contains two entries: a green line labeled "OLMo-7B" and a pink line labeled "Sparse OLMo-7B".
* **Annotations:** Each chart contains a horizontal dashed black line near the top, connecting the plateau points of the two curves. Above this line is a text annotation indicating a multiplier (e.g., "41.9x").
### Detailed Analysis
**Chart 1: Greater Than**
* **Trend Verification:** The blue line (GPT-2) shows a gradual, sigmoidal increase from near 0.0 at 10⁰ edges to 1.0 at approximately 10⁴ edges. The orange line (Sparse GPT-2) rises much more steeply, reaching 1.0 at just above 10² edges.
* **Data Points & Annotation:** The dashed line and annotation "41.9x" indicate that the Sparse GPT-2 model achieves the same maximum explained effect (1.0) using approximately 41.9 times fewer edges than the standard GPT-2 model.
**Chart 2: IOI**
* **Trend Verification:** Similar pattern to Chart 1. The blue line (GPT-2) increases gradually. The orange line (Sparse GPT-2) increases more rapidly.
* **Data Points & Annotation:** The annotation "14.9x" signifies that Sparse GPT-2 reaches peak performance with about 14.9 times fewer edges than GPT-2 on the IOI task.
**Chart 3: Docstring**
* **Trend Verification:** The green line (OLMo-7B) shows a steady increase. The pink line (Sparse OLMo-7B) has a steeper slope, indicating faster performance gain per edge added.
* **Data Points & Annotation:** The annotation "5.5x" shows Sparse OLMo-7B is 5.5 times more edge-efficient than standard OLMo-7B for the Docstring task.
**Chart 4: IOI Long**
* **Trend Verification:** The green line (OLMo-7B) and pink line (Sparse OLMo-7B) follow similar trajectories to Chart 3, with the sparse variant maintaining a lead.
* **Data Points & Annotation:** The annotation "3.1x" indicates a 3.1x edge efficiency advantage for Sparse OLMo-7B over OLMo-7B on the IOI Long task.
### Key Observations
1. **Consistent Superiority of Sparse Models:** In all four tasks, the sparse model variant (orange or pink line) achieves a higher "Explained Effect" at any given number of edges compared to its standard counterpart (blue or green line).
2. **Varying Efficiency Gains:** The magnitude of the efficiency gain (the multiplier) varies significantly by task and model. It is highest for GPT-2 on the "Greater Than" task (41.9x) and lowest for OLMo-7B on the "IOI Long" task (3.1x).
3. **Task/Model Dependency:** The efficiency gap appears larger for the GPT-2 model pair (Charts 1 & 2) than for the OLMo-7B pair (Charts 3 & 4) across the presented tasks.
4. **Performance Ceiling:** All models eventually reach an Explained Effect of 1.0, but the sparse models reach this ceiling at a much lower edge count.
### Interpretation
These charts examine the efficiency of model sparsification techniques. The "Explained Effect" likely measures how well a simplified (sparse) model can replicate the behavior or performance of a full model. The data demonstrates a clear and significant finding: **pruning a model to keep only a subset of its edges (sparsification) does not merely maintain performance but does so with dramatically greater parameter efficiency.**
The relationship shown is a classic efficiency frontier: sparse models dominate the standard models on a plot of performance vs. resource (edges) usage. The varying multipliers (41.9x, 14.9x, etc.) suggest that the benefit of sparsification is not uniform. It is highly dependent on both the base model architecture (GPT-2 vs. OLMo-7B) and the specific cognitive task being evaluated (e.g., "Greater Than" vs. "IOI Long"). This implies that some tasks or model structures are more amenable to compression than others. The anomaly to note is the significant drop in the efficiency multiplier for the OLMo-7B model on the "IOI Long" task compared to its performance on "Docstring," suggesting the long-context IOI task may rely on distributed knowledge that is harder to sparsify effectively. Overall, the visual evidence strongly advocates for the use of sparse models as a means to achieve high performance with a fraction of the computational footprint.
</details>
Figure 14: Logit attribution per sentence when keeping only the top-k attention edges based on a global ranking score. The dotted line marks the number of attention heads needed to explain 90% of the logit difference.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Diagram: Neural Network Connectivity Comparison
### Overview
The image is a side-by-side comparison diagram illustrating the connectivity patterns between two versions of the GPT-2 language model architecture. It visually contrasts a "Sparse GPT-2" model with a standard "GPT-2 (baseline)" model, highlighting differences in the density of connections between nodes across layers.
### Components/Axes
* **Titles:**
* Left Panel: **"Sparse GPT-2"**
* Right Panel: **"GPT-2 (baseline)"**
* **Directional Indicator:** A vertical arrow on the far left, pointing downward, is labeled **"Layers"**. This indicates the flow of information or the hierarchical structure from top (likely input/earlier layers) to bottom (likely output/deeper layers).
* **Visual Elements:**
* **Nodes:** Represented by small circles arranged in a grid pattern (approximately 10 columns by 12 rows in each panel). Some nodes are filled (solid gray), while others are outlined (white interior with a gray border).
* **Connections:** Gray lines of varying opacity connect the nodes between different rows (layers).
### Detailed Analysis
* **Spatial Layout:** The diagram is split into two distinct, equally sized rectangular panels. The "Layers" arrow is positioned to the left of the "Sparse GPT-2" panel.
* **Node Distribution:** Both panels show an identical grid layout of nodes. The pattern of filled vs. outlined nodes appears similar between the two panels, suggesting the underlying node architecture is the same.
* **Connection Density - Primary Contrast:**
* **Sparse GPT-2 (Left Panel):** Connections are relatively few. Many nodes, especially in the upper half, have no visible connections. Connections that do exist are often isolated or form small, localized clusters. The overall visual impression is one of significant sparsity.
* **GPT-2 (baseline) (Right Panel):** Connections are extremely dense, forming a complex, tangled web. Nearly every node in the lower two-thirds of the grid is connected to multiple nodes in the rows above and below it. The density is so high that individual lines are difficult to trace, creating a gray mass, particularly in the central and lower regions.
### Key Observations
1. **Dramatic Sparsity Difference:** The most striking feature is the orders-of-magnitude difference in connection density between the two models. The baseline model is densely interconnected, while the sparse model has had the vast majority of its connections removed.
2. **Layer-wise Pattern:** In the sparse model, connections appear more prevalent in the lower (deeper) layers compared to the upper (earlier) layers. The baseline model shows high density throughout, but it is most intense in the middle and lower sections.
3. **Node Activity:** The pattern of filled vs. outlined nodes is consistent across both diagrams, implying that the sparsification process affects connections (edges) rather than the nodes themselves.
### Interpretation
This diagram is a powerful visual metaphor for **model pruning** or **sparsification** in neural networks.
* **What it Demonstrates:** It shows the structural result of applying a sparsification technique (such as the sparse post-training of GPT-2 referenced in the panel title) to a dense, baseline transformer model (GPT-2). The technique identifies and removes a large percentage of synaptic connections (weights) that are deemed less critical for the model's performance.
* **Relationship Between Elements:** The "Layers" arrow establishes the hierarchical, feed-forward nature of the network. The comparison argues that a significant portion of the connections in the original, dense baseline model are redundant or unnecessary.
* **Implications:** The sparse model likely represents a more computationally efficient version that requires less memory and fewer operations for inference, potentially with minimal loss in accuracy. The visual starkness suggests the potential for extreme compression. The diagram doesn't show performance metrics, so the trade-off between sparsity and model capability is implied but not quantified here. The core message is the feasibility of achieving a radically simpler network structure from a complex starting point.
</details>
Figure 15: An example of the attention-head edges required to reach 0.9 cumulative score based on the averaged scores for the IOI task.
## Appendix E Circuit Discovery Tasks
In the following, we provide the details and the prompts for the various tasks used in section 4.2.
### E.1 Greater-Than Task
Each example contains a clean prompt, a corrupt prompt, and two disjoint sets of candidate continuations, `answers` and `wrong_answers`. A typical entry is:
```json
{
  "clean": "The demonstrations lasted from the year 1363 to 13",
  "corrupt": "The demonstrations lasted from the year 1301 to 13",
  "answers": ["64", "65", ..., "99"],
  "wrong_answers": ["00", "01", ..., "63"]
}
```
For the clean prompt, any token in `answers` yields an end year strictly greater than the start year (e.g. "1364" to "1399"), whereas tokens in `wrong_answers` correspond to years that are less than or equal to the start year. The corrupt prompt changes only the starting year, shifting which continuations correspond to valid end years. We use the logit difference between the aggregated probability mass on `answers` vs. `wrong_answers` in clean vs. corrupt contexts as our signal, in the spirit of prior mechanistic studies on simple algorithmic tasks (Elhage et al., 2021; Nanda et al., 2023).
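The construction of one such example can be sketched in a few lines. This is our own illustration of the data format described above (the helper name `make_greater_than_example` and the sampling scheme are assumptions, not the authors' generation code):

```python
import random

def make_greater_than_example(rng):
    """Build one Greater-Than entry in the format above (illustrative sketch)."""
    yy = rng.randint(2, 98)  # last two digits of the clean start year
    clean = f"The demonstrations lasted from the year 13{yy:02d} to 13"
    corrupt = "The demonstrations lasted from the year 1301 to 13"
    # valid end years are strictly greater than the start year
    answers = [f"{d:02d}" for d in range(yy + 1, 100)]
    wrong_answers = [f"{d:02d}" for d in range(yy + 1)]
    return {"clean": clean, "corrupt": corrupt,
            "answers": answers, "wrong_answers": wrong_answers}
```

By construction, `answers` and `wrong_answers` are disjoint and together cover all two-digit continuations "00" through "99".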
### E.2 Indirect Object Identification (IOI) Task
Our IOI setup follows the standard indirect object identification paradigm for mechanistic interpretability (Elhage et al., 2021; Conmy et al., 2023). Each example is generated by combining:
- a pair of names $(A,B)$ , e.g. (" Mary", " John");
- a natural-language template with placeholders [A], [B], and [S].
We instantiate templates such as:
- "Then, [B] and [A] went to the park. [S] gave a ball to"
- "When [B] and [A] got a snack at the cafe, [S] decided to give it to"
- "After the lunch, [B] and [A] went to the mall. [S] gave a gift to"

Each instance is produced by sampling a name pair, substituting $[A]$ and $[B]$ , and then choosing the subject $[S]$ (either member of the pair). The correct continuation is the indirect object, i.e. the other member of the pair.
For example, with $(A,B)=(\texttt{" John"},\texttt{" Mary"})$ and $S=B$ , one instance is:
Then, Mary and John went to the park. Mary gave a ball to
The correct continuation is " John", while " Mary" and any distractor names are treated as incorrect candidates.
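The instantiation procedure can be sketched as follows (the helper `make_ioi_example` and the exact sampling details are our own illustration; leading spaces in the names reflect tokenization, as in the example above):

```python
import random

def make_ioi_example(template, name_pair, rng):
    """Instantiate one IOI example from a template with [A], [B], [S] placeholders."""
    a, b = name_pair
    s = rng.choice([a, b])                 # subject: either member of the pair
    indirect_object = b if s == a else a   # correct answer: the other member
    prompt = (template.replace("[B]", b)
                      .replace("[A]", a)
                      .replace("[S]", s))
    return prompt, indirect_object

template = "Then,[B] and[A] went to the park.[S] gave a ball to"
prompt, answer = make_ioi_example(template, (" John", " Mary"), random.Random(0))
```

Whichever name is chosen as the subject appears twice in the prompt, while the indirect object (the correct answer) appears exactly once.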
In the OLMo experiments, in order to further test the capability of our approach, we use a different set of IOI tasks with increased complexity and prompt length. Example templates include:
- "After several months without any contact due to conflicting schedules and unexpected personal obligations, [B] and [A] finally met again at the park, where they spent a long afternoon catching up on past events, sharing stories, and reflecting on how much had changed. As the day came to an end, [S] gave a ball to"
- "Although [B] and [A] had previously been involved in a long and emotionally charged argument that left several issues unresolved, they agreed to meet in order to clarify their misunderstandings. After a tense but honest conversation, [S] said to"
### E.3 Docstring Task
We also test the OLMo models on a more complex Docstring task (Heimersheim and Janiak, 2023; Conmy et al., 2023), where the model needs to attend to a specific argument of a specified function in order to complete a docstring. As in the Greater-Than task, each example contains a clean prompt, a corrupt prompt, and two disjoint sets of candidate continuations. A typical entry is:

```json
{
  "clean": "def model(self, results, old, option): \"\"\" stage agency security vision spot tone joy session river unit :param results: bone paper selection sky :param old: host action hell miss :param",
  "corrupt": "def model(self, command, output, state): \"\"\" stage agency security vision spot tone joy session river unit :param old: bone paper selection sky :param results: host action hell miss :param",
  "answers": [" option"],
  "wrong_answers": [" results", " old"]
}
```
## Appendix F Cross-Layer Transcoder
| Category | Setting |
| --- | --- |
| Model | GPT-2 (HookedTransformer) |
| Input dimension ( $d_{\text{in}}$ ) | 768 |
| Latent dimension ( $d_{\text{latent}}$ ) | 24 576 |
| Expansion factor | 32 |
| Context size | 64 |
| Batch size (tokens) | 1 024 |
| Precision | Mixed (FP32 / AMP) |
| Device | CUDA |
| Distributed training | DDP |
| Optimizer | Adam |
| Learning rate | $2\times 10^{-4}$ |
| Adam $\beta_{1}$ / $\beta_{2}$ | 0.9 / 0.999 |
| Learning rate warm-up | Cosine (1 000 steps) |
| Learning rate decay steps | 1 874 |
| Final LR scale | 0.1 |
| $L_{0}$ coefficient | 2 |
| Optimal $L_{0}$ | 3 |
| $L_{0}$ warm-up | Linear (18 749 steps) |
| Dead feature penalty | $10^{-5}$ |
| Dead feature window | 250 |
Table 3: Training configuration for the GPT-2 cross-layer transcoders.
To implement a cross-layer transcoder, let $\mathbf{h}_{\ell}\in\mathbb{R}^{d_{\text{model}}}$ denote the input to the MLP at layer $\ell$ for a single token position. This representation is projected into a sparse feature space via an encoder,
$$
\mathbf{z}_{\ell}=\mathrm{ReLU}\!\left(\mathbf{W}_{\mathrm{enc}}^{\ell}\mathbf{h}_{\ell}+\mathbf{b}_{\mathrm{enc}}^{\ell}\right)\in\mathbb{R}^{d_{\text{features}}}, \tag{8}
$$
where $\mathbf{W}_{\mathrm{enc}}^{\ell}\in\mathbb{R}^{d_{\text{features}}\times d_{\text{model}}}$ and $\mathbf{b}_{\mathrm{enc}}^{\ell}\in\mathbb{R}^{d_{\text{features}}}$ are layer-specific encoder parameters.
The CLT reconstructs the MLP output at a target layer $\ell^{\prime}$ by linearly aggregating feature activations originating from all preceding layers,
$$
\hat{\mathbf{m}}_{\ell^{\prime}}=\sum_{\ell\leq\ell^{\prime}}\mathbf{W}_{\mathrm{dec}}^{\ell\rightarrow\ell^{\prime}}\mathbf{z}_{\ell}+\mathbf{b}_{\mathrm{dec}}^{\ell^{\prime}}, \tag{9}
$$
where $\mathbf{W}_{\mathrm{dec}}^{\ell\rightarrow\ell^{\prime}}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{features}}}$ denotes the decoder mapping from layer $\ell$ to layer $\ell^{\prime}$ .
The summation over layers reflects the fact that a given semantic feature may manifest in different representations across multiple MLP layers. For example, a feature that emerges in the MLP at layer $\ell$ may reappear, potentially in a transformed form, in the outputs of subsequent MLPs. Without accounting for these layer-dependent variations, such duplicated representations would lead to redundant nodes in the attribution graph. By allowing features to be represented differently across layers while being linked through a shared latent space, the cross-layer transcoder avoids this duplication and yields a more compact and interpretable attribution structure. For a detailed comparison between cross-layer transcoders and standard transcoders, we refer the reader to Lindsey et al. (2025a).
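For concreteness, the encoder of Equation (8) and the cross-layer decoder of Equation (9) can be sketched in plain Python. The helper names and nested-list representation are ours; an actual implementation would use a tensor library:

```python
def relu(v):
    """Elementwise ReLU."""
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def clt_forward(h, W_enc, b_enc, W_dec, b_dec):
    """Cross-layer transcoder forward pass, Eqs. (8)-(9).

    h[l]               : MLP input at layer l            (d_model,)
    W_enc[l], b_enc[l] : layer-specific encoder          (d_feat x d_model), (d_feat,)
    W_dec[(l, lp)]     : decoder from layer l to lp      (d_model x d_feat), l <= lp
    b_dec[lp]          : decoder bias at target layer lp (d_model,)
    """
    L = len(h)
    # Eq. (8): sparse feature activations per layer
    z = [relu(vadd(matvec(W_enc[l], h[l]), b_enc[l])) for l in range(L)]
    # Eq. (9): reconstruct each MLP output from features of all layers l <= lp
    m_hat = []
    for lp in range(L):
        out = list(b_dec[lp])
        for l in range(lp + 1):
            out = vadd(out, matvec(W_dec[(l, lp)], z[l]))
        m_hat.append(out)
    return z, m_hat
```

Note that the reconstruction at the deepest layer accumulates feature contributions from all earlier layers, which is exactly the summation in Equation (9).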
Following the training procedure proposed by Anthropic (Ameisen et al., 2025), the final objective combines reconstruction accuracy with sparsity and dead-feature regularization:
$$
\mathcal{L}=\underbrace{\sum_{\ell^{\prime}}\left\|\hat{\mathbf{m}}_{\ell^{\prime}}-\mathbf{m}_{\ell^{\prime}}\right\|_{2}^{2}}_{\text{MSE reconstruction}}+\lambda_{0}\underbrace{\sum_{\ell}\tanh\!\big(C\,(\mathbf{z}_{\ell}\odot\|\mathbf{W}_{\mathrm{dec}}^{\ell}\|)\big)}_{L_{0}\text{ sparsity}}+\lambda_{\mathrm{df}}\underbrace{\sum_{\ell}\mathrm{ReLU}\!\big(\exp(\tau)-\mathbf{h}_{\ell}^{\mathrm{pre}}\big)\|\mathbf{W}_{\mathrm{dec}}^{\ell}\|}_{\text{dead-feature penalty}}, \tag{10}
$$
where $\mathbf{W}_{\mathrm{dec}}^{\ell}$ denotes the concatenated decoder weights associated with layer $\ell$ , $\mathbf{h}_{\ell}^{\mathrm{pre}}$ are the corresponding pre-activation values, $\tau$ is a threshold parameter, and $C$ is a scaling constant. The hyperparameters $\lambda_{0}$ and $\lambda_{\mathrm{df}}$ control the strength of the sparsity and dead-feature regularization terms. We initialize the weights following the circuits updates of Conerly et al. (2025). The encoder bias is initialized so that a fixed proportion of the features is active at initialization. We provide in Figure 16 the training curves of the sparsity value, the sparsity coefficient, the explained variance, and the number of dead features. We hope this can help the community in training their own cross-layer transcoders.
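The linear $L_{0}$ warm-up reported in Table 3 can be sketched as a simple schedule function. This is our own minimal reading of the table entries ("$L_{0}$ coefficient: 2", "$L_{0}$ warm-up: Linear (18 749 steps)"); the actual implementation may differ:

```python
def l0_coefficient(step, max_coef=2.0, warmup_steps=18_749):
    """Linearly ramp the L0 sparsity coefficient from 0 to max_coef,
    then hold it constant for the rest of training (sketch of Table 3)."""
    return max_coef * min(step / warmup_steps, 1.0)
```

This matches the coefficient curve in Figure 16(b): a straight ramp from zero followed by a flat hold at the maximum value.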
<details>
<summary>figures/l0_vs_steps.png Details</summary>

### Visual Description
## Line Chart: L0 over Training Steps
### Overview
The image displays a single-line chart plotting a metric labeled "L0" against "Training steps (M)". The chart uses a logarithmic scale for the vertical axis (y-axis) and a linear scale for the horizontal axis (x-axis). The data shows a clear, continuous downward trend, indicating that the L0 value decreases as the number of training steps increases.
### Components/Axes
* **Chart Title:** "L0 over Training Steps" (centered at the top).
* **Y-Axis:**
* **Label:** "L0" (positioned vertically on the left side).
* **Scale:** Logarithmic (base 10).
* **Major Tick Marks & Labels:** `10^1` (10), `10^2` (100), `10^3` (1000), `10^4` (10000). The axis spans from just below 10^1 to just above 10^4.
* **X-Axis:**
* **Label:** "Training steps (M)" (centered at the bottom). The "(M)" likely denotes "Millions".
* **Scale:** Linear.
* **Major Tick Marks & Labels:** 0, 25, 50, 75, 100, 125, 150, 175, 200. The axis spans from 0 to 200 million steps.
* **Data Series:** A single, solid blue line representing the L0 metric over time. There is no legend, as only one series is present.
### Detailed Analysis
**Trend Verification:** The blue line exhibits a steep, near-vertical descent at the beginning of training, which then transitions into a progressively shallower, but still consistent, downward slope for the remainder of the charted steps.
**Approximate Data Points & Trend:**
* **At Step 0 (Start):** The line originates at a very high L0 value, approximately **8,000** (just below the 10^4 mark).
* **Initial Phase (0 to ~25M steps):** The line plummets dramatically. By step 25M, the L0 value has fallen to approximately **200** (slightly above the 10^2 mark). This represents a reduction of roughly 97.5% from the starting value.
* **Middle Phase (~25M to ~100M steps):** The rate of decrease slows but remains steady. The line passes through:
* ~50M steps: L0 ≈ **80**
* ~75M steps: L0 ≈ **40**
* ~100M steps: L0 ≈ **20** (aligning with the 10^1.3 region).
* **Late Phase (~100M to 200M steps):** The decline continues at a roughly constant, shallow slope on the log-linear plot. The line ends at 200M steps with an L0 value of approximately **4** (visibly below the 10^1 mark).
### Key Observations
1. **Logarithmic Scale Impact:** The use of a log scale on the y-axis compresses the visual representation of the massive initial drop and expands the view of the later, smaller improvements. On a linear scale, the curve would appear as an almost immediate drop followed by a long, flat tail.
2. **Two-Phase Learning:** The curve suggests two distinct phases of improvement: a rapid, initial "learning burst" followed by a prolonged period of gradual refinement.
3. **Consistent Improvement:** There are no visible plateaus, spikes, or reversals in the trend. The L0 metric improves (decreases) consistently throughout the entire 200 million training steps shown.
4. **Magnitude of Change:** The total improvement over the charted period is immense, spanning over three orders of magnitude (from ~8,000 to ~4).
### Interpretation
This chart is characteristic of a sparsity metric tracked during the training of a machine learning model: "L0" denotes the $L_{0}$ norm, i.e. the number of simultaneously active features, which the training objective drives down over time.
* **What the Data Suggests:** The model is successfully learning from the data. The steep initial drop indicates it is quickly grasping the most obvious patterns. The continued, slower decline shows it is fine-tuning its parameters to capture more subtle nuances, a process that yields diminishing returns per step but is crucial for high performance.
* **Relationship Between Elements:** The x-axis (training steps) is the independent variable representing computational effort. The y-axis (L0) is the dependent variable representing model performance (lower is better). The curve maps the efficiency of the training process.
* **Notable Implications:** The lack of a plateau by 200M steps suggests that further training might still yield (small) improvements, though the cost-benefit ratio is changing. The smoothness of the curve implies a stable training process with well-tuned hyperparameters (like learning rate). An investigator would use this plot to diagnose training health, decide when to stop training, and compare the efficiency of different model architectures or training algorithms.
</details>
(a) $L_{0}$ vs steps
<details>
<summary>figures/l0_coef_vs_steps.png Details</summary>

### Visual Description
## Line Chart: L0 Coefficient over Training Steps
### Overview
The image displays a simple line chart plotting the value of an "L0 Coefficient" against the number of training steps, measured in millions (M). The chart shows a single, continuous data series with a distinct two-phase trend: a steady linear increase followed by a plateau.
### Components/Axes
* **Chart Title:** "L0 Coefficient over Training Steps" (centered at the top).
* **Y-Axis (Vertical):**
* **Label:** "L0 Coefficient".
* **Scale:** Linear scale ranging from 0.00 to 2.00.
* **Major Tick Marks:** 0.00, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00.
* **X-Axis (Horizontal):**
* **Label:** "Training steps (M)".
* **Scale:** Linear scale ranging from 0 to 200.
* **Major Tick Marks:** 0, 25, 50, 75, 100, 125, 150, 175, 200.
* **Data Series:** A single solid blue line. There is no legend, as only one series is present.
### Detailed Analysis
The data series follows a precise, piecewise linear path:
1. **Phase 1 - Linear Increase:**
* **Trend:** The line slopes upward at a constant rate from the origin.
* **Start Point:** (0 M steps, 0.00 coefficient).
* **End Point:** The line reaches its maximum value at approximately 150 M steps.
* **Slope Calculation:** The coefficient increases from 0.00 to 2.00 over 150 M steps, yielding an approximate slope of **0.0133 coefficient units per million steps** (2.00 / 150 M).
2. **Phase 2 - Plateau:**
* **Trend:** The line becomes perfectly horizontal, indicating a constant value.
* **Start Point:** (~150 M steps, 2.00 coefficient).
* **End Point:** The line continues at this constant value to the end of the plotted range at 200 M steps.
* **Value:** The L0 Coefficient is held fixed at **2.00** from step 150 M onward.
### Key Observations
* The transition from the increasing phase to the plateau phase is sharp and occurs at a single point (~150 M steps), not a gradual curve.
* The chart depicts a perfectly deterministic schedule, not noisy experimental data. The line is straight in both segments.
* The maximum value of the L0 Coefficient is 2.00, and the minimum is 0.00 within the observed window.
* The chart contains no gridlines, annotations, or additional data markers beyond the line itself.
### Interpretation
This chart illustrates a predefined **scheduling strategy** for a hyperparameter called the "L0 Coefficient" during a model training process. The L0 norm is often associated with promoting sparsity in machine learning models (e.g., in L0 regularization). The data suggests the following training protocol:
1. **Warm-up / Gradual Introduction:** For the first 150 million training steps, the strength of the L0-related constraint or penalty (the coefficient) is gradually and linearly increased from zero to its maximum value of 2.00. This allows the model to initially learn without the constraint, which is then slowly "turned on" to guide the optimization towards a desired property (like sparsity) without destabilizing early training.
2. **Stable Application:** After 150 million steps, the coefficient is fixed at 2.00 for the remainder of the training (at least until 200 M steps). This indicates the constraint has reached its full intended strength and is maintained to finalize the model's parameters under this fixed regularization regime.
The clear, piecewise linear nature of the plot indicates this is a planned schedule, not a measured outcome. It answers the question: "How was the L0 Coefficient varied over the course of training?" The answer is a controlled ramp-up followed by a constant hold.
</details>
(b) $L_{0}$ coefficient vs steps
<details>
<summary>figures/dead_features_vs_steps.png Details</summary>

### Visual Description
## Line Chart: Dead Features over Training Steps
### Overview
The image displays a single-line chart titled "Dead Features over Training Steps." It plots the cumulative count of "Dead Features" against the number of training steps, measured in millions (M). The chart shows a clear, non-linear growth pattern over the training duration.
### Components/Axes
* **Chart Title:** "Dead Features over Training Steps" (centered at the top).
* **X-Axis (Horizontal):**
* **Label:** "Training steps (M)"
* **Scale:** Linear scale from 0 to 200.
* **Major Tick Marks:** 0, 25, 50, 75, 100, 125, 150, 175, 200.
* **Y-Axis (Vertical):**
* **Label:** "Dead Features"
* **Scale:** Linear scale from 0 to approximately 3800.
* **Major Tick Marks:** 0, 500, 1000, 1500, 2000, 2500, 3000, 3500.
* **Data Series:** A single, solid blue line representing the count of dead features over time.
* **Legend:** None present. The chart contains only one data series.
* **Background:** Plain white. No grid lines, annotations, or additional visual elements are present.
### Detailed Analysis
**Trend Verification:** The blue line exhibits three distinct phases:
1. **Initial Slow Growth (0M to ~15M steps):** The line remains near zero, showing minimal increase.
2. **Rapid, Near-Linear Increase (~15M to ~100M steps):** The line slopes steeply upward, indicating a fast accumulation of dead features.
3. **Plateau with Late Uptick (~100M to 200M steps):** The growth rate slows dramatically, forming a noisy plateau between approximately 3000 and 3100 dead features from 100M to 175M steps. After 175M steps, the line resumes a clear upward trend, ending at its highest point.
**Approximate Data Points (Visual Estimation):**
* At 0M steps: ~0 dead features.
* At 25M steps: ~250 dead features.
* At 50M steps: ~1250 dead features.
* At 75M steps: ~2250 dead features.
* At 100M steps: ~2900 dead features.
* At 125M steps: ~3050 dead features.
* At 150M steps: ~3100 dead features.
* At 175M steps: ~3100 dead features (start of final uptick).
* At 200M steps: ~3800 dead features (chart maximum).
### Key Observations
1. **Sigmoidal-like Shape:** The overall curve resembles an S-shape (sigmoid), characterized by an initial lag, a period of exponential-like growth, and a final saturation phase, though the saturation is broken by a late increase.
2. **Inflection Point:** The most significant change in growth rate occurs around 100M training steps, where the steep ascent transitions to a plateau.
3. **Late-Stage Resurgence:** The renewed growth after 175M steps is a notable deviation from a pure saturation curve, suggesting a potential change in training dynamics or model behavior in the final quarter of the observed period.
4. **Noise in Plateau:** The plateau phase (100M-175M) is not perfectly flat but shows small, random fluctuations, indicating minor variability in the dead feature count during this period.
### Interpretation
This chart likely visualizes a phenomenon in machine learning model training, where "dead features" refer to neurons or components in a neural network that have ceased to activate or contribute meaningfully (e.g., due to the "dying ReLU" problem).
* **What the data suggests:** The accumulation of dead features is not constant. It accelerates dramatically during the core learning phase (15M-100M steps), suggesting that as the model learns and specializes, a significant number of features become obsolete or inactive. The plateau indicates a period of relative stability where the number of dead features is maintained. The final uptick is critical: it may signal overtraining, a shift in the data distribution, or the model entering a new phase where previously stable features begin to die off again.
* **Relationship between elements:** The x-axis (training steps) is the independent variable driving the change in the dependent variable (dead features). The shape of the curve directly maps the lifecycle of feature utility throughout the training process.
* **Notable Anomalies:** The primary anomaly is the late-stage increase after 175M steps. In a typical saturation scenario, one would expect the curve to flatten completely. This uptick warrants investigation; it could be an artifact of the specific training run or an indicator of a meaningful model pathology emerging late in training. The initial near-zero phase also indicates a "warm-up" period before feature death becomes prevalent.
</details>
(c) Dead features vs steps
<details>
<summary>figures/explained_variance_vs_steps.png Details</summary>

### Visual Description
## Line Chart: Explained Variance over Training Steps
### Overview
The image is a single-line chart plotting the "Explained Variance" of a model or system against the number of "Training steps (M)". The chart demonstrates a non-linear relationship where the explained variance increases rapidly at the beginning of training, reaches a peak, and then gradually declines as training continues.
### Components/Axes
* **Title:** "Explained Variance over Training Steps" (centered at the top).
* **X-Axis (Horizontal):**
* **Label:** "Training steps (M)" (centered below the axis). The "(M)" indicates the unit is millions of steps.
* **Scale:** Linear scale from 0 to 200.
* **Major Tick Marks:** At 0, 25, 50, 75, 100, 125, 150, 175, and 200.
* **Y-Axis (Vertical):**
* **Label:** "Explained Variance" (rotated 90 degrees, centered to the left of the axis).
* **Scale:** Linear scale from 0.5 to 1.0.
* **Major Tick Marks:** At 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0.
* **Data Series:** A single, somewhat noisy blue line representing the explained variance metric over time. There is no legend, as only one series is plotted.
### Detailed Analysis
The data series follows a distinct three-phase trend:
1. **Rapid Ascent (0 to ~25M steps):** The line starts at approximately 0.5 (the bottom of the y-axis) at 0 steps. It exhibits a very steep, near-vertical climb, reaching an explained variance of ~0.85 by 25 million steps.
2. **Peak and Plateau (~25M to ~60M steps):** The rate of increase slows dramatically. The line curves and forms a broad peak. The maximum explained variance is achieved in this region, visually estimated at **~0.905** (±0.005). This peak appears to occur between 50 and 60 million training steps.
3. **Gradual Decline (~60M to 200M steps):** After the peak, the line begins a steady, linear decline. The slope is negative but much shallower than the initial ascent. By the end of the plotted data at 200 million steps, the explained variance has fallen to approximately **0.78** (±0.01). The line also shows increased high-frequency noise or variance in its measurements during this decline phase.
**Approximate Key Data Points:**
* At 0 M steps: Explained Variance ≈ 0.50
* At 25 M steps: Explained Variance ≈ 0.85
* At 50 M steps (near peak): Explained Variance ≈ 0.905
* At 100 M steps: Explained Variance ≈ 0.86
* At 150 M steps: Explained Variance ≈ 0.82
* At 200 M steps: Explained Variance ≈ 0.78
### Key Observations
* **Optimal Training Point:** The model's performance, as measured by explained variance, is maximized between 50 and 60 million training steps.
* **Post-Peak Decline:** The consistent decline after ~60M steps shows that continued training degrades this metric; in a sparsity-regularized setting this typically reflects the strengthening sparsity penalty rather than classic overfitting.
* **Noise Increase:** The thickness/noise of the line increases during the decline phase, suggesting the metric becomes less stable or more variable as training progresses beyond the optimum.
* **Asymmetric Curve:** The rise to peak performance is much faster than the subsequent decline.
### Interpretation
This chart illustrates the trade-off between reconstruction fidelity and sparsity during transcoder training. The initial rapid rise shows the model quickly learning the dominant patterns in the data. The subsequent decline coincides with the ramp-up of the $L_{0}$ sparsity coefficient (panel b): as the penalty strengthens, the transcoder sacrifices some explained variance in exchange for fewer active features. The increasing noise in the metric during the decline phase is consistent with this tightening constraint. Read as a pure performance curve, the plot would suggest early stopping around the 50-60 million step mark; here, the decline instead reflects the intended sparsity-fidelity trade-off.
</details>
(d) Explained variance vs steps
Figure 16: Training dynamics of the cross-layer transcoder, showing sparsity, regularization strength, dead features, and reconstruction quality over training.
## Appendix G Attribution Graph
Following Ameisen et al. (2025), we define the attribution score between feature $n$ at layer $\ell$ and position $k$ , and feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$ , as
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}=\sum_{\ell\leq s\leq\ell^{\prime}}f_{k,n}^{\ell\rightarrow s}\;J_{s,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}, \tag{11}
$$
where $f_{k,n}^{\ell\rightarrow s}$ denotes the decoder vector associated with feature $n$ projecting from layer $\ell$ to layer $s$ , $J_{s,k}^{\ell^{\prime},k^{\prime}}$ is the Jacobian mapping the MLP output at $(s,k)$ to the MLP input at $(\ell^{\prime},k^{\prime})$ , and $g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}$ is the corresponding encoder feature at layer $\ell^{\prime}$ and position $k^{\prime}$ . The sum in this equation reflects the cross-layer mapping of the cross-layer transcoder.
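A direct transcription of Equation (11) for a single feature pair at fixed positions looks as follows. This pure-Python sketch uses our own variable names and collapses the position indices; the actual computation batches this over all feature pairs:

```python
def attribution_score(f, J, g, l, lp):
    """Eq. (11): sum over intermediate layers s of g^T J[s] f[s].

    f[s] : decoder vector of the source feature projecting
           from layer l to layer s                          (d_model,)
    J[s] : Jacobian from the MLP output at layer s to the
           MLP input at the target layer lp                 (d_model x d_model)
    g    : encoder vector of the target feature at layer lp (d_model,)
    """
    score = 0.0
    for s in range(l, lp + 1):
        # J[s] @ f[s]
        Jf = [sum(J[s][i][j] * f[s][j] for j in range(len(f[s])))
              for i in range(len(g))]
        # g . (J[s] @ f[s])
        score += sum(gi * x for gi, x in zip(g, Jf))
    return score
```

The loop over `s` mirrors the summation $\sum_{\ell\leq s\leq\ell^{\prime}}$ , i.e. the cross-layer decoder contributes one term per intermediate layer.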
The Jacobian is computed during a modified forward pass in which all nonlinear operations, including normalization layers, attention mechanisms, and MLPs, are frozen using stop-gradient operations. The resulting attribution graph is pruned by retaining only those features that cumulatively explain 80% of the contribution to the final logit, and only those edges that account for 95% of the total edge-level effect. All attribution computations are performed using the circuit-tracer library (Hanna et al., 2025). For a complete description of the attribution-graph computation and pruning, we refer the reader to Ameisen et al. (2025).
For the visualization and the autointerp, we wrote our own pipeline. In Figure 17, we show a screenshot of the interface for the "The opposite of 'large' is" attribution graph. The features are colored with respect to their corresponding clusters.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Network Diagram: Cluster Relationships Across Hierarchical Levels
### Overview
The image displays a hierarchical network diagram visualizing relationships between text tokens (words/punctuation) grouped into semantic clusters, mapped across 12 levels (L0-L11). The diagram illustrates how lower-level textual elements connect to higher-level conceptual clusters, with a dense web of gray connection lines showing the flow of relationships. The primary focus is on four labeled clusters: "opposite" (green), "large" (orange), "brackets" (blue), and "say small" (pink).
### Components/Axes
**Vertical Axis (Left Side):**
- Labeled from **L0** (bottom) to **L11** (top) in ascending order.
- Represents hierarchical levels, likely indicating abstraction or processing stages.
**Horizontal Axis (Bottom):**
- Contains nine labeled token groups, each with a row of circular nodes beneath the label.
- Labels from left to right:
1. `introduction` (with a small square icon below)
2. `The` (with a small square icon below)
3. `sentence` (with a small square icon below)
4. `to` (with a small square icon below)
5. `.` (period, with a small square icon below)
6. `large` (with a small square icon below)
7. `,` (comma, with a small square icon below)
8. `is` (with a small square icon below)
9. `.` (period, with a small square icon below)
**Legend/Title (Top Center):**
- Text: `Clusters: opposite â large â brackets â say small`
- Color coding:
- `opposite`: Green
- `large`: Orange
- `brackets`: Blue
- `say small`: Pink
**Top-Right Annotation:**
- A single pink node at level **L11** labeled: `small 10.12%`
**Node Types:**
- **Colored Nodes:** Represent cluster assignments (green, orange, blue, pink).
- **White Nodes:** Unlabeled or unclustered elements.
- **Gray Connection Lines:** Dense web linking nodes across levels, indicating relationships or influence.
### Detailed Analysis
**Spatial Layout & Node Distribution:**
- **Bottom Row (L0):** All nine token groups have nodes. The `sentence` group has green nodes; the `large` group has orange nodes; the final `.` group has blue nodes. Other groups have white nodes.
- **Level L1:** Contains green nodes (above `sentence`), orange nodes (above `large`), and blue nodes (above final `.`).
- **Level L2:** Contains white nodes above `to` and blue nodes above final `.`.
- **Level L3:** Contains orange nodes above `large`.
- **Level L5:** Contains orange nodes above `large`.
- **Level L7:** Contains a single orange node above `large`.
- **Levels L8âL10:** Contain blue nodes above the final `.` group, arranged in rows of 5â6 nodes per level.
- **Level L11 (Top):** Contains the single pink node labeled `small 10.12%`.
**Connection Patterns:**
- Gray lines emanate from nodes in lower levels (especially L0âL2) and converge toward higher levels, particularly toward the blue nodes at L8âL10 and the pink node at L11.
- The density of connections increases dramatically from left to right, with the rightmost token groups (`large`, `,`, `is`, `.`) having the most connections to higher levels.
- The orange cluster (`large`) shows connections primarily to levels L1, L3, L5, and L7.
- The blue cluster (`brackets`) dominates levels L8âL10, with many nodes and dense connections.
- The pink cluster (`say small`) appears only at the apex (L11).
### Key Observations
1. **Hierarchical Clustering:** The diagram shows a clear bottom-up flow, with lower-level tokens feeding into higher-level clusters.
2. **Cluster Specialization:**
- Green (`opposite`) is confined to low levels (L0âL1).
- Orange (`large`) spans mid-levels (L1âL7).
- Blue (`brackets`) occupies high levels (L8âL10).
- Pink (`say small`) is the singular top-level cluster (L11).
3. **Asymmetry:** The right side of the diagram (tokens `large` through final `.`) is far more connected and complex than the left side.
4. **Quantitative Annotation:** The pink node includes a percentage (`10.12%`), suggesting a measured proportion or significance.
5. **Token Types:** The bottom labels include both words (`introduction`, `The`, `sentence`, `to`, `large`, `is`) and punctuation (`.`, `,`), indicating the diagram analyzes grammatical or syntactic elements.
### Interpretation
This diagram likely visualizes the output of a **text analysis or natural language processing (NLP) model** that clusters words and punctuation based on semantic or functional similarity across hierarchical layers. The levels (L0âL11) may represent layers in a neural network, stages of abstraction, or steps in a parsing algorithm.
**What the Data Suggests:**
- The model identifies four primary clusters, each associated with different levels of abstraction. The `opposite` cluster (green) operates at the most basic level, possibly handling antonym relationships or low-level contrasts. The `large` cluster (orange) functions in mid-level processing, perhaps related to size descriptors or intensifiers. The `brackets` cluster (blue) dominates higher levels, potentially managing structural or syntactic grouping (like clauses or phrases). The `say small` cluster (pink) sits at the apex, possibly representing a summary or meta-concept derived from the entire structure.
- The dense connections to the right side suggest that tokens like `large`, `,`, `is`, and `.` are more central to the model's processing than tokens on the left (`introduction`, `The`, `sentence`). This could indicate that function words and punctuation play a critical role in the clustering logic.
- The `small 10.12%` annotation implies that the top-level cluster accounts for approximately 10.12% of the model's focus, variance, or outputâa significant but minority proportion.
**Underlying Patterns:**
- The progression from green â orange â blue â pink mirrors a potential **linguistic hierarchy**: from basic contrasts (`opposite`) to descriptors (`large`) to structural elements (`brackets`) to a synthesized concept (`say small`).
- The absence of green and orange nodes at high levels suggests these clusters are specialized for lower-level tasks and do not directly contribute to the highest abstraction.
- The diagram's asymmetry may reflect the inherent structure of the analyzed text, where certain words or punctuation marks carry more weight in determining meaning or relationships.
**Anomalies & Uncertainties:**
- The exact meaning of the cluster labels (`opposite`, `large`, `brackets`, `say small`) is ambiguous without additional context. They may be arbitrary names assigned by the model or researcher.
- The percentage (`10.12%`) lacks a clear denominatorâis it the proportion of nodes, connections, variance explained, or something else?
- The small square icons below each bottom label are unlabeled and their purpose is unclear (possibly indicating token type or part-of-speech tags).
**Conclusion:**
This visualization effectively maps how textual elements are organized into functional clusters across a hierarchical processing pipeline. It highlights the importance of punctuation and function words in higher-level abstraction and suggests a structured, layered approach to text understanding. The diagram would be valuable for debugging NLP models, interpreting layer-wise representations, or communicating how a system parses language.
</details>
Figure 17: Circuit-tracing interface example for the prompt "The opposite of 'large' is '" with GPT2-sparse.
## Appendix H Graph: The opposite of 'large' is '
We obtain a replacement score of 0.82, with 459 features identified before pruning and 82 features remaining after pruning. The majority of features in the resulting attribution graph fall into four dominant clusters:
- Opposition cluster: features associated with opposition and comparison, primarily localized at the token position corresponding to opposite.
- Magnitude cluster: features related to notions of size (e.g., large, big, full, medium), predominantly located in the residual stream at the large token position.
- Bracket cluster: features that activate on tokens enclosed in brackets.
- Final-logit cluster: mainly the final logit itself and a couple of features that activate before the token "small" or related terms.
In the boxes below, we present the top activations of representative feature sets for each cluster.
**Feature 1117 ("Opposite" cluster)**
- in Washington has now adopted the wider measure of student debt outstanding. This new
- the situation in Syria, Iran and the wider region. "The
- recharged by the wider dense forests of Sanjay Van and its overflow drained
- public, with interesting accounts of Oswald's demeanor at this significant moment
- has a slightly wider range. Specifically, the Atom-powered NANO 56
- becoming part of the wider Seven Years' War in which Britain and France

**Feature 1337 ("Opposite" cluster)**
- opposite, piece of Mexico's cultural identity. I made the hour
- opposite shows, or something bigger, "where there's villains
- opposite sides of Mars in 2004 and used their instruments to discover geologic evidence
- opposite, but not anymore. Now everything he says to me is some kind
- opposite direction, and had little trouble finding space at the campsites.
- always seem to be just the opposite.
- show a growing trend to cast "no" votes, opposing
- how much salary and the occupation of the opposing forces was generally limited to mutual observation.
- work hand in hand for the purpose of opposing all movements of the thinking part
- the defense's inability to stop opposing run games. The Bills have
- ing opposing quarterbacks. The Seahawks not only had depth, they were versatile.
- to win more hand battles particularly when the opposing tackle neutralizes his initial

**Feature 901 ("Large" cluster)**
- Let's be honest: When someone advocates for large-scale Muslim
- robot provides a tragicomic reminder of why RWD needs to consider large as
- what kind of social safety nets should be in place to protect people from large
- advocates to limit the power of large, established corporations, analysts say.
- of large law firms is that they are so great that the only reason anyone
- that by scaling up tests, the method would be conducive for use on larger

**Feature 933 ("Large" cluster)**
- people healthy and anticipating health issues before they become a problem. Big Data is
- Big brown bucks with funny accents." Judy flinched at
- BIG UP UBUNTU: Ubuntu releases are named after
- industry.<|endoftext|> BIG LEAGUE: Barron's Says The
- they need to submit their content in the same way. Big enough apps
- and offering alternatives routes. Big data and optical fiber

**Feature 1004 ("Large" cluster)**
- guide said was full of drinking saloons, dime museums, small
- would have 2 mana sources next turn (unless his hand was full of fast
- 's house, it's full of adventure itself.
- statement that all German Catholics had a right to full transparency"
- glimpsing a lobby full of construction debris. The front hallway was full of
- Jokubas had recently been reading a newspaper article which was full of

**Feature 412 ("Brackets" cluster)**
- group answered either "very" or "somewhat" attached
- " except some work colleagues. Wilcox said she found it "highly unlikely
- very rare, "very likely," "high risk," she says.
- ulent. Pentagon spokesman Peter Cook said the sample was "
- on of PopMatters called the album "brilliant" and said
- arlene Lowe, described him as being "one of my biggest supporters".

**Feature 518 ("Brackets" cluster)**
- Kerry said Washington and Hanoi will "continue to have differences in opinions
- the United States will "take care of it." He told reporters after
- the legislation would "provide new enforcement tools for protecting our citizens and will help
- Gary Ross, said in a statement that the Air Force is currently "short
- said, and Syrian President Bashar al-Assad would "have to go".
- introduces politics into consumer policies," said Palmor, adding that it would "
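Top-activation lists like those above are assembled by ranking dataset contexts by how strongly the feature fires on them. A minimal sketch of this ranking step (function and variable names are illustrative, not our pipeline's API):

```python
import numpy as np

def top_activating_contexts(activations, contexts, k=5):
    """Return the k contexts on which a feature fires most strongly.

    activations: (n_contexts,) max activation of the feature per context.
    contexts:    list of text snippets aligned with `activations`.
    """
    order = np.argsort(activations)[::-1][:k]  # highest activation first
    return [(contexts[i], float(activations[i])) for i in order]

# Toy example: a feature that fires on size-related words.
contexts = ["a wider region", "the cat sat", "larger models", "hello world"]
acts = np.array([0.9, 0.0, 0.7, 0.1])
top = top_activating_contexts(acts, contexts, k=2)
# top -> [("a wider region", 0.9), ("larger models", 0.7)]
```

In practice the activations come from running the transcoder over a large corpus and taking the per-context maximum before ranking.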