# Sparse Attention Post-Training for Mechanistic Interpretability
**Authors**: Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf
## Abstract
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4\%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organised and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100× fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
*Keywords*: Machine Learning, ICML
## 1 Introduction
Scaling has driven major advances in artificial intelligence, with ever-larger models trained on internet-scale datasets achieving remarkable capabilities across domains. Large language models (LLMs) now underpin applications from text generation to question answering, yet their increasing complexity renders their internal mechanisms largely opaque (Bommasani, 2021). Methods of mechanistic interpretability have been developed to address this gap by reverse-engineering neural networks to uncover how internal components implement specific computations and behaviors. Recent advances in this area have successfully identified interpretable circuits, features, and algorithms within LLMs (Nanda et al., 2023; Olsson et al., 2022), showing that large complex models can, in part, be understood mechanistically, opening avenues for improving transparency, reliability, and alignment (Bereska and Gavves, 2024).
<details>
<summary>x1.png Details</summary>

### Visual Description
The figure contrasts a **Base Model** (top) and a **Sparse Model** (bottom), both 4-layer toy transformers processing the input sequence `3 6 + 2 8`, with placeholder tokens `?` marking the positions to be predicted. In the base model, dense, heavily overlapping blue attention lines connect the token positions across `Layer 0` to `Layer 3`. In the sparse model, only a handful of non-overlapping connections survive, with most positions disconnected. A curved arrow labelled **"Sparsity-Regularised Fine-Tuning"** links the two panels, indicating the transformation. Both models produce the correct output `0 0 0 6 4`; the sparse model does so through a small, easily traceable set of attention edges, whereas the base model's routing of information is difficult to follow.
</details>
Figure 1: Visualised attention patterns for a 4-layer toy model trained on a simple 2-digit addition task. The main idea of this work is to induce sparse attention between tokens via a post-training procedure that optimises for attention sparsity while maintaining model performance. In this example, while both models are able to correctly predict the sum, the sparse model solves the problem with a naturally interpretable circuit. Details of this toy setup and more examples are provided in Appendix A.
However, interpretability is bottlenecked by the model itself: even with sophisticated reverse-engineering techniques that can faithfully reveal internal algorithms, the underlying computations implemented by large models can still remain highly complex and uninterpretable. Circuits for seemingly simple tasks may span hundreds of interacting attention heads and MLPs with densely intertwined contributions across layers (Conmy et al., 2023), and features can influence each other along combinatorially many attention-mediated paths, complicating attention attribution (Kamath et al., 2025). To exemplify this, Figure 1 (top) illustrates the attention patterns of a small, single-head transformer trained on a simple two-digit addition task. Here, the model has learned to solve the task in a highly diffused manner, where information about each token is dispersed across all token locations, rendering the interpretation of the underlying algorithm extremely difficult even in this simple case.
The crux of the problem is that models are not incentivised to employ simple algorithms during training. In this work, we advocate for directly embedding interpretability constraints into model design in a way that induces simple circuits while preserving performance. We focus our analysis on attention mechanisms and investigate sparsity regularisation on attention patterns, originally proposed in (Lei et al., 2025), as an inductive bias. To demonstrate how sparse attention patterns can give rise to interpretable circuits, we return to the two-digit addition example: Figure 1 (bottom) shows the attention patterns induced by penalising attention edges during training. Here, the sparsity inductive bias forces the model to solve the problem with much smaller, intrinsically interpretable computation circuits.
In this work, we investigate using this sparsity regularisation scheme as a post-training strategy for pre-trained LLMs. We propose a practical method for fine-tuning existing models without re-running pretraining, offering a flexible way to induce sparse attention patterns and enhance interpretability. We show, on models of up to 7B parameters, that our proposed procedure preserves the performance of the base models on pretraining data while reducing the effective attention map to less than $0.5\%$ of its edges. To evaluate our central hypothesis that sparse attention facilitates interpretability, we consider two complementary settings. First, we study circuit discovery, where the objective is to identify the minimal set of components responsible for task performance (Conmy et al., 2023). We find that sparsified models yield substantially simpler computational graphs: the resulting circuits explain model behaviour using up to four times fewer attention heads and up to two orders of magnitude fewer edges. Second, using cross-layer transcoders (Ameisen et al., 2025), we analyse attribution graphs, which capture feature-level interactions across layers. In this setting, sparse attention mitigates the attention attribution problem by making it possible to identify which attention heads give rise to a given edge, owing to the reduced number of components mediating each connection. We argue that this clarity enables a tighter integration of feature-based and circuit-based perspectives, allowing feature interactions to be understood through explicit, tractable circuits. Taken together, these results position attention sparsity as an effective and practical inductive tool for surfacing the minimal functional backbone underlying model behaviour.
## 2 Related Work
### 2.1 Sparse Attention
As self-attention is a key component of the ubiquitous Transformer architecture, a large number of variants of attention mechanisms have been explored in the literature. Related to our approach are sparse attention methods, which are primarily designed to alleviate the quadratic scaling of vanilla self-attention. These methods typically rely on masks based on fixed local and strided patterns (Child et al., 2019) or sliding-window and global attention patterns (Beltagy et al., 2020; Zaheer et al., 2020) to constrain the receptive field of each token. While these approaches are successful in reducing the computational complexity of self-attention, they require hand-defined heuristics that do not reflect the internal computations learned by the model.
Beyond these fixed-pattern sparse attention methods, Top-$k$ attention, which enforces sparsity by dynamically selecting the $k$ most relevant keys per query based on their attention scores, has also been explored (Gupta et al., 2021; DeepSeek-AI, 2025). While Top-$k$ attention enables learnable sparse attention, the need to specify $k$ limits its scope for interpretability for two reasons. First, selecting the optimal $k$ is difficult, and setting $k$ too low can degrade model performance. Second, and more fundamentally, Top-$k$ attention does not allow the model to choose a different $k$ for different attention heads based on the context. We argue that this flexibility is crucial for maintaining model performance.
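For concreteness, the following is a minimal single-head Top-$k$ attention sketch in NumPy. It is illustrative only, not the method used in this paper; batch and head dimensions are omitted, and `topk_attention` is a hypothetical helper name. Note the single global `k` shared by every query, which is exactly the rigidity discussed above.

```python
import numpy as np

def topk_attention(Q, K, V, k):
    """Single-head Top-k attention sketch: each query keeps only its
    k highest-scoring keys, and the kept scores are re-softmaxed."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # k-th largest score per query row; everything below it is dropped
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # softmax over the surviving scores (exp(-inf) = 0 for dropped keys)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w
```

With continuous-valued scores each query ends up with exactly `k` non-zero attention weights, regardless of whether the context would warrant more or fewer.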
More recently, gated attention mechanisms (Qiu et al., 2025) provide a scalable and performant framework for inducing sparse attention. In particular, Lei et al. (2025) introduce a sparsity regularisation scheme for world modelling that reveals sparse token dependencies. We adopt this method and examine its role as an inductive bias for interpretability.
### 2.2 Circuit Discovery
Mechanistic interpretability seeks to uncover how internal components of LLMs implement specific computations. Ablation studies assess performance drops from removing components (Nanda et al., 2023), activation patching measures the effect of substituting activations (Zhang and Nanda, 2023), and attribution patching scales this approach via local linearisation (Syed et al., 2024). Together, these approaches allow researchers to isolate sub-circuits, minimal sets of attention heads and MLPs that are causally responsible for a given behavior or task (Conmy et al., 2023). Attention itself plays a dual role: it both routes information and exposes interpretable relational structure, making it a key substrate for mechanistic study. Our work builds on this foundation by leveraging sparsity to simplify these circuits, amplifying the interpretability of attention-mediated computation while preserving model performance.
### 2.3 Attribution Graph
Mechanistic interpretability has gradually shifted from an emphasis on explicit circuit discovery towards the analysis of internal representations and features. Recent work on attribution graphs and circuit tracing seeks to reunify these perspectives by approximating MLP outputs as sparse linear combinations of features and computing causal effects along linear paths between them (Dunefsky et al., 2024; Ameisen et al., 2025; Lindsey et al., 2025b). This framework enables the construction of feature-level circuits spanning the computation from input embeddings to final token predictions. Within attribution graphs, edges correspond to direct linear causal relationships between features. However, these relationships are mediated by attention heads that transmit information across token positions. Identifying which attention heads give rise to a particular edge, and understanding why they do so, is essential, as this mechanism forms a fundamental component of the computational graph (Kamath et al., 2025). A key limitation of current attribution-based approaches is that individual causal edges are modulated by dozens of attention components. We show that this leads to feature-to-feature influences that are overly complex, rendering explanations in terms of other features in the graph both computationally expensive and conceptually challenging.
## 3 Method
Our main hypothesis is that post-training existing LLMs to encourage sparse attention patterns leads to the emergence of more interpretable circuits. In order to instantiate this idea, we require a post-training pipeline that satisfies three main desiderata:
1. To induce sparse message passing between tokens, we need an attention mechanism that can ‘zero-out’ attention edges, which in turn enables effective $L_{0}$-regularisation on the attention weights. This is in contrast to the standard softmax attention mechanism, where naive regularisation would result in small but non-zero attention weights that still allow information flow between tokens.
2. The model architecture needs to be compatible with the original LLM such that the pre-trained LLM weights can be directly loaded at initialisation.
3. The post-training procedure needs to ensure that the post-trained models do not lose prediction performance compared to their fully-connected counterparts.
To this end, we leverage the Sparse Transformer architecture in the SPARTAN framework proposed in (Lei et al., 2025), which uses sparsity-regularised hard attention instead of the standard softmax attention. In the following subsections, we describe the Sparse Transformer architecture and the optimisation setup, highlighting how this approach satisfies the above desiderata.
### 3.1 Sparse Attention Layer
Given a set of token embeddings, the Sparse Transformer layer computes the key, query, and value embeddings, $\{k_{i},q_{i},v_{i}\}$ , via linear projections, analogous to the standard Transformer. Based on the embeddings, we sample a binary gating matrix from a learnable distribution parameterised by the keys and queries,
$$
A_{ij}\sim\mathrm{Bern}(\sigma(q_{i}^{T}k_{j})), \tag{1}
$$
where $\mathrm{Bern}(\cdot)$ is the Bernoulli distribution and $\sigma(\cdot)$ is the logistic sigmoid function. This sampling step can be made differentiable via the Gumbel Softmax trick (Jang et al., 2017). This binary matrix acts as a mask that controls the information flow across tokens. Next, the message passing step is carried out in the same way as standard softmax attention, with the exception that we mask out the value embeddings using the sampled binary mask,
$$
\mathrm{SparseAttn}(Q,K,V)=\bigg[A\odot\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})\bigg]V, \tag{2}
$$
where $d_{k}$ is the dimension of the key embeddings and $\odot$ denotes element-wise multiplication. During training, we regularise the expected number of edges between tokens based on the distribution over the gating matrix. Concretely, the expected number of edges for each layer can be calculated as
$$
\mathbb{E}\big[|A|\big]=\sum_{i,j}\sigma(q^{T}_{i}k_{j}). \tag{3}
$$
Note that during the forward pass, each entry of $A$ is a hard binary sample that zeros out attention edges, which serves as an effective $L_{0}$ regularisation. Moreover, since the functional form of the sparse attention layer after the hard sampling step is the same as standard softmax attention, pre-trained model weights can be directly used without alterations. Strictly speaking, the stochastically sampled $A$ does perturb the computation relative to the pre-trained model, since some gates may be closed at initialisation. This can be mitigated by adding a positive bias term inside the sigmoid function to ensure all gates are open at initialisation. Experimentally, we found this to be unnecessary, as the models quickly recover their original performance within a small number of gradient steps.
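The forward pass of Equations 1–3 can be sketched as follows. This is a minimal single-head NumPy illustration, not our training implementation: for simplicity we draw plain Bernoulli samples rather than using the differentiable Gumbel-Softmax relaxation, and we omit the causal mask and batch/head dimensions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_attention(Q, K, V, rng):
    """Single-head sparse attention sketch (Eqs. 1-3).

    Gates A_ij ~ Bern(sigmoid(q_i^T k_j)) mask the softmax weights;
    the expected number of open gates is the L0 regulariser."""
    d_k = K.shape[-1]
    p_open = sigmoid(Q @ K.T)                 # Eq. 1: Bernoulli parameters
    A = (rng.random(p_open.shape) < p_open).astype(float)  # hard sample

    scores = Q @ K.T / np.sqrt(d_k)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)  # standard softmax

    out = (A * attn) @ V                      # Eq. 2: masked message passing
    expected_edges = p_open.sum()             # Eq. 3: regularisation target
    return out, expected_edges
```

Because the masked layer reduces to standard softmax attention whenever all gates are open, pre-trained weights can be loaded directly into this parameterisation.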
### 3.2 Constrained Optimisation
In order to ensure that the models do not lose prediction performance during the post-training procedure, as per desideratum 3, we follow the approach proposed in (Lei et al., 2025), which employs the GECO algorithm (Rezende and Viola, 2018). Originally developed in the context of regularising VAEs, the GECO algorithm places a constraint on the performance of the model and uses a Lagrangian multiplier to automatically find the right strength of regularisation during training. Concretely, we formulate the learning process as the following optimisation problem,
$$
\min_{\theta}\sum_{l}\mathbb{E}\big[|A_{l}|\big]\quad\text{s.t.}\quad CE\leq\tau, \tag{4}
$$
where $A_{l}$ denotes the gating matrix at layer $l$, $CE$ is the standard next-token prediction cross-entropy loss, $\tau$ is the target loss, and $\theta$ denotes the model parameters. In practice, we set this target to the loss of the pre-trained baseline models. We solve this optimisation problem via Lagrangian relaxation, yielding the following max-min objective,
$$
\max_{\lambda>0}\min_{\theta}\bigg[\sum_{l}\mathbb{E}\big[|A_{l}|\big]+\lambda(CE-\tau)\bigg]. \tag{5}
$$
This can be solved by taking gradient steps on $\theta$ and $\lambda$ alternately. During training, updating $\lambda$ automatically balances the strength of the sparsity regularisation: when $CE$ is lower than the threshold $\tau$, $\lambda$ decreases, and hence relatively more weight is given to the sparsity regularisation term. This effectively acts as an adaptive schedule that continues to increase the strength of the regularisation until the model performance degrades. Here, the value of $\tau$ is selected as a hyperparameter to keep the sparse model’s performance within a chosen tolerance of the original base model. In practice, the choice of $\tau$ controls a trade-off between sparsity and performance: a tight $\tau$ can slow down training, whereas a looser tolerance can substantially speed it up at the cost of potentially harming model performance. In Appendix C, we provide further discussion on this optimisation process and its training dynamics.
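The alternating updates of Equation 5 can be illustrated on a toy problem. This is a sketch, not the actual training loop: we substitute a scalar "sparsity" cost $s(\theta)=\theta$ for $\sum_l \mathbb{E}[|A_l|]$ and a quadratic "loss" $(\theta-1)^2$ for $CE$, and parameterise $\lambda$ through its logarithm, one common way to keep it positive. The constrained optimum is $\theta^\ast = 1-\sqrt{\tau}$.

```python
import numpy as np

def geco_optimise(tau, steps=20000, lr_theta=1e-3, lr_loglam=1e-2):
    """Toy GECO-style max-min loop for Eq. 5:
    minimise s(theta) = theta  s.t.  CE(theta) = (theta - 1)^2 <= tau."""
    theta, log_lam = 3.0, 0.0
    for _ in range(steps):
        lam = np.exp(log_lam)
        # primal descent step on  theta + lam * (CE - tau)
        grad_theta = 1.0 + lam * 2.0 * (theta - 1.0)
        theta -= lr_theta * grad_theta
        # dual ascent on lambda (via its log): lambda grows while the
        # constraint is violated, shrinks once CE drops below tau
        log_lam += lr_loglam * ((theta - 1.0) ** 2 - tau)
    return theta, np.exp(log_lam)
```

The dual update mirrors the behaviour described above: whenever the loss exceeds the target, $\lambda$ rises and the loss term dominates; once the constraint is satisfied, $\lambda$ shrinks and the sparsity term takes over.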
### 3.3 Practical Considerations
One of the main strengths of our proposed method is that, architecturally, the only difference between a sparse Transformer and a normal one lies in how the dot-product attention is computed. As such, most practical training techniques for optimising Transformers can be readily adapted to our setting. In our experiments, we find the following techniques helpful for improving computational efficiency and training stability.
#### LoRA finetuning (Hu et al., 2022)
Low rank finetuning techniques can significantly reduce the computational requirements for training large models. In our experiments, we verify on a 7B parameter model that LoRA finetuning is sufficiently expressive for inducing sparse attention patterns.
#### FlashAttention (Dao, 2023)
FlashAttention has become a standard method for reducing the memory footprint of dot-product attention mechanisms. In Appendix B, we discuss how the sampled sparse attention can be implemented in an analogous manner.
#### Distillation (Gu et al., 2024)
Empirically, we find that adding an auxiliary distillation loss based on the KL divergence between the base model and the sparse model improves training stability and ensures that the behaviour of the model remains unchanged during post-training.
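A sketch of such an auxiliary loss is shown below. The exact form used in our experiments is not specified here, so treat this as one plausible choice: the token-level forward KL from the base (teacher) model's predictive distribution to the sparse (student) model's.

```python
import numpy as np

def distillation_kl(teacher_logits, student_logits):
    """Mean token-level KL(teacher || student) over positions (sketch).

    Both inputs have shape (positions, vocab)."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = log_softmax(teacher_logits)   # base model distribution
    log_q = log_softmax(student_logits)   # sparse model distribution
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean()
```

This term is zero when the sparse model exactly matches the base model's output distribution, and positive otherwise, so it directly penalises behavioural drift during post-training.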
<details>
<summary>x2.png Details</summary>

### Visual Description
Bar chart titled **"Benchmark Comparison"** showing accuracy (y-axis, 0–1, in increments of 0.2) for **OLMo-7B** (teal) and **Sparse OLMo-7B** (pink) on four benchmarks (x-axis): TruthfulQA (~0.24 for both), PIQA (~0.80 vs ~0.78), OpenBookQA (~0.38 vs ~0.35), and ARC-Easy (~0.60 vs ~0.58). A legend sits in the top-right corner. Both models score highest on PIQA and lowest on TruthfulQA, and the sparse model tracks the base model closely on every benchmark, with gaps of at most ~0.03.
</details>
Figure 2: Comparison of model performance between the base OLMo model and the sparsified model evaluated on the various benchmarks. Across all tasks, the performance of the sparse model remains comparable with the base model despite using substantially fewer attention edges.
## 4 Experiments
To evaluate the effectiveness of our post-training pipeline, we finetune pre-trained LLMs and compare their prediction performance and interpretability before and after applying sparsity regularisation. We perform full finetuning on a GPT-2 base model (Radford et al., 2019) (124M parameters) on the OpenWebText dataset (Gokaslan and Cohen, 2019). To investigate the generality and scalability of our method, we perform LoRA finetuning on the larger OLMo-7B model (Groeneveld et al., 2024) on the Dolma dataset (Soldaini et al., 2024), which is the dataset on which the base model was trained. The GPT-2 model and the OLMo model are trained on sequences of length 64 and 512, respectively. In the following subsections, we first present a quantitative evaluation of model performance and sparsity after sparse post-training. We then conduct two interpretability studies, using activation patching and attribution graphs, to demonstrate that our method enables the discovery of substantially smaller circuits.
### 4.1 Model Performance and Sparsity
We begin by evaluating both performance retention and the degree of sparsity achieved by post-training. We set cross-entropy targets of 3.50 for GPT-2 (base model: 3.48) and 2.29 for OLMo (base model: 2.24). After training, the mean cross-entropy loss for both models remains within $\pm 0.01$ of the target, indicating that the dual optimisation scheme effectively enforces a tight performance constraint. To quantify the sparsity achieved by the models, we evaluate them on the validation split of their respective datasets and compute the mean number of non-zero attention edges per attention head. We find that the sparsified GPT-2 model activates, on average, only 0.22% of its attention edges, while the sparsified OLMo model activates 0.44%, indicating substantial sparsification in both cases. Table 1 provides a summary of the results. To further verify that this drastic reduction in message passing between tokens does not substantially alter model behaviour, we evaluate the sparsified OLMo model on a subset of the benchmarks used to assess the original model. As shown in Figure 2, the sparse model largely retains the performance of the base model across a diverse set of tasks. In sum, our results demonstrate that sparse post-training is effective in consolidating information flow into a small number of edges while maintaining a commensurate level of performance.
| Model | Base CE | Target CE ($\tau$) | Final CE | Active edges |
| --- | --- | --- | --- | --- |
| GPT-2 | 3.48 | 3.50 | 3.501 | 0.22% |
| OLMo | 2.24 | 2.29 | 2.287 | 0.44% |
Table 1: Performance and sparsity of post-trained models. Final cross-entropy losses closely match the specified targets, while attention sparsity is substantially increased.
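The per-head edge statistic can be computed from sampled gate matrices as sketched below. How edges are counted here, i.e. normalising by the causal lower triangle (including self-edges), is our assumption, and `edge_density` is a hypothetical helper name.

```python
import numpy as np

def edge_density(gates):
    """Fraction of open attention edges, averaged over heads (sketch).

    `gates` is a list of sampled binary gate matrices A, one per head,
    each of shape (T, T); only causal positions count as possible edges."""
    fracs = []
    for A in gates:
        T = A.shape[0]
        possible = T * (T + 1) // 2           # causal edges incl. self
        fracs.append(np.tril(A).sum() / possible)
    return float(np.mean(fracs))
```

For example, a fully open causal head has density 1.0, while a head that only attends to the current token has density $2/(T+1)$.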
### 4.2 Circuit Discovery with Activation Patching
<details>
<summary>x3.png Details</summary>

### Visual Description
Two grids of per-head attention heatmaps, the top labelled **GPT2** and the bottom **Sparse GPT2**. Cells are indexed by layer and head (e.g. `L0H0`, `L5H5`), with darker blue indicating stronger attention. The GPT2 grid shows many heads with diffuse, dense attention patterns spread across the sequence. The Sparse GPT2 grid contains far fewer active heads, and the remaining ones display the stripe-like pattern of attending to a previous token at a fixed relative offset.
</details>
</details>
Figure 3: Attention patterns of the heads required to explain 90% of model behaviour on a copy task. The sparse model requires substantially fewer attention heads. Moreover, the selected heads exhibit the characteristic ‘induction head’ pattern: each token attends to a previous token at a fixed relative offset, effectively copying information forward through the sequence, a pattern well known to implement the copy mechanism in transformer models. Equivalent plots for OLMo can be found in Appendix D.
<details>
<summary>x4.png Details</summary>

### Visual Description
Four line graphs titled **"Greater Than"**, **"IOI"**, **"Docstring"**, and **"IOI Long"**, each plotting Explained Effect (y-axis, 0.0–1.0) against Number of Heads Kept (x-axis). The first two panels compare GPT-2 against Sparse GPT-2; the last two compare OLMo-7B against Sparse OLMo-7B. In every panel the sparse model's curve rises more steeply and reaches the 90% threshold with fewer heads, with annotated reduction factors of 4.5x (Greater Than), 2.2x (IOI), 2.2x (Docstring), and 1.4x (IOI Long). All curves plateau near 1.0 once enough heads are kept.
</details>
Figure 4: Logit attribution keeping only the top-$k$ attention heads. The dotted line marks the number of attention heads needed to explain 90% of the logit difference. Sparse models yield 1.4$\times$ to 4.5$\times$ smaller circuits. Shaded areas show the standard error across 20 prompts.
<details>
<summary>x5.png Details</summary>

### Visual Description
Four panels (Greater Than, IOI, Docstring, IOI Long) plot Explained Effect (0.0–1.0) against Number of Edges Kept on a logarithmic x-axis (10⁰–10⁵). The GPT-2 panels compare GPT-2 (blue) with Sparse GPT-2 (orange); the OLMo panels compare OLMo-7B (teal) with Sparse OLMo-7B (pink). The sparse curves plateau near 1.0 with far fewer edges than the dense baselines. Annotations mark the reduction in edges needed: 97.0× (Greater Than), 42.8× (IOI), 8.6× (Docstring), 5.4× (IOI Long).
</details>
Figure 5: Logit attribution per sentence keeping only the top-$k$ attention edges. Sparse models yield 5.4$\times$ to 97$\times$ smaller circuits. The shaded area shows the standard error across 20 prompts.
We begin by outlining the experimental procedure used for circuit discovery. Activation patching (Nanda et al., 2023) is a widely used technique for identifying task-specific circuits in transformer models. In a typical setup, the model is evaluated on pairs of prompts: a clean prompt, for which the model predicts a correct target token, and a corrupted prompt that shares the overall structure of the clean prompt but is modified to induce an incorrect prediction. The goal is to find the set of model components responsible for the model’s preference for the correct answer over the wrong one, as measured by the logit difference between the corresponding tokens. In activation patching, individual model components, such as attention heads and individual edges, can be ‘switched off’ by patching their activations at specific positions. Circuit discovery then amounts to finding a set of components whose replacement causes the model’s prediction to shift from the correct to the corrupted answer.
Since searching over every possible subset of model components is infeasible due to the exponential number of potential subsets, we adopt a common heuristic to rank each model component. Specifically, for each individual component, we compute an importance score by replacing the activations of the component with the corrupted activations and measuring the effect on the logit difference. In our experiments, we use this ranking to select the top-$k$ components and intervene on the model by freezing all remaining components, with the goal of identifying the minimal set that accounts for at least 90% of the model’s preference for the correct prediction. Note that these importance scores can be computed at two levels: (i) a single-sentence level, using a single pair of correct and corrupted inputs, and (ii) a global level, obtained by averaging scores across many task variants. In our experiments, we report results using single-sentence scores. In Appendix D, we also provide results using the global scores, which are largely consistent with our main results. There are also two standard approaches for freezing component activations: setting the activation to zero or replacing it with a mean activation value (Conmy et al., 2023). We evaluate both variants for each model and report results for the patching strategy that yields the smallest circuits.
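The ranking-and-selection heuristic can be sketched in a few lines of Python; the function below is an illustrative reconstruction with hypothetical names and toy numbers, not our actual implementation:

```python
import numpy as np

def minimal_circuit(logit_diff_clean, patched_logit_diffs, explained=0.90):
    """Greedy top-k heuristic: rank each component by the drop in logit
    difference when its activation is replaced with the corrupted one,
    then keep components until their cumulative importance accounts for
    `explained` of the clean logit difference."""
    importance = logit_diff_clean - np.asarray(patched_logit_diffs)
    order = np.argsort(-importance)              # most important first
    cumulative = np.cumsum(importance[order])
    k = int(np.searchsorted(cumulative, explained * logit_diff_clean)) + 1
    return order[:k]
```

For instance, with a clean logit difference of 4.0 and patched logit differences of [2.0, 2.5, 3.7, 3.8], the first three components (importances 2.0, 1.5, 0.3) are kept, since together they explain 95% of the effect.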
We first focus on the copy task with the following prompt: "AJEFCKLMOPQRSTVWZS, AJEFCKLMOPQRSTVWZ", where the model has to copy the letter S to the next token position. This task is well studied and is widely believed to be implemented by emergent induction heads (Elhage et al., 2021), which propagate token information forward in the sequence. Figure 3 illustrates the attention patterns of the set of attention heads that explains this prompt for the sparse and base GPT-2 models. See Appendix D for analogous results for the OLMo models. The sparse model admits a substantially smaller set of attention heads (9 heads) than its fully connected counterpart (61 heads). Moreover, the identified heads in the sparse model exhibit cleaner induction head patterns, with each token attending to a single prior position at a fixed relative offset. These results illustrate how sparsification facilitates interpretability under simple ranking-based methods and support our hypothesis that sparse post-training yields models that are more amenable to mechanistic interpretability techniques.
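The "single prior position at a fixed relative offset" pattern can be scored directly from a head's attention matrix; this is an illustrative diagnostic sketch, not the analysis code used for Figure 3:

```python
import numpy as np

def fixed_offset_score(attn):
    """Fraction of attention mass a head places at its modal relative
    offset. `attn` is a [seq, seq] causal attention matrix; a score near
    1.0 means each query attends to a single prior position at a fixed
    offset, as in a clean induction-head pattern."""
    queries = np.arange(attn.shape[0])
    offsets = queries - attn.argmax(axis=-1)   # per-query argmax offset (>= 0 for causal attn)
    modal = np.bincount(offsets).argmax()      # most common offset across queries
    rows = queries[queries >= modal]           # queries for which the offset is valid
    return float(attn[rows, rows - modal].mean())
```

A head that attends perfectly three tokens back scores 1.0; a diffuse head scores much lower.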
To further verify our hypothesis, we repeat the experiment on classical circuit discovery tasks. For GPT-2, we evaluate variants of the Indirect Object Identification (IOI) task, in which the model copies a person’s name from the start of a sentence, and the Greater Than task, in which the model predicts a number larger than a previously mentioned number. To further assess the scalability of our approach, we investigate more challenging and longer-horizon tasks for OLMo, including a longer-context IOI task and a Docstring task in which the model must predict an argument name in a docstring based on an implemented function. Details of each task can be found in Appendix E. Figures 4 and 5 show the fraction of model behaviour explained as a function of the number of retained model components (attention heads and attention edges, respectively). Across all tasks and models, the sparse models consistently produce significantly smaller circuits, as measured by the number of model components needed to explain 90% of the model prediction. This further corroborates our claim that sparse models lead to simpler and more interpretable internal circuits.
### 4.3 Attribution Graphs
Next, we present a more fine-grained, feature-level investigation of whether sparsity in attention leads to interpretable circuits in practice, using cross-layer transcoders (CLTs). Since training CLTs on OLMo-7B is computationally prohibitive (the largest open-source CLT at the time of writing is for Gemma-2B), we focus our analysis on the GPT-2 models. For the rest of this section, we analyse CLTs trained on the sparse and base GPT-2 models with an expansion factor of 32, each achieving a replacement score above 80% as measured with Circuit Tracer (Hanna et al., 2025). See Appendices F and G for details on training and visualisation.
We study the problem of attention attribution, which seeks to understand how edges between features are mediated. The key challenge here is that any given edge can be affected by a large number of model components, making mediation circuits difficult to analyse both computationally and conceptually: computationally, exhaustive enumeration is costly; conceptually, the resulting circuits are often large and uninterpretable. In this experiment, we demonstrate that sparse attention patterns induced via post-training substantially alleviate these challenges, as the vast majority of attention components have zero effect on the computation.
As in Ameisen et al. (2025), we define the total attribution score between feature $n$ at layer $\ell$ and position $k$, and feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$, as
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}=f_{k,n}^{\ell}\;J_{\ell,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}. \tag{6}
$$
Here, $f_{k,n}^{\ell}$ denotes the decoder vector corresponding to feature $n$ at layer $\ell$ and position $k$, and $g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}$ is the corresponding encoder vector for feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$. The term $J_{\ell,k}^{\ell^{\prime},k^{\prime}}$ is the Jacobian from the MLP output at $(\ell,k)$ to the MLP input at $(\ell^{\prime},k^{\prime})$. This Jacobian is computed during a forward pass in which all nonlinearities are frozen using stop-gradient operations. Under this linearisation, the attribution score represents the sum over all linear paths from the source feature to the target feature.
To analyse how this total effect between two features is mediated by each model component, we define the component-specific attribution by subtracting the contribution of all paths that do not pass through the component:
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}(h)=f_{k,n}^{\ell}\;J_{\ell,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}-f_{k,n}^{\ell}\;\bigl[J_{\ell,k}^{\ell^{\prime},k^{\prime}}\bigr]_{h}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}.
$$
Here, $\bigl[J_{\ell,k}^{\ell^{\prime},k^{\prime}}\bigr]_{h}$ denotes a modified Jacobian computed under the same linearisation as above, but with the specific attention component $h$ additionally frozen via stop-gradient. As such, these component-specific scores quantify how much each attention component impacts a particular edge between features.
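Once the frozen-nonlinearity Jacobians are available, both scores reduce to bilinear forms. A toy example with explicit Jacobian matrices (real models would obtain these via stop-gradient forward passes) illustrates the computation:

```python
import numpy as np

def total_attribution(f, J, g):
    """Eq. (6): f^T J g, the sum over all linear paths from the source
    feature's decoder vector f to the target feature's encoder vector g."""
    return float(f @ J @ g)

def component_attribution(f, J, J_without_h, g):
    """Attribution mediated by component h: the total effect minus the
    effect of all paths bypassing h (J_without_h has h additionally frozen)."""
    return total_attribution(f, J, g) - total_attribution(f, J_without_h, g)
```

With $f = (1, 0)$, $g = (1, 1)$, a full Jacobian $J = [[2, 1], [0, 3]]$, and a frozen-$h$ Jacobian $[[2, 0], [0, 3]]$, the total attribution is 3.0, of which 1.0 is mediated by $h$.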
Empirically, we evaluate the method on ten pruned attribution graphs, computed on the IOI, Greater Than, completion, and category tasks. As in our circuit discovery experiments, we compute attribution scores at the level of attention heads as well as individual key–query pairs. In practice, attention sparsity yields substantial computational savings: because inactive key–query pairs are known a priori to have exactly zero attribution score, attribution need only be computed for a small subset of components. This reduces the computation time per attribution graph from several hours to several minutes.
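The a-priori zero-attribution property is what drives the savings: only the nonzero support of the sparse attention pattern needs to be scored. A minimal sketch with hypothetical names:

```python
import numpy as np

def active_kq_pairs(attn):
    """Enumerate (head, query, key) triples with nonzero attention weight.
    In the sparse model this support is a small fraction of all pairs, so
    component-specific attribution is only evaluated on these triples."""
    heads, queries, keys = np.nonzero(attn)
    return list(zip(heads.tolist(), queries.tolist(), keys.tolist()))
```

For a [heads, seq, seq] attention tensor that is mostly zero, the attribution loop runs over this short list instead of all head-query-key combinations.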
<details>
<summary>x6.png Details</summary>

### Visual Description
Two panels plot Mean Cumulative Mass (0.0–1.0) against Sorted Index: key–query pairs (edges) on a logarithmic x-axis (10⁰–10³) on the left, and full attention heads on a linear x-axis on the right. In both panels the Sparse curve (orange) rises much faster than the Non Sparse curve (blue) before both plateau near 1.0. Dashed annotations mark the reduction in components needed to reach the cumulative threshold: 16.1× for edges and 3.4× for heads.
</details>
Figure 6: Mean cumulative distribution of the component scores that mediate an attribution-graph edge. Left: key–query pairs within a head; right: full attention heads.
In terms of circuit size, Figure 6 shows the mean cumulative distribution of component attribution scores for each edge in the attribution graph. We find that, to reach a cumulative attribution threshold of 90%, the sparse model on average requires $16.1\times$ fewer key–query pairs and $3.4\times$ fewer attention heads than the dense GPT-2 model, supporting our hypothesis that sparse attention patterns lead to simpler mediation circuits.
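The circuit sizes reported here amount to counting sorted component scores until a cumulative-mass threshold is reached; a small illustrative helper (not our exact analysis code):

```python
import numpy as np

def components_for_threshold(scores, threshold=0.90):
    """Number of components (key-query pairs or attention heads) whose
    largest attribution scores account for `threshold` of the total mass."""
    sorted_scores = np.sort(np.abs(np.asarray(scores)))[::-1]   # descending
    cumulative = np.cumsum(sorted_scores) / sorted_scores.sum()
    return int(np.searchsorted(cumulative, threshold)) + 1
```

Comparing this count between the sparse and dense models, per edge and averaged over the graph, yields the reported 16.1× and 3.4× reductions.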
<details>
<summary>x7.png Details</summary>

### Visual Description
Side-by-side sketch of the attention components mediating the attribution for the prompt "The opposite of 'large' is" (token positions 1–8, with red marking key positions and blue marking query positions). Left (GPT-2): a dense pattern spanning more than forty heads. Right (Sparse GPT-2): five heads (L11-H7, L10-H1, L9-H7, L9-H1, L8-H6) that all map key position 5 ('large') to query position 8, together accounting for 80% of the cumulative attribution score.
</details>
Figure 7: Sketch of the attribution graph for the sentence “The opposite of ‘large’ is”. The cluster of features associated with large at token position 5 maps directly to the final next-token prediction logit small. We show the attention patterns of all key–query pairs required to account for 80% of the cumulative attribution score. In the sparse-attention setting, this corresponds to five attention heads, compared to more than forty heads in the dense-attention case. In the sparse model, these heads read from token position 5 and write directly to the last-token residual stream at token position 8. These heads thus compute in parallel and provide a clear picture of the internal computation.
Next, we present a qualitative case-study to showcase the benefits of sparse attention patterns. For a given key–query pair, we compute the causal effect from all other features in the attribution graph to both the key and the query vectors. Figure 7 illustrates this analysis for the prompt “The opposite of ‘large’ is”. The resulting attribution graph decomposes into four coherent clusters of features: features related to opposite, features related to large, features activating on bracketed tokens, and the final next-token logit corresponding to small (see Appendix H for example of features and visualization).
Here, the features in the large cluster are directly connected to the small logit. The key question is to understand how this connection from the large cluster to the small logit comes about. To this end, we analyse its mediation structure. We find that 80% of the cumulative attribution score of the edges connecting the large cluster to the small logit is mediated by the same five late-layer attention key–query pairs. These attention components map features from token position 5 directly into the final-layer residual stream at position 8, and thus operate in parallel.
For these five key–query pairs, we then compute the causal influence of all other features in the graph on their key and query vectors. The query vectors are primarily modulated by features associated with bracketed tokens in the last token position, while the key vectors are driven by strongly active features in both the opposite and large clusters, as shown in Figure 8. These results agree with recent work on attention attribution and the “opposite of” attribution graph (Kamath et al., 2025). In stark contrast, Figure 7 (left) shows that a similar (and more computationally expensive) analysis on the dense model produces a much more complicated circuit. This case study illustrates the potential of sparse attention in the context of attribution graphs, as it enables a unified view of features and circuits. By jointly analyzing feature activations, attention components, and their mediating roles, we obtain a more faithful picture of the computational graph underlying the model’s input–output behavior.
## 5 Conclusion
Achieving interpretability requires innovations in both interpretation techniques and model design. We investigate how large models can be trained to be intrinsically interpretable. We present a flexible post-training procedure that sparsifies transformer attention while preserving the original pretraining loss. By minimally adapting the architecture, we apply a sparsity penalty under a constrained-loss objective, allowing the pre-trained model to reorganise its connectivity into a much more selective and structured pattern.
| Rank | $\rightarrow$ Key | $\rightarrow$ Query |
|------|-------------------|---------------------|
| 1 | large (pos 5) | bracket (pos 8) |
| 2 | large (pos 5) | bracket (pos 8) |
| 3 | quantities (pos 5) | bracket (pos 8) |
| 4 | comparison (pos 3) | bracket (pos 8) |
| 5 | opposite (pos 3) | bracket (pos 8) |
Figure 8: Minimal description of the top-5 features activating the key and query vectors of attention head L8-H6 from Figure 7.
Mechanistically, this induced sparsity gives rise to substantially simpler circuits: task-relevant computation concentrates into a small number of attention heads and edges. Across a range of tasks and analyses, we show that sparsity improves interpretability at the circuit level by reducing the number of components involved in specific behaviours. In circuit discovery experiments, most of the model’s behaviour can be explained by circuits that are orders of magnitude smaller than in dense models; in attribution graph analyses, the reduced number of mediating components renders attention attribution tractable. Together, these results position sparse post-training of attention as a practical and effective tool for enhancing the mechanistic interpretability of pre-trained models.
#### Limitations and Future Work.
One limitation of the present investigation is that, while we deliberately focus on sparsity as a post-training intervention, it remains an open question whether injecting a sparsity bias directly during training would yield qualitatively different or simpler circuit structures. A comprehensive exploration of the performance trade-offs for larger models, and for tasks that require very dense or long-range attention patterns, would also be beneficial, even if beyond the computational means currently at our disposal. Moreover, while our study is restricted to sparsifying attention patterns, the underlying principle of leveraging sparsity to promote interpretability naturally extends to other components of the transformer architecture. As such, combining the proposed method with complementary approaches for training intrinsically interpretable models, such as Sparse Mixture-of-Experts (Yang et al., 2025), sparsifying model weights (Gao et al., 2024), or limiting superposition, offers a promising direction for future work. Another exciting avenue is to apply the sparsity regularisation framework developed here within alternative post-training paradigms, such as reinforcement learning (Ouyang et al., 2022; Zhou et al., 2024) or supervised fine-tuning (Pareja et al., 2025).
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
## Acknowledgment
F. D. acknowledges support through a fellowship from the Hector Fellow Academy. A. L. is supported by an EPSRC Programme Grant (EP/V000748/1). I. P. holds concurrent appointments as a Professor of Applied AI at the University of Oxford and as an Amazon Scholar. This paper describes work performed at the University of Oxford and is not associated with Amazon.
## References
- E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. Ben Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson (2025) Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread.
- I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
- L. Bereska and E. Gavves (2024) Mechanistic interpretability for AI safety: a review. arXiv preprint arXiv:2404.14082.
- R. Bommasani et al. (2021) On the opportunities and risks of foundation models. arXiv preprint.
- R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- T. Conerly, H. Cunningham, A. Templeton, J. Lindsey, B. Hosmer, and A. Jermyn (2025) Circuits updates, January 2025. Transformer Circuits Thread.
- A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso (2023) Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems 36, pp. 16318–16352.
- T. Dao (2023) FlashAttention-2: faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
- DeepSeek-AI (2025) DeepSeek-V3.2: pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
- J. Dunefsky, P. Chlenski, and N. Nanda (2024) Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems 37, pp. 24375–24410.
- N. Elhage, N. Nanda, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html
- L. Gao, A. Rajaram, J. Coxon, S. V. Govande, B. Baker, and D. Mossing (2024) Weight-sparse transformers have interpretable circuits. Technical report, OpenAI.
- A. Gokaslan and V. Cohen (2019) OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus
- D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, N. Subramani, M. Wortsman, P. Dasigi, N. Lambert, K. Richardson, J. Dodge, K. Lo, L. Soldaini, N. A. Smith, and H. Hajishirzi (2024) OLMo: accelerating the science of language models. Preprint.
- Y. Gu, L. Dong, F. Wei, and M. Huang (2024) MiniLLM: knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.
- A. Gupta, G. Dar, S. Goodman, D. Ciprut, and J. Berant (2021) Memory-efficient transformers via top-$k$ attention. arXiv preprint arXiv:2106.06899.
- M. Hanna, M. Piotrowski, J. Lindsey, and E. Ameisen (2025) Circuit-tracer. https://github.com/safety-research/circuit-tracer
- S. Heimersheim and J. Janiak (2023) A circuit for Python docstrings in a 4-layer attention-only transformer. Alignment Forum.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations.
- E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.
- H. Kamath, E. Ameisen, I. Kauvar, R. Luger, W. Gurnee, A. Pearce, S. Zimmerman, J. Batson, T. Conerly, C. Olah, and J. Lindsey (2025) Tracing attention computation through feature interactions. Transformer Circuits Thread.
- A. Lei, B. Schölkopf, and I. Posner (2025) SPARTAN: a sparse transformer world model attending to what matters. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- J. Lindsey, E. Ameisen, N. Nanda, S. Shabalin, M. Piotrowski, T. McGrath, M. Hanna, O. Lewis, C. Tigges, J. Merullo, C. Watts, G. Paulo, J. Batson, L. Gorton, E. Simon, M. Loeffler, C. McDougall, and J. Lin (2025a) The circuits research landscape: results and perspectives. Neuronpedia.
- J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, N. L. Turner, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. B. Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson (2025b) On the biology of a large language model. Transformer Circuits Thread.
- N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt (2023) Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.
- C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. (2022) In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- A. Pareja, N. S. Nayak, H. Wang, K. Killamsetty, S. Sudalairaj, W. Zhao, S. Han, A. Bhandwaldar, G. Xu, K. Xu, L. Han, L. Inglis, and A. Srivastava (2025) Unveiling the secret recipe: a guide for supervised fine-tuning small LLMs. In The Thirteenth International Conference on Learning Representations.
- Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huang, D. Liu, J. Zhou, and J. Lin (2025) Gated attention for large language models: non-linearity, sparsity, and attention-sink-free. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8).
- D. J. Rezende and F. Viola (2018) Taming VAEs. arXiv preprint arXiv:1810.00597.
- L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar, L. Lucy, X. Lyu, N. Lambert, I. Magnusson, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, A. Ravichander, K. Richardson, Z. Shen, E. Strubell, N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer, N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld, J. Dodge, and K. Lo (2024) Dolma: an open corpus of three trillion tokens for language model pretraining research. arXiv preprint.
- A. Syed, C. Rager, and A. Conmy (2024) Attribution patching outperforms automated circuit discovery. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 407–416. Cited by: §2.2.
- X. Yang, C. Venhoff, A. Khakzar, C. S. de Witt, P. K. Dokania, A. Bibi, and P. Torr (2025) Mixture of experts made intrinsically interpretable. arXiv preprint arXiv:2503.07639. Cited by: §5.
- M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. (2020) Big bird: transformers for longer sequences. Advances in neural information processing systems 33, pp. 17283–17297. Cited by: §2.1.
- F. Zhang and N. Nanda (2023) Towards best practices of activation patching in language models: metrics and methods. arXiv preprint arXiv:2309.16042. Cited by: §2.2.
- Y. Zhou, A. Zanette, J. Pan, S. Levine, and A. Kumar (2024) ArCHer: training language model agents via hierarchical multi-turn rl. In ICML, External Links: Link Cited by: §5.
## Appendix A Two-Digit Addition Study
Figure 9: Simple example showing the attention patterns (in blue) of sparse and non-sparse transformers trained on a two-digit addition task. Both models correctly predict the sum, but their attention patterns differ markedly: the non-sparse model solves the task with highly dispersed information flow, while the sparse model uses a highly interpretable pattern. In Layer 0, the model first attends to the corresponding digits to be added; then, in Layer 1, it attends to the carry bit only when it is needed (see the middle and right columns, where the model has to carry once and twice, respectively).
In the introduction, we used a two-digit addition task to demonstrate how sparse attention patterns can lead to intrinsically interpretable circuits. The results presented there were gathered in the small-scale toy experiment described below. We train 4-layer, single-head transformer models on a two-digit addition task, where the input is a sequence of digits and the model is trained to predict the sum. The vocabulary contains 13 tokens: the ten digits and the three symbols "+", "=", and "?".
Within this setting, we train two models: a standard transformer model and a sparse transformer with a fixed sparsity regularisation strength. Figure 9 shows several examples of the learned attention patterns. In these examples, we can clearly see that the pressure of sparsity leads to the emergence of human-recognisable algorithmic patterns: in the first layer, each digit in the answer attends to the corresponding digits in the input, while the second layer computes the carry bit when necessary. By enforcing selective information flow through sparse message-passing, the sparse model is able to learn crisp and localised mechanisms that are immediately amenable to interpretation.
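To make the setup concrete, the toy dataset can be sketched as below. The zero-padding width of the answer and the exact sequence layout are assumptions inferred from the examples in Figure 9 (e.g. "53 + 21 = 00074"), not the released experiment code.

```python
# 13-token vocabulary: ten digits plus the three symbols "+", "=", "?".
VOCAB = [str(d) for d in range(10)] + ["+", "=", "?"]
TOK = {t: i for i, t in enumerate(VOCAB)}

def make_example(a: int, b: int):
    """Return (input token ids, target token ids) for a two-digit addition.

    The prompt spells out "a+b=" followed by "?" placeholders; the target is
    the sum zero-padded to five digits, as shown in Figure 9.
    """
    assert 0 <= a < 100 and 0 <= b < 100
    prompt = f"{a:02d}+{b:02d}=" + "?" * 5   # e.g. "53+21=?????"
    target = f"{a + b:05d}"                   # e.g. "00074"
    return [TOK[c] for c in prompt], [TOK[c] for c in target]

x, y = make_example(53, 21)
```

The model is then trained with a standard next-token objective to replace the "?" placeholders with the digits of the sum.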
## Appendix B Sparse Attention Implementation
For the experiments, we implemented efficient GPU kernels for the sparse attention layers using the Helion domain-specific language (https://helionlang.com/). We refer to this implementation as Splash Attention (Sparse Flash Attention). Our implementation follows the same core algorithmic structure as FlashAttention-2 (Dao, 2023), including online softmax computation and tiling. Note that the sparse attention variant (Eq. 2) differs from standard attention only by a pointwise multiplication with the adjacency matrix, which can be easily integrated into FlashAttention by computing $A_{ij}$ on the fly. We additionally fuse the Gumbel-softmax computation, the straight-through gradient, and the computation of the expected number of edges (required for the penalty) into a single optimized kernel, whose implementation will be released together with the experiment code. Figure 10 compares our Splash Attention implementation against a naive baseline built from PyTorch-native operations.
Figure 10: Performance comparison between our implementation (Splash) and a naive PyTorch baseline.
## Appendix C Training Details
### C.1 Hyperparameters and Compute Resources
| Hyperparameter | OLMo | GPT-2 |
| --- | --- | --- |
| Base Model | allenai/OLMo-7B-hf | gpt2 |
| Context window | 512 | 64 |
| Dataset | dolma-v1 | OpenWebText |
| Batch size | 16 | 256 |
| Gradient accumulation steps | 4 | 4 |
| Total steps | 400,000 | 1,200,000 |
| Learning rate | $1\times 10^{-5}$ | $1\times 10^{-5}$ |
| Minimum learning rate | $1\times 10^{-6}$ | $1\times 10^{-6}$ |
| Optimizer | Adam | Adam |
| Weight decay | 0.1 | 0.1 |
| Scheduler | Cosine (1 cycle) | Cosine (1 cycle) |
| Warmup steps | 1,000 | 1,000 |
| Finetuning strategy | LoRA | Full |
| LoRA rank ( $r$ ) | 400 | - |
| LoRA scaling ( $\alpha$ ) | 800 | - |
| LoRA dropout | 0 | - |
| LoRA target modules | q,k,v,o,fc_in,fc_out | - |
| Dual Optimisation LR | 0.01 | 0.1 |
| Target cross-entropy | 2.29 | 3.5 |
Table 2: Key hyperparameters used for the sparse post-training experiments on OLMo-7B and GPT-2.
We provide the key hyperparameters for our experiments in Table 2. All training runs are performed on NVIDIA H100 GPUs: the GPT-2 model is trained on a single GPU, while the OLMo model is trained on a node of 8 GPUs. The total training time for both models is roughly 14 days. The main sparse attention code will be made available as a Transformers library wrapper; the implementation code as well as the model weights will also be released.
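The OLMo run uses LoRA as its finetuning strategy. As a reminder of what the rank and scaling entries in Table 2 control, the LoRA reparameterisation of a linear layer is $y = xW + (\alpha/r)\,xAB$, with the up-projection initialised to zero so the adapted layer starts out identical to the frozen base. The pure-Python sketch below (illustrative only; the experiments use a standard LoRA implementation) makes this explicit — note that with $r=400$ and $\alpha=800$ the effective scale $\alpha/r$ is 2.

```python
def matmul(X, Y):
    # Plain nested-list matrix product, for a dependency-free sketch.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """y = x W + (alpha / r) * x A B  -- LoRA adapter on a linear layer.

    A is the (d_in, r) down-projection, B the (r, d_out) up-projection;
    B is initialised to zero, so the adapter is initially a no-op.
    """
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    s = alpha / r
    return [[b + s * d for b, d in zip(br, dr)]
            for br, dr in zip(base, delta)]
```

Only the low-rank factors $A$ and $B$ (plus the gate parameters) receive gradients during post-training; the base weights $W$ stay frozen.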
### C.2 Training Dynamics
Figure 11: Training curves for post-training OLMo-7B, tracking the model sparsity (left), regularisation strength (middle), and cross-entropy loss (right). The black dotted line on the cross-entropy plot indicates the pre-defined threshold, $\tau$.
A key feature of our post-training framework is that the strength of the sparsity regularisation is automatically controlled via a constrained optimisation scheme. By pre-specifying an acceptable level for the cross-entropy target, $\tau$, the training procedure can be written as the max-min objective:
$$
\max_{\lambda>0}\min_{\theta}\bigg[\sum_{l}\mathbb{E}\big[|A_{l}|\big]+\lambda(\mathrm{CE}-\tau)\bigg], \tag{7}
$$
which can be optimised by taking alternating gradient steps in the model weight space and in the $\lambda$ space. Under these dynamics, the sparsity regularisation strength increases when the model cross-entropy is below the target and decreases when the model is above the threshold. Figure 11 shows the training curves for the OLMo-7B model. We observe that the strength of the sparsity regularisation keeps increasing slowly while the model cross-entropy is clipped at the desired level. Note that during a loss spike (at around 100K steps), the sparsity regularisation automatically decreases to let the model recover.
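The dual step can be sketched as projected gradient ascent on the $\lambda$ term of Eq. 7; the step size below is illustrative, not the value used in training, and the $\theta$ step is an ordinary gradient update on the full Lagrangian that we do not reproduce here.

```python
def dual_step(lam, ce, tau, lr):
    """One projected gradient-ascent step on the dual variable of Eq. 7.

    d/dlam [lam * (ce - tau)] = ce - tau, so lam grows while the model's
    cross-entropy exceeds the target tau (shifting weight onto the CE term
    and easing the relative sparsity pressure), and shrinks back toward
    zero once the constraint CE <= tau is satisfied, letting the sparsity
    term dominate again. The max() enforces the lambda > 0 constraint.
    """
    return max(0.0, lam + lr * (ce - tau))
```

This is the mechanism behind the recovery behaviour in Figure 11: a cross-entropy spike above $\tau$ automatically re-weights the objective toward the language-modelling loss until the model returns below the threshold.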
## Appendix D Extra Experiments for Circuit Discovery
In this section, we provide additional results for the activation patching circuit discovery experiment presented in the main text.
Figure 12 shows the attention patterns of the heads required to explain 90% of model behaviour on a copy task. To fully test the longer context window afforded by OLMo, we use a longer prompt than the one used for GPT-2 in the main text. The result is consistent with the GPT-2 experiment: the sparsified model facilitates the discovery of a smaller circuit of induction heads that implements the copy task.
Figure 12: Attention patterns of the heads required to explain 90% of model behaviour on a longer copy task. Similar to the GPT-2 results in Figure 3, the sparse model requires substantially fewer attention heads.
Figures 13 and 14 show the fraction of explained model preference as a function of the number of model components kept un-ablated. The difference between these plots and Figures 4 and 5 lies in how individual model components are ranked. Here, ranking is performed at the task level, meaning that the importance score for each component is pooled across different instances of the same task. Overall, the results are consistent with those presented in the main paper, where this ranking strategy consistently discovers smaller circuits in sparse models. The only exception is the Greater-Than task for GPT-2, where the number of attention heads required for the sparse model is larger than that of the base model. We hypothesise that this is because the sparse model chooses different circuits to implement different instances of the same task, rendering the task-level importance score less suitable for circuit discovery in this case. Finally, Figure 15 provides a qualitative visualisation of the edges required to complete the IOI task.
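The task-level ranking described above can be sketched as follows. The pooling operator (a mean over instances) and the greedy cumulative cutoff are assumptions chosen for illustration; the paper's exact scoring details are given in Section 4.2.

```python
def task_level_topk(instance_scores, frac=0.9):
    """Select components by task-level importance.

    instance_scores maps component name -> list of per-instance importance
    scores. Scores are pooled (averaged) across instances of the task,
    components are ranked by pooled score, and the smallest prefix whose
    cumulative score reaches `frac` of the total is kept.
    """
    pooled = {c: sum(s) / len(s) for c, s in instance_scores.items()}
    ranked = sorted(pooled, key=pooled.get, reverse=True)
    total = sum(pooled.values())
    kept, acc = [], 0.0
    for c in ranked:
        kept.append(c)
        acc += pooled[c]
        if acc >= frac * total:
            break
    return kept
```

The Greater-Than failure case for sparse GPT-2 corresponds to the situation where per-instance score vectors disagree strongly, so the pooled mean no longer concentrates on a small, shared set of components.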
Figure 13: Logit attribution per sentence, keeping only the top-k attention heads based on a global ranking score. The dotted line annotates the number of attention heads needed to explain 90% of the logit difference. With the exception of the Greater-Than task for GPT-2, the sparse models admit smaller circuits.
Figure 14: Logit attribution per sentence, keeping only the top-k attention edges based on a global ranking score. The dotted line annotates the number of edges needed to explain 90% of the logit difference.
Figure 15: An example of the attention-head edges required to reach 0.9 cumulative score based on the averaged scores for the IOI task.
## Appendix E Circuit Discovery Tasks
In the following, we provide the details and prompts for the various tasks used in Section 4.2.
### E.1 Greater-Than Task
Each example contains a clean prompt, a corrupt prompt, and two disjoint sets of candidate continuations, answers and wrong_answers. A typical entry is:
{ "clean": "The demonstrations lasted from the year 1363 to 13", "corrupt": "The demonstrations lasted from the year 1301 to 13", "answers": ["64", "65", ..., "99"], "wrong_answers": ["00", "01", ..., "63"] }
For the clean prompt, any token in answers yields an end year strictly greater than the start year (e.g. "1364" – "1399"), whereas tokens in wrong_answers correspond to years that are less than or equal to the start year. The corrupt prompt changes only the starting year, shifting which continuations correspond to valid end years. We use the logit difference between the aggregated probability mass on answers vs. wrong_answers in clean vs. corrupt contexts as our signal, in the spirit of prior mechanistic studies on simple algorithmic tasks (Elhage et al., 2021; Nanda et al., 2023).
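As a concrete illustration, the aggregated-mass signal for a single prompt can be computed as in the sketch below; the function name and the `(seq_len, vocab)` logits layout are illustrative assumptions, not our evaluation code.

```python
import torch

def greater_than_signal(logits: torch.Tensor, answer_ids: list[int],
                        wrong_ids: list[int]) -> float:
    """Log-odds between the probability mass on valid end-year tokens
    and the mass on invalid ones, read off at the final position.

    logits: (seq_len, vocab) tensor of model outputs for one prompt.
    """
    probs = torch.softmax(logits[-1], dim=-1)
    p_good = probs[answer_ids].sum()   # aggregated mass on `answers`
    p_bad = probs[wrong_ids].sum()     # aggregated mass on `wrong_answers`
    return (p_good.log() - p_bad.log()).item()

# Toy vocabulary of four tokens where ids 0-1 play the role of `answers`.
toy_logits = torch.tensor([[2.0, 2.0, 0.0, 0.0]])
score = greater_than_signal(toy_logits, [0, 1], [2, 3])  # log(e^2) = 2.0
```

The task metric is then the difference of this quantity between the clean and corrupt prompts.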
### E.2 Indirect Object Identification (IOI) Task
Our IOI setup follows the standard indirect object identification paradigm for mechanistic interpretability (Elhage et al., 2021; Conmy et al., 2023). Each example is generated by combining:
- a pair of names $(A,B)$ , e.g. (" Mary", " John");
- a natural-language template with placeholders [A], [B], and [S].
We instantiate templates such as:
- "Then, [B] and [A] went to the park. [S] gave a ball to"
- "When [B] and [A] got a snack at the cafe, [S] decided to give it to"
- "After the lunch, [B] and [A] went to the mall. [S] gave a gift to"
by sampling a name pair and substituting $[A]$ and $[B]$ , then choosing the subject $[S]$ (either one of the pair). The correct continuation is the indirect object, i.e. the other member of the pair.
For example, with $(A,B)=(\texttt{" John"},\texttt{" Mary"})$ and $S=B$ , one instance is:
Then, Mary and John went to the park. Mary gave a ball to
The correct continuation is " John", while " Mary" and any distractor names are treated as incorrect candidates.
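The sampling procedure above can be sketched as follows; the name pool, helper name, and return format are illustrative, not our actual data-generation code.

```python
import random

TEMPLATES = [
    "Then, [B] and [A] went to the park. [S] gave a ball to",
    "When [B] and [A] got a snack at the cafe, [S] decided to give it to",
    "After the lunch, [B] and [A] went to the mall. [S] gave a gift to",
]
NAMES = ["Mary", "John", "Alice", "Bob"]  # hypothetical name pool

def make_ioi_example(rng: random.Random) -> dict:
    """Sample a template and a name pair, substitute [A]/[B], choose the
    subject [S] from the pair; the other member is the correct answer."""
    template = rng.choice(TEMPLATES)
    a, b = rng.sample(NAMES, 2)
    s = rng.choice([a, b])
    indirect_object = b if s == a else a
    prompt = template.replace("[A]", a).replace("[B]", b).replace("[S]", s)
    return {"prompt": prompt,
            "answer": " " + indirect_object,   # the indirect object
            "wrong_answer": " " + s}           # the repeated subject

example = make_ioi_example(random.Random(0))
```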
In the OLMo experiments, to further test the capability of our approach, we use a different set of IOI tasks with increased complexity and prompt length. Example templates include:
- "After several months without any contact due to conflicting schedules and unexpected personal obligations, [B] and [A] finally met again at the park, where they spent a long afternoon catching up on past events, sharing stories, and reflecting on how much had changed. As the day came to an end, [S] gave a ball to"
- "Although [B] and [A] had previously been involved in a long and emotionally charged argument that left several issues unresolved, they agreed to meet in order to clarify their misunderstandings. After a tense but honest conversation, [S] said to"
### E.3 Docstring Task
We also test the OLMo models on a more complex Docstring task (Heimersheim and Janiak, 2023; Conmy et al., 2023), where the model needs to attend to a specific argument for a specified function in order to complete a Docstring. Similarly to the Greater Than task, each example contains a clean prompt, a corrupt prompt, and two disjoint sets of candidate continuations. A typical entry is: { "clean": "def model(self, results, old, option): """ stage agency security vision spot tone joy session river unit :param results: bone paper selection sky :param old: host action hell miss :param", "corrupt": "def model(self, command, output, state): """ stage agency security vision spot tone joy session river unit :param old: bone paper selection sky :param results: host action hell miss param", "answers": [" option"], "wrong_answers": [" results"," old"] }
## Appendix F Cross-Layer-Transcoder
| Category | Setting |
| --- | --- |
| Model | GPT-2 (HookedTransformer) |
| Input dimension ( $d_{\text{in}}$ ) | 768 |
| Latent dimension ( $d_{\text{latent}}$ ) | 24 576 |
| Expansion factor | 32 |
| Context size | 64 |
| Batch size (tokens) | 1 024 |
| Precision | Mixed (FP32 / AMP) |
| Device | CUDA |
| Distributed training | DDP |
| Optimizer | Adam |
| Learning rate | $2\times 10^{-4}$ |
| Adam $\beta_{1}$ / $\beta_{2}$ | 0.9 / 0.999 |
| Learning rate warm-up | Cosine (1 000 steps) |
| Learning rate decay steps | 1 874 |
| Final LR scale | 0.1 |
| $L_{0}$ coefficient | 2 |
| Optimal $L_{0}$ | 3 |
| $L_{0}$ warm-up | Linear (18 749 steps) |
| Dead feature penalty | $10^{-5}$ |
| Dead feature window | 250 |
Table 3: Training configuration for the GPT-2 cross-layer-transcoders.
To implement a cross-layer transcoder, let $\mathbf{h}_{\ell}\in\mathbb{R}^{d_{\text{model}}}$ denote the input to the MLP at layer $\ell$ for a single token position. This representation is projected into a sparse feature space via an encoder,
$$
\mathbf{z}_{\ell}=\mathrm{ReLU}\!\left(\mathbf{W}_{\mathrm{enc}}^{\ell}\mathbf{h}_{\ell}+\mathbf{b}_{\mathrm{enc}}^{\ell}\right)\in\mathbb{R}^{d_{\text{features}}}, \tag{8}
$$
where $\mathbf{W}_{\mathrm{enc}}^{\ell}\in\mathbb{R}^{d_{\text{features}}\times d_{\text{model}}}$ and $\mathbf{b}_{\mathrm{enc}}^{\ell}\in\mathbb{R}^{d_{\text{features}}}$ are layer-specific encoder parameters.
The CLT reconstructs the MLP output at a target layer $\ell^{\prime}$ by linearly aggregating feature activations originating from all preceding layers,
$$
\hat{\mathbf{m}}_{\ell^{\prime}}=\sum_{\ell\leq\ell^{\prime}}\mathbf{W}_{\mathrm{dec}}^{\ell\rightarrow\ell^{\prime}}\mathbf{z}_{\ell}+\mathbf{b}_{\mathrm{dec}}^{\ell^{\prime}}, \tag{9}
$$
where $\mathbf{W}_{\mathrm{dec}}^{\ell\rightarrow\ell^{\prime}}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{features}}}$ denotes the decoder mapping from layer $\ell$ to layer $\ell^{\prime}$ .
The summation over layers reflects the fact that a given semantic feature may manifest in different representations across multiple MLP layers. For example, a feature that emerges in the MLP at layer $\ell$ may reappear, potentially in a transformed form, in the outputs of subsequent MLPs. Without accounting for these layer-dependent variations, such duplicated representations would lead to redundant nodes in the attribution graph. By allowing features to be represented differently across layers while being linked through a shared latent space, the cross-layer transcoder avoids this duplication and yields a more compact and interpretable attribution structure. For a detailed comparison between cross-layer transcoders and standard transcoders, we refer the reader to Lindsey et al. (2025a).
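A minimal sketch of the encoder-decoder structure of Eqs. (8) and (9), with toy dimensions in place of the Table 3 configuration:

```python
import torch

class CrossLayerTranscoder(torch.nn.Module):
    """Sketch of Eqs. (8)-(9): one encoder per layer, and a decoder from
    every layer l to every later layer l' (toy sizes, not Table 3)."""
    def __init__(self, n_layers=4, d_model=16, d_features=64):
        super().__init__()
        self.n_layers = n_layers
        self.enc = torch.nn.ModuleList(
            torch.nn.Linear(d_model, d_features) for _ in range(n_layers))
        # dec["l->l'"] maps features of layer l to the MLP output of l' >= l
        self.dec = torch.nn.ModuleDict({
            f"{l}->{lp}": torch.nn.Linear(d_features, d_model, bias=False)
            for l in range(n_layers) for lp in range(l, n_layers)})
        self.dec_bias = torch.nn.Parameter(torch.zeros(n_layers, d_model))

    def forward(self, h):  # h: (n_layers, d_model), one token position
        # Eq. (8): sparse feature activations per layer
        z = [torch.relu(self.enc[l](h[l])) for l in range(self.n_layers)]
        # Eq. (9): reconstruct each MLP output from all preceding layers
        m_hat = [self.dec_bias[lp]
                 + sum(self.dec[f"{l}->{lp}"](z[l]) for l in range(lp + 1))
                 for lp in range(self.n_layers)]
        return torch.stack(m_hat), z

clt = CrossLayerTranscoder()
m_hat, z = clt(torch.randn(4, 16))
```

Features from layer $\ell$ reach every target layer $\ell^{\prime}\geq\ell$ through a separate decoder, which is what allows a feature to be represented differently across layers.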
Following the training procedure proposed by Anthropic (Ameisen et al., 2025), the final objective combines reconstruction accuracy with sparsity and dead-feature regularization:
$$
\displaystyle\mathcal{L}= \displaystyle\underbrace{\sum_{\ell^{\prime}}\left\|\hat{\mathbf{m}}_{\ell^{\prime}}-\mathbf{m}_{\ell^{\prime}}\right\|_{2}^{2}}_{\text{MSE reconstruction}} \displaystyle+\lambda_{0}\underbrace{\sum_{\ell}\tanh\!\big(C\,(\mathbf{z}_{\ell}\odot\|\mathbf{W}_{\mathrm{dec}}^{\ell}\|)\big)}_{\text{$L_{0}$ sparsity}} \displaystyle+\lambda_{\mathrm{df}}\underbrace{\sum_{\ell}\mathrm{ReLU}\!\Big(\exp(\tau)-\mathbf{h}_{\ell}^{\mathrm{pre}}\Big)\|\mathbf{W}_{\mathrm{dec}}^{\ell}\|}_{\text{dead-feature penalty}}, \tag{10}
$$
where $\mathbf{W}_{\mathrm{dec}}^{\ell}$ denotes the concatenated decoder weights associated with layer $\ell$ , $\mathbf{h}_{\ell}^{\mathrm{pre}}$ are the corresponding pre-activation values, $\tau$ is a threshold parameter, and $C$ is a scaling constant. The hyperparameters $\lambda_{0}$ and $\lambda_{\mathrm{df}}$ control the strength of the sparsity and dead-feature regularization terms. We initialize the weights following the circuits updates of Conerly et al. (2025). The encoder bias is initialized so that a fixed proportion of the features is active at initialization. We provide in Figure 16 the training curves of the sparsity value, the sparsity coefficient, the explained variance, and the number of dead features. We hope this can help the community train their own cross-layer transcoders.
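The objective in Eq. (10) can be sketched as below; shapes and hyperparameter values are illustrative, and `dec_norms[l]` stands in for the per-feature norms $\|\mathbf{W}_{\mathrm{dec}}^{\ell}\|$.

```python
import math
import torch

def clt_loss(m_hat, m, z, dec_norms, pre_acts,
             lam0=2.0, lam_df=1e-5, C=10.0, tau=0.0):
    """Sketch of Eq. (10). Per layer l: z[l] are feature activations,
    dec_norms[l] the L2 norm of each feature's concatenated decoder
    weights, pre_acts[l] the encoder pre-activations. Hyperparameter
    values here are illustrative."""
    # MSE reconstruction term over all target layers l'
    mse = sum(((mh - mt) ** 2).sum() for mh, mt in zip(m_hat, m))
    # tanh-saturated L0 sparsity term: tanh(C * (z * ||W_dec||))
    sparsity = sum(torch.tanh(C * zl * nl).sum()
                   for zl, nl in zip(z, dec_norms))
    # dead-feature penalty: ReLU(exp(tau) - pre-activation) * ||W_dec||
    dead = sum((torch.relu(math.exp(tau) - pl) * nl).sum()
               for pl, nl in zip(pre_acts, dec_norms))
    return mse + lam0 * sparsity + lam_df * dead

# Toy check: perfect reconstruction, all features off, zero pre-activations.
zeros5 = [torch.zeros(5), torch.zeros(5)]
loss = clt_loss([torch.zeros(3)] * 2, [torch.zeros(3)] * 2,
                z=zeros5, dec_norms=[torch.ones(5)] * 2, pre_acts=zeros5)
```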
(a) $L_{0}$ vs steps
(b) $L_{0}$ coefficient vs steps
(c) Dead features vs steps
(d) Explained variance vs steps
Figure 16: Training dynamics of the cross-layer transcoder, showing sparsity, regularization strength, dead features, and reconstruction quality over training.
## Appendix G Attribution-Graph
Following Ameisen et al. (2025), we define the attribution score between feature $n$ at layer $\ell$ and position $k$ , and feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$ , as
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}=\sum_{\ell\leq s\leq\ell^{\prime}}f_{k,n}^{\ell\rightarrow s}\;J_{s,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}, \tag{11}
$$
where $f_{k,n}^{\ell\rightarrow s}$ denotes the decoder vector associated with feature $n$ projecting from layer $\ell$ to layer $s$ , $J_{s,k}^{\ell^{\prime},k^{\prime}}$ is the Jacobian mapping the MLP output at $(s,k)$ to the MLP input at $(\ell^{\prime},k^{\prime})$ , and $g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}$ is the corresponding encoder feature at layer $\ell^{\prime}$ and position $k^{\prime}$ . The sum over intermediate layers $s$ reflects the cross-layer mapping of the cross-layer transcoder.
The Jacobian is computed during a modified forward pass in which all nonlinear operations, including normalization layers, attention mechanisms, and MLPs, are frozen using stop-gradient operations. The resulting attribution graph is pruned by retaining only those features that cumulatively explain $80\%$ of the contribution to the final logit, and only those edges that account for $95\%$ of the total edge-level effect. All attribution computations are performed using the circuit-tracer library (Hanna et al., 2025). For a complete description of the attribution graph computation and pruning, we refer the reader to Ameisen et al. (2025).
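The cumulative-share pruning rule (80% of node contributions, 95% of edge effects) can be sketched as follows; this is an illustrative implementation, not the circuit-tracer API.

```python
import numpy as np

def prune_by_cumulative_share(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Indices of the highest-|score| items that together account for
    `threshold` of the total absolute contribution (e.g. 0.80 for
    features, 0.95 for edges)."""
    order = np.argsort(-np.abs(scores))          # sort by descending |score|
    csum = np.cumsum(np.abs(scores[order]))      # running contribution
    # smallest k such that the top-k items reach the target share
    k = int(np.searchsorted(csum, threshold * csum[-1])) + 1
    return order[:k]

# Toy check: items 0 and 2 together carry >80% of the total contribution.
kept = prune_by_cumulative_share(np.array([5.0, 0.1, 3.0, -1.0, 0.2]), 0.80)
```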
For visualization and autointerp, we implemented our own pipeline. In Figure 17, we show a screenshot of the interface for the 'The opposite of "large" is "' attribution graph. The features are colored with respect to their corresponding clusters.
Figure 17: Circuit-tracing interface example for the prompt 'The opposite of "large" is "' with GPT2-sparse.
## Appendix H Graph: The opposite of "large" is "
We obtain a replacement score of 0.82, with 459 features identified before pruning and 82 features remaining after pruning. The majority of features in the resulting attribution graph fall into four dominant clusters:
- Opposition cluster: features associated with opposition and comparison, primarily localized at the token position corresponding to opposite.
- Magnitude cluster: features related to notions of size (e.g., large, big, full, medium), predominantly located in the residual stream at the large token position.
- Bracket cluster: features that activate on tokens enclosed in brackets.
- Final-logit cluster: mainly the final logit itself and a couple of features that activate before the token "small" or related terms.
In the boxes below, we present the top activations of representative feature sets for each cluster.
Feature 1117 ”Opposite” cluster in Washington has now adopted the wider measure of student debt outstanding. This new the situation in Syria, Iran and the wider region. ”The recharged by the wider dense forests of Sanjay Van and its overflow drained public, with interesting accounts of Oswald’s demeanor at this significant moment has a slightly wider range. Specifically, the Atom-powered NANO 56 becoming part of the wider Seven Years’ War in which Britain and France
Feature 1337 ”Opposite” cluster opposite, piece of Mexico’s cultural identity. I made the hour opposite shows, or something bigger, “where there’s villains opposite sides of Mars in 2004 and used their instruments to discover geologic evidence opposite, but not anymore. Now everything he says to me is some kind opposite direction, and had little trouble finding space at the campsites. always seem to be just the opposite. show a growing trend to cast “no” votes , opposing how much salary and the occupation of the opposing forces was generally limited to mutual observation. work hand in hand for the purpose of opposing all movements of the thinking part the defense’s inability to stop opposing run games. The Bills have ing opposing quarterbacks. The Seahawks not only had depth, they were versatile. to win more hand battles particularly when the opposing tackle neutralizes his initial
Feature 901 ”Large” cluster Let’s be honest: When someone advocates for large-scale Muslim robot provides a tragicomic reminder of why RWD needs to consider large as what kind of social safety nets should be in place to protect people from large advocates to limit the power of large, established corporations, analysts say. of large law firms is that they are so great that the only reason anyone that by scaling up tests, the method would be conducive for use on larger
Feature 933 ”Large” cluster people healthy and anticipating health issues before they become a problem . Big Data is Big brown bucks with funny accents.” Judy flinched at BIG UP UBUNTU: Ubuntu releases are named after industry.<|endoftext|> BIG LEAGUE: Barron’s Says The they need to submit their content in the same way . Big enough apps and offering alternatives routes . Big data and optical fiber
Feature 1004 ”Large” cluster guide said was ? full of drinking saloons, dime museums, small would have 2 mana sources next turn (unless his hand was full of fast ’s house, it ’s full of adventure itself.? statement that all German Catholics had a right to full transparency” glimpsing a lobby full of construction debris. The front hallway was full of Jokubas had recently been reading a newspaper article which was full of
Feature 412 “Brackets” cluster group answered either “very” or “somewhat” attached – except some work colleagues. Wilcox said she found it ‘highly unlikely very rare , “very likely ,” “high risk,” she says. ulent. Pentagon spokesman Peter Cook said the sample was “ on of PopMatters called the album “brilliant” and said arlene Lowe, described him as being “one of my biggest supporters”.
Feature 518 “Brackets” cluster Kerry said Washington and Hanoi will “continue to have differences in opinions the United States will “take care of it.” He told reporters after the legislation would “provide new enforcement tools for protecting our citizens and will help Gary Ross, said in a statement that the Air Force is currently “short said, and Syrian President Bashar al-Assad would “have to go”. introduces politics into consumer policies,” said Palmor, adding that it would ”