# Sparse Attention Post-Training for Mechanistic Interpretability
**Authors**: Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf
Abstract
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4\%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organised and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to $100\times$ fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
Machine Learning, ICML
1 Introduction
Scaling has driven major advances in artificial intelligence, with ever-larger models trained on internet-scale datasets achieving remarkable capabilities across domains. Large language models (LLMs) now underpin applications from text generation to question answering, yet their increasing complexity renders their internal mechanisms largely opaque (Bommasani, 2021). Methods of mechanistic interpretability have been developed to address this gap by reverse-engineering neural networks to uncover how internal components implement specific computations and behaviors. Recent advances in this area have successfully identified interpretable circuits, features, and algorithms within LLMs (Nanda et al., 2023; Olsson et al., 2022), showing that large complex models can, in part, be understood mechanistically, opening avenues for improving transparency, reliability, and alignment (Bereska and Gavves, 2024).
<details>
<summary>x1.png Details</summary>

### Visual Description
## Neural Network Diagram: Base Model vs. Sparse Model
### Overview
The image presents two diagrams illustrating the architecture and connectivity of a "Base Model" and a "Sparse Model" neural network. Both models are depicted as solving the arithmetic problem "36 + 28". The Base Model shows dense connections between layers, while the Sparse Model shows significantly fewer connections, achieved through sparsity-regularized fine-tuning. An arrow indicates the transformation from the Base Model to the Sparse Model.
### Components/Axes
* **Titles:** "Base Model" (top), "Sparse Model" (bottom)
* **Layers:** Both models have layers labeled "Layer 0", "Layer 1", "Layer 2", and "Layer 3". These labels are positioned vertically to the left of each model's network diagram.
* **Nodes:** Each layer consists of a series of nodes, represented as small circles.
* **Connections:** The connections between nodes in adjacent layers are represented by blue lines. The Base Model has many connections, while the Sparse Model has very few.
* **Input/Output:** Both models show the input "36 + 28 =" at the bottom. The Base Model shows the output "?????", while the Sparse Model shows the output "?????".
* **Solution:** Above the "Layer 3" layer, both models show the solution "36 + 28 = 0 0 0 6 4". An arrow indicates the direction of the solution.
* **Transformation Arrow:** A curved arrow on the left side of the image indicates the transformation from the Base Model to the Sparse Model, labeled "Sparsity-Regularised Finetuning".
### Detailed Analysis
**Base Model:**
* **Input:** "36 + 28 = ?????".
* **Layer 0:** Seven nodes.
* **Layer 1:** Seven nodes.
* **Layer 2:** Seven nodes.
* **Layer 3:** Seven nodes.
* **Output:** "36 + 28 = 0 0 0 6 4".
* **Connections:** Dense connections between all nodes in adjacent layers. Each node in a layer is connected to almost every node in the next layer.
**Sparse Model:**
* **Input:** "36 + 28 = ?????".
* **Layer 0:** Seven nodes.
* **Layer 1:** Seven nodes.
* **Layer 2:** Seven nodes.
* **Layer 3:** Seven nodes.
* **Output:** "36 + 28 = 0 0 0 6 4".
* **Connections:** Sparse connections between nodes. Only a few nodes in each layer are connected to nodes in the next layer. The connections are concentrated in the lower layers (Layer 0 and Layer 1).
**Transformation:**
* The arrow labeled "Sparsity-Regularised Finetuning" indicates that the Sparse Model is derived from the Base Model through a process of sparsity regularization and fine-tuning.
### Key Observations
* The Base Model has a fully connected architecture, while the Sparse Model has a highly sparse architecture.
* The sparsity in the Sparse Model is achieved through sparsity-regularized fine-tuning.
* Both models are designed to solve the same arithmetic problem.
* The Sparse Model retains the ability to solve the problem despite having significantly fewer connections.
### Interpretation
The diagrams illustrate the concept of sparsity in neural networks. Sparsity regularization is a technique used to reduce the number of connections in a neural network, which can lead to several benefits, including:
* **Reduced computational cost:** Fewer connections mean fewer parameters to train and fewer operations to perform during inference.
* **Improved generalization:** Sparse models are less likely to overfit the training data, which can lead to better performance on unseen data.
* **Increased interpretability:** Sparse models are often easier to interpret because the connections that remain are more likely to be important.
The image demonstrates that it is possible to create a sparse model that performs as well as a dense model, while also enjoying the benefits of sparsity. The "Sparsity-Regularised Finetuning" process is crucial for achieving this result. The fact that both models arrive at the same solution "0 0 0 6 4" suggests that the sparse model has successfully retained the essential information needed to solve the problem.
</details>
Figure 1: Visualised attention patterns for a 4-layer toy model trained on a simple 2-digit addition task. The main idea of this work is to induce sparse attention between tokens via a post-training procedure that optimizes for attention sparsity while maintaining model performance. In this example, while both models are able to correctly predict the sum, the sparse model solves the problem with a naturally interpretable circuit. Details of this toy setup and more examples are provided in Appendix A.
However, interpretability is bottlenecked by the model itself: even with sophisticated reverse-engineering techniques that can faithfully reveal internal algorithms, the underlying computations implemented by large models can still remain highly complex and uninterpretable. Circuits for seemingly simple tasks may span hundreds of interacting attention heads and MLPs with densely intertwined contributions across layers (Conmy et al., 2023), and features can influence each other along combinatorially many attention-mediated paths, complicating attention attribution (Kamath et al., 2025). To exemplify this, Figure 1 (top) illustrates the attention patterns of a small, single-head transformer trained on a simple two-digit addition task. Here, the model has learned to solve the task in a highly diffuse manner, where information about each token is dispersed across all token locations, rendering the interpretation of the underlying algorithm extremely difficult even in this simple case.
The crux of the problem is that models are not incentivised to employ simple algorithms during training. In this work, we advocate for directly embedding interpretability constraints into model design in a way that induces simple circuits while preserving performance. We focus our analysis on attention mechanisms and investigate sparsity regularisation on attention patterns, originally proposed in (Lei et al., 2025), as an inductive bias. To demonstrate how sparse attention patterns can give rise to interpretable circuits, we return to the two-digit addition example: Figure 1 (bottom) shows the attention patterns induced by penalising attention edges during training. Here, the sparsity inductive bias forces the model to solve the problem with much smaller, intrinsically interpretable computation circuits.
In this work, we investigate using this sparsity regularisation scheme as a post-training strategy for pre-trained LLMs. We propose a practical method for fine-tuning existing models without re-running pretraining, offering a flexible way to induce sparse attention patterns and enhance interpretability. We show, on models of up to 7B parameters, that our proposed procedure preserves the performance of the base models on pretraining data while reducing the effective attention map to less than $0.5\%$ of its edges. To evaluate our central hypothesis that sparse attention facilitates interpretability, we consider two complementary settings. First, we study circuit discovery, where the objective is to identify the minimal set of components responsible for task performance (Conmy et al., 2023). We find that sparsified models yield substantially simpler computational graphs: the resulting circuits explain model behaviour using up to four times fewer attention heads and up to two orders of magnitude fewer edges. Second, using cross-layer transcoders (Ameisen et al., 2025), we analyse attribution graphs, which capture feature-level interactions across layers. In this setting, sparse attention mitigates the attention attribution problem by making it possible to identify which attention heads give rise to a given edge, owing to the reduced number of components mediating each connection. We argue that this clarity enables a tighter integration of feature-based and circuit-based perspectives, allowing feature interactions to be understood through explicit, tractable circuits. Taken together, these results position attention sparsity as an effective and practical inductive tool for surfacing the minimal functional backbone underlying model behaviour.
2 Related Work
2.1 Sparse Attention
As self-attention is a key component of the ubiquitous Transformer architecture, a large number of variants of attention mechanisms have been explored in the literature. Related to our approach are sparse attention methods, which are primarily designed to alleviate the quadratic scaling of vanilla self-attention. These methods typically rely on masks based on fixed local and strided patterns (Child et al., 2019) or sliding-window and global attention patterns (Beltagy et al., 2020; Zaheer et al., 2020) to constrain the receptive field of each token. While these approaches are successful in reducing the computational complexity of self-attention, they require hand-defined heuristics that do not reflect the internal computations learned by the model.
Beyond these fixed-pattern sparse attention methods, Top-$k$ attention, which enforces sparsity by dynamically selecting the $k$ most relevant keys per query based on their attention scores, has also been explored (Gupta et al., 2021; DeepSeek-AI, 2025). While Top-$k$ attention enables learnable sparse attention, the necessity to specify $k$ limits its scope for interpretability for two reasons. First, selecting the optimal $k$ is difficult, and setting $k$ too low can degrade model performance. Second, and more fundamentally, Top-$k$ attention does not allow the model to choose different $k$ for different attention heads based on the context. We argue that this flexibility is crucial for maintaining model performance.
More recently, gated attention mechanisms (Qiu et al., 2025) provide a scalable and performant framework for inducing sparse attention. In particular, Lei et al. (2025) introduce a sparsity regularisation scheme for world modelling that reveals sparse token dependencies. We adopt this method and examine its role as an inductive bias for interpretability.
2.2 Circuit Discovery
Mechanistic interpretability seeks to uncover how internal components of LLMs implement specific computations. Ablation studies assess performance drops from removing components (Nanda et al., 2023), activation patching measures the effect of substituting activations (Zhang and Nanda, 2023), and attribution patching scales this approach via local linearisation (Syed et al., 2024). Together, these approaches allow researchers to isolate sub-circuits, minimal sets of attention heads and MLPs that are causally responsible for a given behavior or task (Conmy et al., 2023). Attention itself plays a dual role: it both routes information and exposes interpretable relational structure, making it a key substrate for mechanistic study. Our work builds on this foundation by leveraging sparsity to simplify these circuits, amplifying the interpretability of attention-mediated computation while preserving model performance.
2.3 Attribution Graph
Mechanistic interpretability has gradually shifted from an emphasis on explicit circuit discovery towards the analysis of internal representations and features. Recent work on attribution graphs and circuit tracing seeks to reunify these perspectives by approximating MLP outputs as sparse linear combinations of features and computing causal effects along linear paths between them (Dunefsky et al., 2024; Ameisen et al., 2025; Lindsey et al., 2025b). This framework enables the construction of feature-level circuits spanning the computation from input embeddings to final token predictions. Within attribution graphs, edges correspond to direct linear causal relationships between features. However, these relationships are mediated by attention heads that transmit information across token positions. Identifying which attention heads give rise to a particular edge, and understanding why they do so, is essential, as this mechanism forms a fundamental component of the computational graph (Kamath et al., 2025). A key limitation of current attribution-based approaches is that individual causal edges are modulated by dozens of attention components. We show that this leads to feature-to-feature influences that are overly complex, rendering explanations in terms of other features in the graph both computationally expensive and conceptually challenging.
3 Method
Our main hypothesis is that post-training existing LLMs to encourage sparse attention patterns leads to the emergence of more interpretable circuits. In order to instantiate this idea, we require a post-training pipeline that satisfies three main desiderata:
1. To induce sparse message passing between tokens, we need an attention mechanism that can "zero out" attention edges, which in turn enables effective $L_{0}$-regularisation on the attention weights. This is in contrast to the standard softmax attention mechanism, where naive regularisation would result in small but non-zero attention weights that still allow information flow between tokens.
2. The model architecture needs to be compatible with the original LLM such that the pre-trained LLM weights can be directly loaded at initialisation.
3. The post-training procedure needs to ensure that the post-trained models do not lose prediction performance compared to their fully-connected counterparts.
To this end, we leverage the Sparse Transformer architecture in the SPARTAN framework proposed in (Lei et al., 2025), which uses sparsity-regularised hard attention instead of the standard softmax attention. In the following subsections, we describe the Sparse Transformer architecture and the optimisation setup, highlighting how this approach satisfies the above desiderata.
3.1 Sparse Attention Layer
Given a set of token embeddings, the Sparse Transformer layer computes the key, query, and value embeddings, $\{k_{i},q_{i},v_{i}\}$ , via linear projections, analogous to the standard Transformer. Based on the embeddings, we sample a binary gating matrix from a learnable distribution parameterised by the keys and queries,
$$
A_{ij}\sim\mathrm{Bern}(\sigma(q_{i}^{T}k_{j})), \tag{1}
$$
where $\mathrm{Bern}(\cdot)$ is the Bernoulli distribution and $\sigma(\cdot)$ is the logistic sigmoid function. This sampling step can be made differentiable via the Gumbel Softmax trick (Jang et al., 2017). This binary matrix acts as a mask that controls the information flow across tokens. Next, the message passing step is carried out in the same way as standard softmax attention, with the exception that we mask out the value embeddings using the sampled binary mask,
$$
\mathrm{SparseAttn}(Q,K,V)=\bigg[A\odot\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})\bigg]V, \tag{2}
$$
where $d_{k}$ is the dimension of the key embeddings and $\odot$ denotes element-wise multiplication. During training, we regularise the expected number of edges between tokens based on the distribution over the gating matrix. Concretely, the expected number of edges for each layer can be calculated as
$$
\mathbb{E}\big[|A|\big]=\sum_{i,j}\sigma(q^{T}_{i}k_{j}). \tag{3}
$$
Note that during the forward pass, each entry of $A$ is a hard binary sample that zeros out attention edges, which serves as an effective $L_{0}$ regularisation. Moreover, since the functional form of the sparse attention layer after the hard sampling step is the same as standard softmax attention, pre-trained model weights can be directly used without alterations. Technically, the sampled $A$ does perturb the computation at initialisation, since some gates may start closed and block information flow that the pre-trained weights rely on. This can be mitigated by adding a positive bias term inside the sigmoid function to ensure all gates are open at initialisation. Experimentally, we found this to be unnecessary as the models quickly recover their original performance within a small number of gradient steps.
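The forward pass of Equations 1–3 can be sketched as follows. This is a minimal single-head NumPy sketch, not the authors' implementation: it draws the hard Bernoulli sample directly (omitting the Gumbel-Softmax relaxation needed for gradients), ignores causal masking, and the `bias` argument is the optional positive logit offset mentioned above.

```python
import numpy as np

def sparse_attention(Q, K, V, rng, bias=0.0):
    """One sparse attention head: sample a hard binary gate per edge
    (Eq. 1) and mask the standard softmax attention output (Eq. 2)."""
    d_k = K.shape[-1]
    gate_logits = Q @ K.T                                     # q_i^T k_j
    gate_probs = 1.0 / (1.0 + np.exp(-(gate_logits + bias)))  # sigma(q^T k + b)
    A = (rng.random(gate_probs.shape) < gate_probs).astype(float)  # Bernoulli

    # Standard softmax attention, then element-wise masking by A (Eq. 2).
    scores = (Q @ K.T) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)

    expected_edges = gate_probs.sum()  # regularisation target of Eq. 3
    return (A * attn) @ V, expected_edges

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = rng.normal(size=(3, T, d))
out, n_edges = sparse_attention(Q, K, V, rng)
```

In training, the Bernoulli sample would be replaced by a Gumbel-Softmax (straight-through) sample so that gradients flow into the gate logits, and `expected_edges` would enter the loss as the sparsity penalty.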
3.2 Constrained Optimisation
In order to ensure that the models do not lose prediction performance during the post-training procedure, as per desideratum 3, we follow the approach proposed in (Lei et al., 2025), which employs the GECO algorithm (Rezende and Viola, 2018). Originally developed in the context of regularising VAEs, the GECO algorithm places a constraint on the performance of the model and uses a Lagrangian multiplier to automatically find the right strength of regularisation during training. Concretely, we formulate the learning process as the following optimisation problem,
$$
\min_{\theta}\sum_{l}\mathbb{E}\big[|A_{l}|\big]\qquad s.t.\quad CE\leq\tau, \tag{4}
$$
where $A_{l}$ denotes the gating matrix at layer $l$, $CE$ is the standard next-token prediction cross-entropy loss, $\tau$ is the required target loss, and $\theta$ denotes the model parameters. In practice, we set this target as the loss of the pre-trained baseline models. We solve this optimisation problem via Lagrangian relaxation, yielding the following max-min objective,
$$
\max_{\lambda>0}\min_{\theta}\bigg[\sum_{l}\mathbb{E}\big[|A_{l}|\big]+\lambda(CE-\tau)\bigg]. \tag{5}
$$
This can be solved by taking gradient steps on $\theta$ and $\lambda$ alternately. During training, updating $\lambda$ automatically balances the strength of the sparsity regularisation: when $CE$ is lower than the threshold, $\lambda$ decreases, and hence relatively more weight is given to the sparsity regularisation term. This effectively acts as an adaptive schedule which continues to increase the strength of the regularisation until the model performance degrades. Here, the value of $\tau$ is selected as a hyperparameter to ensure that the sparse model's performance remains within a certain tolerance of the original base model. In practice, the choice of $\tau$ controls a trade-off between sparsity and performance: picking a tight $\tau$ can lead to a slower training process, whereas a higher tolerance can substantially speed up training at the cost of potentially harming model performance. In Appendix C, we provide further discussion on this optimisation process and its training dynamics.
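The alternating updates of Equation 5 can be illustrated on a toy problem. The sketch below is hypothetical: `ce` is a made-up stand-in for cross-entropy as a function of the fraction of edges kept, and the learning rates are arbitrary. It only demonstrates the mechanism: dual ascent on $\lambda$ drives the loss to the target $\tau$ while the primal step minimises the edge count.

```python
import numpy as np

# Toy stand-in: theta in [0, 1] is the fraction of attention edges kept.
# Pretend cross-entropy worsens as edges are removed (hypothetical model):
def ce(theta):
    return 2.2 + 0.5 * (1.0 - theta) ** 2

def d_ce(theta):
    return -1.0 * (1.0 - theta)  # derivative of ce w.r.t. theta

tau = 2.25                # target loss (constraint CE <= tau)
theta, lam = 1.0, 1.0     # start dense, unit multiplier
lr_theta, lr_lam = 0.05, 0.2

for _ in range(5000):
    # Primal step: descend on  theta + lam * (CE(theta) - tau).
    grad_theta = 1.0 + lam * d_ce(theta)
    theta = float(np.clip(theta - lr_theta * grad_theta, 0.0, 1.0))
    # Dual step: ascend lam on the constraint violation CE - tau;
    # lam shrinks while CE < tau, growing the relative sparsity pressure.
    lam = max(0.0, lam + lr_lam * (ce(theta) - tau))
```

At convergence the constraint is active, $CE(\theta) \approx \tau$, and $\theta$ settles at the sparsest value compatible with the target loss.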
3.3 Practical Considerations
One of the main strengths of our proposed method is that, architecturally, the only difference between a sparse Transformer and a normal one lies in how the dot-product attention is computed. As such, most practical training techniques for optimising Transformers can be readily adapted to our setting. In our experiments, we find the following techniques helpful for improving computational efficiency and training stability.
LoRA finetuning (Hu et al., 2022).
Low rank finetuning techniques can significantly reduce the computational requirements for training large models. In our experiments, we verify on a 7B parameter model that LoRA finetuning is sufficiently expressive for inducing sparse attention patterns.
FlashAttention (Dao, 2023).
FlashAttention has become a standard method for reducing the memory footprint of dot-product attention mechanisms. In Appendix B, we discuss how the sampled sparse attention can be implemented in an analogous manner.
Distillation (Gu et al., 2024).
Empirically, we find that adding an auxiliary distillation loss based on the KL divergence between the base model and the sparse model improves training stability and ensures that the behaviour of the model remains unchanged during post-training.
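The auxiliary distillation term can be sketched as a per-position KL divergence between the base and sparse models' next-token distributions, computed from their logits. The direction of the KL and the averaging scheme below are assumptions, as the text does not specify them.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(base_logits, sparse_logits):
    """KL(base || sparse) over next-token distributions, averaged over
    positions. Assumed form of the auxiliary stabiliser: the frozen base
    model acts as teacher, the sparse model as student."""
    p = softmax(base_logits)
    log_p = np.log(p)
    log_q = np.log(softmax(sparse_logits))
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 10))    # (positions, vocab) toy logits
sparse = rng.normal(size=(4, 10))
loss = distillation_kl(base, sparse)
```

This term is zero exactly when the two models produce identical distributions, so it penalises behavioural drift without constraining internal activations.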
<details>
<summary>x2.png Details</summary>

### Visual Description
## Bar Chart: Benchmark Comparison
### Overview
The image is a bar chart comparing the accuracy of two models, OLMo-7B and Sparse OLMo-7B, across four different benchmarks: TruthfulQA, PIQA, OpenBookQA, and ARC-Easy. The y-axis represents accuracy, ranging from 0.0 to 1.0. The x-axis represents the benchmark categories.
### Components/Axes
* **Title:** Benchmark Comparison
* **Y-axis:**
* Label: Accuracy
* Scale: 0.0 to 1.0, with increments of 0.2 (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
* **X-axis:**
* Categories: TruthfulQA, PIQA, OpenBookQA, ARC-Easy
* **Legend:** Located in the top-right corner.
* OLMo-7B (Green)
* Sparse OLMo-7B (Pink)
### Detailed Analysis
Here's a breakdown of the accuracy for each model on each benchmark:
* **TruthfulQA:**
* OLMo-7B (Green): Approximately 0.25
* Sparse OLMo-7B (Pink): Approximately 0.24
* **PIQA:**
* OLMo-7B (Green): Approximately 0.80
* Sparse OLMo-7B (Pink): Approximately 0.79
* **OpenBookQA:**
* OLMo-7B (Green): Approximately 0.37
* Sparse OLMo-7B (Pink): Approximately 0.35
* **ARC-Easy:**
* OLMo-7B (Green): Approximately 0.60
* Sparse OLMo-7B (Pink): Approximately 0.57
### Key Observations
* OLMo-7B consistently performs slightly better than Sparse OLMo-7B across all benchmarks.
* Both models achieve the highest accuracy on the PIQA benchmark.
* Both models perform the worst on the TruthfulQA benchmark.
* The difference in accuracy between OLMo-7B and Sparse OLMo-7B is relatively small for all benchmarks.
### Interpretation
The bar chart provides a direct comparison of the performance of OLMo-7B and its sparse variant on different question-answering benchmarks. The data suggests that while sparsity might offer benefits in terms of model size or computational efficiency, it comes at a slight cost in accuracy. The PIQA benchmark appears to be the easiest for both models, while TruthfulQA poses the greatest challenge. The relatively small differences in accuracy between the two models suggest that the sparse version retains most of the performance of the original model.
</details>
Figure 2: Comparison of model performance between the base OLMo model and the sparsified model evaluated on the various benchmarks. Across all tasks, the performance of the sparse model remains comparable with the base model despite using substantially fewer attention edges.
4 Experiments
To evaluate the effectiveness of our post-training pipeline, we finetune pre-trained LLMs and compare their prediction performance and interpretability before and after applying sparsity regularisation. We perform full finetuning on a GPT-2 base model (Radford et al., 2019) (124M parameters) on the OpenWebText dataset (Gokaslan and Cohen, 2019). To investigate the generality and scalability of our method, we perform LoRA finetuning on the larger OLMo-7B model (Groeneveld et al., 2024) on the Dolma dataset (Soldaini et al., 2024), which is the dataset on which the base model was trained. The GPT-2 model and the OLMo model are trained on sequences of length 64 and 512, respectively. In the following subsections, we first present a quantitative evaluation of model performance and sparsity after sparse post-training. We then conduct two interpretability studies, using activation patching and attribution graphs, to demonstrate that our method enables the discovery of substantially smaller circuits.
4.1 Model Performance and Sparsity
We begin by evaluating both performance retention and the degree of sparsity achieved by post-training. We set cross-entropy targets of 3.50 for GPT-2 (base model: 3.48) and 2.29 for OLMo (base model: 2.24). After training, the mean cross-entropy loss for both models remains within $\pm 0.01$ of the target, indicating that the dual optimisation scheme effectively enforces a tight performance constraint. To quantify the sparsity achieved by the models, we evaluate them on the validation split of their respective datasets and compute the mean number of non-zero attention edges per attention head. We find that the sparsified GPT-2 model activates, on average, only 0.22% of its attention edges, while the sparsified OLMo model activates 0.44%, indicating substantial sparsification in both cases. Table 1 provides a summary of the results. To further verify that this drastic reduction in message passing between tokens does not substantially alter model behaviour, we evaluate the sparsified OLMo model on a subset of the benchmarks used to assess the original model. As shown in Figure 2, the sparse model largely retains the performance of the base model across a diverse set of tasks. In sum, our results demonstrate that sparse post-training is effective in consolidating information flow into a small number of edges while maintaining a commensurate level of performance.
| Model | Base CE | Target CE ($\tau$) | Final CE | Active Edges |
| --- | --- | --- | --- | --- |
| GPT-2 | 3.48 | 3.50 | 3.501 | 0.22% |
| OLMo | 2.24 | 2.29 | 2.287 | 0.44% |
Table 1: Performance and sparsity of post-trained models. Final cross-entropy losses closely match the specified targets, while attention sparsity is substantially increased.
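The sparsity statistic reported in Table 1 can be computed along the following lines. This is an assumed formalisation: it counts sampled-open gates against the $T(T+1)/2$ causally admissible edges per head and averages over heads; the paper's exact normalisation may differ.

```python
import numpy as np

def active_edge_fraction(gates):
    """gates: binary gate mask of shape (heads, T, T) sampled at eval.
    Returns the mean fraction of active edges per head, counting only
    the lower triangle (causal attention admits T*(T+1)/2 edges)."""
    n_heads, T, _ = gates.shape
    causal = np.tril(np.ones((T, T)))
    possible = causal.sum()                  # T*(T+1)/2 edges per head
    active = (gates * causal).sum(axis=(1, 2))
    return float(np.mean(active / possible))
```

Averaging this quantity over validation batches yields a percentage directly comparable to the 0.22% (GPT-2) and 0.44% (OLMo) figures above.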
4.2 Circuit Discovery with Activation Patching
<details>
<summary>x3.png Details</summary>

### Visual Description
## Chart Type: Heatmap Grid of Attention Patterns
### Overview
The image presents a grid of heatmaps, visualizing attention patterns within two different GPT2 models: a standard GPT2 model and a "Sparse GPT2" model. Each heatmap represents the attention pattern of a specific layer and head within the model. The heatmaps use color intensity to indicate the strength of attention between different parts of the input sequence. The standard GPT2 model's attention patterns are displayed in a 6x6 grid, while the Sparse GPT2 model's attention patterns are displayed in a single row of 8 heatmaps.
### Components/Axes
* **Titles:**
* Top Section: "GPT2"
* Bottom Section: "Sparse GPT2"
* **Heatmap Labels:** Each heatmap is labeled with a string in the format "L[Layer Number]H[Head Number]", where:
* Layer Number ranges from 0 to 11.
* Head Number ranges from 0 to 11.
* **Color Scale:** The heatmaps use a color gradient, presumably from light to dark blue, to represent the strength of attention. Lighter shades indicate weaker attention, while darker shades indicate stronger attention.
* **Axes:** Each heatmap has an x and y axis, representing the input sequence positions.
### Detailed Analysis
**GPT2 (Top Section):**
The heatmaps are arranged in a 6x6 grid. The labels for each heatmap are as follows:
* **Row 1:**
* Column 1: L0H0
* Column 2: L0H2
* Column 3: L5H5
* Column 4: L0H7
* Column 5: L0H1
* Column 6: L0H10
* **Row 2:**
* Column 1: L7H6
* Column 2: L3H10
* Column 3: L4H8
* Column 4: L2H11
* Column 5: L5H8
* Column 6: L6H7
* **Row 3:**
* Column 1: L3H0
* Column 2: L5H9
* Column 3: L7H1
* Column 4: L2H10
* Column 5: L7H2
* Column 6: L8H10
* **Row 4:**
* Column 1: L6H6
* Column 2: L7H11
* Column 3: L2H9
* Column 4: L1H4
* Column 5: L6H11
* Column 6: L3H8
* **Row 5:**
* Column 1: L9H5
* Column 2: L6H9
* Column 3: L4H9
* Column 4: L1H2
* Column 5: L11H10
* Column 6: L4H3
* **Row 6:**
* Column 1: L3H11
* Column 2: L6H8
* Column 3: L4H7
* Column 4: L0H6
* Column 5: L3H1
* Column 6: L5H7
**Sparse GPT2 (Bottom Section):**
The heatmaps are arranged in a single row. The labels for each heatmap are as follows:
* Column 1: L0H5
* Column 2: L5H1
* Column 3: L4H11
* Column 4: L6H8
* Column 5: L5H5
* Column 6: L1H0
* Column 7: L6H9
* Column 8: L3H4
### Key Observations
* **Diagonal Dominance:** Many of the heatmaps, especially in the standard GPT2 model, show a strong diagonal pattern, indicating that each position in the input sequence attends strongly to itself.
* **Vertical/Horizontal Lines:** Some heatmaps show vertical or horizontal lines, suggesting that certain positions attend to all other positions or that all positions attend to a specific position.
* **Sparsity in Sparse GPT2:** The Sparse GPT2 model exhibits significantly sparser attention patterns compared to the standard GPT2 model. This is evident from the presence of fewer off-diagonal elements with high attention scores.
* **Layer and Head Variation:** The attention patterns vary significantly across different layers and heads, indicating that different parts of the model learn different types of relationships between input positions.
### Interpretation
The heatmaps visualize the internal workings of the GPT2 models, showing how different parts of the model attend to different parts of the input sequence. The diagonal dominance suggests that self-attention is a crucial mechanism in these models. The variations in attention patterns across layers and heads indicate that the model learns a hierarchy of relationships, with some layers focusing on local dependencies and others focusing on long-range dependencies. The sparsity in the Sparse GPT2 model suggests that it learns to focus on a smaller subset of relationships, potentially leading to more efficient computation or different generalization properties. The image provides insights into the attention mechanisms of GPT2 models and how sparsity can affect these mechanisms.
</details>
Figure 3: Attention patterns of the heads required to explain 90% of model behaviour on a copy task. The sparse model requires substantially fewer attention heads. Moreover, the selected heads exhibit the characteristic "induction head" pattern: each token attends to a previous token at a fixed relative offset, effectively copying information forward through the sequence, a pattern well known to implement the copy mechanism in transformer models. Equivalent plots for OLMo can be found in Appendix D.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Chart: Explained Effect vs. Number of Heads Kept for Different Models and Tasks
### Overview
The image presents four line charts comparing the "Explained Effect" against the "Number of Heads Kept" for different models (GPT-2 and OLMo-7B) and tasks (Greater Than, IOI, Docstring, and IOI Long). Each chart compares the performance of a standard model against a sparse version of the same model. The charts show how the explained effect changes as more heads are kept in the model.
### Components/Axes
* **X-axis (Horizontal):** "Number of Heads Kept". The scale varies across the charts.
* Greater Than and IOI: 0 to 100
* Docstring and IOI Long: 0 to 1000
* **Y-axis (Vertical):** "Explained Effect". The scale is consistent across all charts, ranging from 0.0 to 1.0.
* **Legends:**
* **Greater Than and IOI:**
* Blue line: "GPT-2"
* Tan line: "Sparse GPT-2"
* **Docstring and IOI Long:**
* Green line: "OLMo-7B"
* Pink line: "Sparse OLMo-7B"
* **Titles:**
* Top-left: "Greater Than"
* Top-middle-left: "IOI"
* Top-middle-right: "Docstring"
* Top-right: "IOI Long"
* **Annotations:** Each chart has a dashed horizontal line indicating the explained effect of the sparse model, along with a label indicating the multiplicative factor by which the standard model exceeds the sparse model's performance at that point.
### Detailed Analysis
**1. Greater Than**
* **GPT-2 (Blue):** The explained effect increases rapidly from 0 to approximately 0.9 between 0 and 50 heads kept, then plateaus around 0.95 as the number of heads kept increases to 100.
* **Sparse GPT-2 (Tan):** The explained effect increases rapidly from 0 to approximately 0.95 between 0 and 20 heads kept, then plateaus around 0.95 as the number of heads kept increases to 100.
* **Annotation:** "4.5x" is shown above a dashed line at approximately y=0.9, indicating that the standard GPT-2 model requires 4.5 times as many heads as the Sparse GPT-2 model to reach the 90% explained-effect threshold.
**2. IOI**
* **GPT-2 (Blue):** The explained effect increases from 0 to approximately 0.9 between 0 and 75 heads kept, then plateaus around 0.9 as the number of heads kept increases to 100.
* **Sparse GPT-2 (Tan):** The explained effect increases rapidly from 0 to approximately 0.9 between 0 and 25 heads kept, then plateaus around 0.95 as the number of heads kept increases to 100.
* **Annotation:** "2.2x" is shown above a dashed line at approximately y=0.9, indicating that the standard GPT-2 model requires 2.2 times as many heads as the Sparse GPT-2 model to reach the 90% explained-effect threshold.
**3. Docstring**
* **OLMo-7B (Green):** The explained effect increases from 0 to approximately 0.9 between 0 and 250 heads kept, then fluctuates between 0.8 and 1.0 as the number of heads kept increases to 1000.
* **Sparse OLMo-7B (Pink):** The explained effect increases from 0 to approximately 0.9 between 0 and 250 heads kept, then plateaus around 0.95 as the number of heads kept increases to 1000.
* **Annotation:** "2.2x" is shown above a dashed line at approximately y=0.9, indicating that the standard OLMo-7B model requires 2.2 times as many heads as the Sparse OLMo-7B model to reach the 90% explained-effect threshold.
**4. IOI Long**
* **OLMo-7B (Green):** The explained effect remains near 0 until approximately 400 heads are kept, then increases rapidly to approximately 0.95 between 400 and 600 heads kept, then plateaus around 1.0 as the number of heads kept increases to 1000.
* **Sparse OLMo-7B (Pink):** The explained effect remains near 0 until approximately 250 heads are kept, then increases rapidly to approximately 0.95 between 250 and 500 heads kept, then plateaus around 1.0 as the number of heads kept increases to 1000.
* **Annotation:** "1.4x" is shown above a dashed line at approximately y=0.9, indicating that the standard OLMo-7B model requires 1.4 times as many heads as the Sparse OLMo-7B model to reach the 90% explained-effect threshold.
### Key Observations
* In all four tasks, the sparse models reach a similar or slightly higher plateau in "Explained Effect" compared to their non-sparse counterparts.
* The "Greater Than" and "IOI" tasks show a more pronounced difference in the number of heads required to reach the plateau between the standard and sparse GPT-2 models.
* The "IOI Long" task shows a significant shift in the number of heads required for the OLMo-7B model to achieve a substantial explained effect compared to its sparse version.
* The annotations give the multiplicative factor by which the standard model's head count exceeds the sparse model's at the 90% explained-effect threshold.
### Interpretation
The charts suggest that sparse models can achieve comparable or even slightly better performance (in terms of "Explained Effect") than their non-sparse counterparts while requiring fewer heads to reach a similar level of performance. This implies that sparsity can lead to more efficient models, potentially reducing computational costs without sacrificing performance. The multiplicative factors indicated by the annotations quantify how many more heads the standard models need to match the sparse models' explained effect, providing a measure of the circuit-size reduction gained by sparsification. The differences in the number of heads required to reach the plateau between standard and sparse models may indicate that sparsity is more effective for certain tasks or model architectures. The "IOI Long" task, in particular, highlights the potential for sparsity to significantly shift the point at which the model begins to exhibit a substantial explained effect.
</details>
Figure 4: Logit attribution keeping only the top-$k$ attention heads. Dotted lines annotate the number of attention heads needed to explain 90% of the logit difference. Sparse models yield 1.4$\times$ to 4.5$\times$ smaller circuits. Shaded areas show standard error across 20 prompts.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Chart: Explained Effect vs. Number of Edges Kept for Different Models and Tasks
### Overview
The image presents four line charts comparing the "Explained Effect" as a function of the "Number of Edges Kept" for different models (GPT-2 and OLMo-7B) and tasks ("Greater Than", "IOI", "Docstring", and "IOI Long"). Each chart compares a standard model with its sparse counterpart. The x-axis (Number of Edges Kept) is on a logarithmic scale. The charts show how well the models perform as the number of edges is reduced.
### Components/Axes
* **Titles (Top of each chart):**
* Chart 1: "Greater Than"
* Chart 2: "IOI"
* Chart 3: "Docstring"
* Chart 4: "IOI Long"
* **Y-axis (Shared):**
* Label: "Explained Effect"
* Scale: 0.0 to 1.0, with tick marks at 0.0 and 0.5.
* **X-axis (Shared):**
* Label: "Number of Edges Kept"
* Scale: Logarithmic, ranging from 10^0 to 10^4 (Charts 1 & 2) and 10^0 to 10^5 (Charts 3 & 4).
* **Legends (Bottom-left of each chart):**
* Chart 1 & 2:
* Blue line: "GPT-2"
* Orange line: "Sparse GPT-2"
* Chart 3 & 4:
* Green line: "OLMo-7B"
* Pink line: "Sparse OLMo-7B"
* **Annotations:** Each chart has an annotation indicating the "x" factor, representing the ratio of edges kept between the dense and sparse models at the point where the explained effect plateaus.
### Detailed Analysis
**Chart 1: Greater Than**
* **GPT-2 (Blue):** The explained effect increases slowly from 0 to 1 as the number of edges kept increases from 10^0 to approximately 10^3.
* At 10^0 edges, the explained effect is approximately 0.1.
* At 10^1 edges, the explained effect is approximately 0.2.
* At 10^2 edges, the explained effect is approximately 0.6.
* At 10^3 edges, the explained effect is approximately 0.95.
* At 10^4 edges, the explained effect is approximately 1.0.
* **Sparse GPT-2 (Orange):** The explained effect increases rapidly from approximately 0.2 to 1 as the number of edges kept increases from 10^0 to approximately 10^2.
* At 10^0 edges, the explained effect is approximately 0.2.
* At 10^1 edges, the explained effect is approximately 0.9.
* At 10^2 edges, the explained effect is approximately 1.0.
* **Annotation:** "97.0x" is displayed above a dashed horizontal line at an explained effect of approximately 0.92.
**Chart 2: IOI**
* **GPT-2 (Blue):** The explained effect increases slowly from approximately 0.1 to 1 as the number of edges kept increases from 10^0 to approximately 10^3.
* At 10^0 edges, the explained effect is approximately 0.1.
* At 10^1 edges, the explained effect is approximately 0.3.
* At 10^2 edges, the explained effect is approximately 0.8.
* At 10^3 edges, the explained effect is approximately 0.95.
* At 10^4 edges, the explained effect is approximately 1.0.
* **Sparse GPT-2 (Orange):** The explained effect increases rapidly from approximately 0.3 to 1 as the number of edges kept increases from 10^0 to approximately 10^2.
* At 10^0 edges, the explained effect is approximately 0.3.
* At 10^1 edges, the explained effect is approximately 0.8.
* At 10^2 edges, the explained effect is approximately 1.0.
* **Annotation:** "42.8x" is displayed above a dashed horizontal line at an explained effect of approximately 0.92.
**Chart 3: Docstring**
* **OLMo-7B (Green):** The explained effect increases slowly from approximately 0 to 1 as the number of edges kept increases from 10^0 to approximately 10^5.
* At 10^0 edges, the explained effect is approximately 0.0.
* At 10^1 edges, the explained effect is approximately 0.02.
* At 10^2 edges, the explained effect is approximately 0.05.
* At 10^3 edges, the explained effect is approximately 0.15.
* At 10^4 edges, the explained effect is approximately 0.6.
* At 10^5 edges, the explained effect is approximately 0.95.
* **Sparse OLMo-7B (Pink):** The explained effect increases rapidly from approximately 0 to 1 as the number of edges kept increases from 10^0 to approximately 10^4.
* At 10^0 edges, the explained effect is approximately 0.0.
* At 10^1 edges, the explained effect is approximately 0.05.
* At 10^2 edges, the explained effect is approximately 0.1.
* At 10^3 edges, the explained effect is approximately 0.3.
* At 10^4 edges, the explained effect is approximately 0.9.
* At 10^5 edges, the explained effect is approximately 1.0.
* **Annotation:** "8.6x" is displayed above a dashed horizontal line at an explained effect of approximately 0.92.
**Chart 4: IOI Long**
* **OLMo-7B (Green):** The explained effect increases slowly from approximately 0 to 1 as the number of edges kept increases from 10^0 to approximately 10^5.
* At 10^0 edges, the explained effect is approximately 0.0.
* At 10^1 edges, the explained effect is approximately 0.05.
* At 10^2 edges, the explained effect is approximately 0.1.
* At 10^3 edges, the explained effect is approximately 0.2.
* At 10^4 edges, the explained effect is approximately 0.6.
* At 10^5 edges, the explained effect is approximately 0.95.
* **Sparse OLMo-7B (Pink):** The explained effect increases rapidly from approximately 0 to 1 as the number of edges kept increases from 10^0 to approximately 10^4.
* At 10^0 edges, the explained effect is approximately 0.0.
* At 10^1 edges, the explained effect is approximately 0.1.
* At 10^2 edges, the explained effect is approximately 0.2.
* At 10^3 edges, the explained effect is approximately 0.4.
* At 10^4 edges, the explained effect is approximately 0.8.
* At 10^5 edges, the explained effect is approximately 1.0.
* **Annotation:** "5.4x" is displayed above a dashed horizontal line at an explained effect of approximately 0.92.
### Key Observations
* In all four charts, the sparse models (Sparse GPT-2, Sparse OLMo-7B) achieve a similar "Explained Effect" to their dense counterparts (GPT-2, OLMo-7B) with significantly fewer edges.
* The "x" factor annotations indicate the ratio of edges required by the dense model compared to the sparse model to achieve a similar level of "Explained Effect".
* The "Greater Than" task shows the largest difference between the dense and sparse models (97.0x), while "IOI Long" shows the smallest difference (5.4x).
### Interpretation
The charts demonstrate that sparse models can achieve comparable performance to dense models while relying on significantly fewer attention edges. This suggests that many connections in dense models are redundant and can be pruned without significantly impacting performance. The "x" factor annotations quantify the degree of redundancy for each task. The "Greater Than" task appears to be the most amenable to sparsification, while "IOI Long" is the least. This could be due to the inherent complexity of each task and the specific architecture of the models. The data suggest that sparse models are a promising approach for reducing the computational cost and memory footprint of large language models.
</details>
Figure 5: Logit attribution per sentence keeping only the top-$k$ attention edges. Sparse models yield 5.4$\times$ to 97$\times$ smaller circuits. Shaded areas show standard error across 20 prompts.
We begin by outlining the experimental procedure used for circuit discovery. Activation patching (Nanda et al., 2023) is a widely used technique for identifying task-specific circuits in transformer models. In a typical setup, the model is evaluated on pairs of prompts: a clean prompt, for which the model predicts a correct target token, and a corrupted prompt that shares the overall structure of the clean prompt but is modified to induce an incorrect prediction. Here, the goal is to find the set of model components responsible for the model's preference for the correct answer over the wrong one, as measured by the logit difference between the corresponding tokens. In activation patching, individual model components, such as attention heads and individual edges, can be "switched off" by patching activations at specific positions. Circuit discovery amounts to finding a set of components whose replacement causes the model's prediction to shift from the correct to the corrupted answer.
Since searching over every possible subset of model components is infeasible due to the exponential number of potential subsets, we adopt a common heuristic to rank each model component. Specifically, for each individual component, we compute an importance score by replacing the activations of the component with the corrupted activations and measuring the effect on the logit difference. In our experiments, we use this ranking to select the top-$k$ components and intervene on the model by freezing all remaining components, with the goal of identifying the minimal set that accounts for at least 90% of the model's preference for the correct prediction. Note that these importance scores can be computed at two levels: (i) a single-sentence level, using a single pair of correct and corrupted inputs, and (ii) a global level, obtained by averaging scores across many task variants. In our experiments, we report results using single-sentence scores. In Appendix D, we also provide results using the global scores, which are largely consistent with our main results. There are also two standard approaches for freezing component activations: setting the activation to zero or replacing it with a mean activation value (Conmy et al., 2023). We evaluate both variants for each model and report results for the patching strategy that yields the smallest circuits.
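The ranking-and-patching procedure above can be sketched as follows. This is a toy illustration, not the paper's code: the `run` function stands in for a forward pass in which the listed components are patched with their corrupted activations, and the per-component contributions are made up.

```python
# Toy sketch of ranking-based circuit discovery via activation patching.
# Patching a component here simply removes its (made-up) contribution
# to the clean-prompt logit difference.
CONTRIB = {"h0": 6.0, "h1": 3.0, "h2": 1.0}

def run(prompt, patch=None):
    """Logit difference with the given components patched (toy stand-in)."""
    patched = set(patch or [])
    return sum(v for c, v in CONTRIB.items() if c not in patched)

def importance_scores(components):
    """Drop in logit difference when a single component is patched."""
    baseline = run("clean")
    return {c: baseline - run("clean", patch=[c]) for c in components}

def minimal_circuit(scores, threshold=0.9):
    """Greedily keep top-ranked components, patching the rest, until
    at least 90% of the clean logit difference is recovered."""
    baseline = run("clean")
    ranked = sorted(scores, key=scores.get, reverse=True)
    for k in range(1, len(ranked) + 1):
        kept = ranked[:k]
        rest = [c for c in ranked if c not in kept]
        if run("clean", patch=rest) >= threshold * baseline:
            return kept
    return ranked
```

With these made-up contributions, `minimal_circuit(importance_scores(["h0", "h1", "h2"]))` returns `["h0", "h1"]`: two of the three components already recover 90% of the logit difference.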
We first focus on the copy task with the following prompt: "AJEFCKLMOPQRSTVWZS, AJEFCKLMOPQRSTVWZ", where the model has to copy the letter S to the next token position. This task is well studied and is widely believed to be implemented by emergent induction heads (Elhage et al., 2021), which propagate token information forward in the sequence. Figure 3 illustrates the attention patterns of the set of attention heads that explains this prompt for the sparse and base GPT-2 models. See Appendix D for analogous results for the OLMo models. The sparse model admits a substantially smaller set of attention heads (9 heads) than its fully connected counterpart (61 heads). Moreover, the identified heads in the sparse model exhibit cleaner induction head patterns, with each token attending to a single prior position at a fixed relative offset. These results illustrate how sparsification facilitates interpretability under simple ranking-based methods and support our hypothesis that sparse post-training yields models that are more amenable to mechanistic interpretability techniques.
To further verify our hypothesis, we repeat the experiment on classical circuit discovery tasks. For GPT-2, we evaluate variants of the Indirect Object Identification (IOI) task, in which the model copies a person's name from the start of a sentence, and the Greater Than task, in which the model predicts a number that is larger than a previously mentioned number. To further assess the scalability of our approach, we investigate more challenging and longer-horizon tasks for OLMo, including a longer-context IOI task and a Docstring task where the model needs to predict an argument name in a docstring based on an implemented function. Details of each task can be found in Appendix E. Figures 4 and 5 show the fraction of model behaviour explained as a function of the number of retained model components (attention heads and attention edges, respectively). Across all tasks and models, the sparse models consistently produce significantly smaller circuits, as measured by the number of model components needed to explain 90% of model prediction. This further corroborates our claim that sparse models lead to simpler and more interpretable internal circuits.
4.3 Attribution-graph
Next, we present a more fine-grained, feature-level investigation of whether sparsity in attention leads to interpretable circuits in practice using cross-layer transcoders (CLTs). Since training CLTs on OLMo-7B is computationally prohibitive (at the time of writing, the largest open-source CLT is for Gemma-2B), we focus our analysis on the GPT-2 models. For the rest of the section, we perform our analysis on CLTs trained on the sparse and base GPT-2 models with an expansion factor of $32$, both achieving above an $80\%$ replacement score as measured with Circuit Tracer (Hanna et al., 2025). See Appendix F and G for details on training and visualisation.
We study the problem of attention attribution, which seeks to understand how edges between features are mediated. The key challenge here is that any given edge can be affected by a large number of model components, making mediation circuits difficult to analyse both computationally and conceptually: computationally, exhaustive enumeration is costly; conceptually, the resulting circuits are often large and uninterpretable. In this experiment, we demonstrate that sparse attention patterns induced via post-training substantially alleviate these challenges, as the vast majority of attention components have zero effect on the computation.
As in (Ameisen et al., 2025), we define the total attribution score between feature $n$ at layer $\ell$ and position $k$ , and feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$ as
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}=f_{k,n}^{\ell}\;J_{\ell,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}. \tag{6}
$$
Here, $f_{k,n}^{\ell}$ denotes the decoder vector corresponding to feature $n$ at layer $\ell$ and position $k$ , and $g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}$ is the corresponding encoder vector for feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$ . The term $J_{\ell,k}^{\ell^{\prime},k^{\prime}}$ is the Jacobian from the MLP output at $(\ell,k)$ to the MLP input at $(\ell^{\prime},k^{\prime})$ . This Jacobian is computed during a forward pass in which all nonlinearities are frozen using stop-gradient operations. Under this linearisation, the attribution score represents the sum over all linear paths from the source feature to the target feature.
To analyse how this total effect between two features is mediated by each model component, we define the component-specific attribution by subtracting the contribution of all paths that do not pass through the component:
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}(h)=f_{k,n}^{\ell}\;J_{\ell,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}-f_{k,n}^{\ell}\;\bigl[J_{\ell,k}^{\ell^{\prime},k^{\prime}}\bigr]_{h}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}.
$$
Here, $\bigl[J_{\ell,k}^{\ell^{\prime},k^{\prime}}\bigr]_{h}$ denotes a modified Jacobian computed under the same linearization as above, but with the specific attention component $h$ additionally frozen via stop-gradient. As such, these component-specific scores quantify how much each model component impacts a particular edge between features.
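The two attribution scores can be illustrated numerically. The sketch below uses random toy vectors, not real CLT features; in the paper, $J$ is the frozen-nonlinearity Jacobian between MLP sites. Decomposing the (linear) Jacobian into paths through and around a component $h$ shows why freezing $h$ isolates the attribution mediated by $h$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
f = rng.normal(size=d)  # decoder vector of the source feature
g = rng.normal(size=d)  # encoder vector of the target feature

# With all nonlinearities frozen, the source-to-target map is linear and
# decomposes into paths through component h and paths avoiding it.
J_through_h = rng.normal(size=(d, d))
J_avoiding_h = rng.normal(size=(d, d))
J = J_through_h + J_avoiding_h   # total linearised Jacobian

total = f @ J @ g                # Eq. (6): total attribution score
frozen_h = f @ J_avoiding_h @ g  # score with h additionally stop-gradiented
mediated_by_h = total - frozen_h # component-specific attribution

# By linearity, this equals the attribution routed through h alone.
assert np.isclose(mediated_by_h, f @ J_through_h @ g)
```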
Empirically, we evaluate the method on ten pruned attribution graphs, computed on the IOI, greater-than, completion, and category tasks. Similar to our previous circuit discovery experiments, we compute attribution scores at the level of attention heads as well as individual key–query pairs. In practice, attention sparsity yields substantial computational savings: because inactive key–query pairs are known a priori to have exactly zero attribution score, attribution need only be computed for a small subset of components. This reduces the computation time per attribution graph from several hours to several minutes.
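This saving amounts to a simple filter over components. The sketch below is a toy illustration with hypothetical mask and scoring interfaces, not the paper's code: inactive key–query pairs contribute exactly zero, so only active pairs are ever scored.

```python
# Toy sketch: skip attribution for key-query pairs that the sparse
# attention mask leaves inactive (their score is zero a priori).
def sparse_attribution(score_fn, active_mask):
    """active_mask maps (layer, head, key_pos, query_pos) -> bool."""
    return {kq: score_fn(kq) for kq, on in active_mask.items() if on}

mask = {
    ("L8", "H6", 5, 8): True,   # active in the sparse pattern
    ("L8", "H6", 4, 8): False,  # pruned: guaranteed zero attribution
    ("L9", "H1", 5, 8): True,
}
scores = sparse_attribution(lambda kq: 1.0, mask)  # scores only 2 of 3 pairs
```

In a dense model, every pair would be scored; here the pruned pair is never visited, which is what collapses the per-graph runtime from hours to minutes.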
<details>
<summary>x6.png Details</summary>

### Visual Description
## Chart Type: Cumulative Mass Distribution Plots
### Overview
The image presents two cumulative distribution plots, labeled "Edges" and "Heads," comparing "Non Sparse" and "Sparse" data. The plots show the mean cumulative mass against the sorted index. The x-axis for "Edges" is on a logarithmic scale, while the x-axis for "Heads" is linear.
### Components/Axes
**Edges Plot:**
* **Title:** Edges
* **Y-axis:** Mean Cumulative Mass, ranging from 0.50 to 1.00 in increments of 0.25.
* **X-axis:** Sorted Index (log scale), ranging from 10<sup>0</sup> to 10<sup>3</sup>.
* **Data Series:**
* Non Sparse (Blue): Starts at approximately 0.45 and increases to 1.00.
* Sparse (Orange): Starts at approximately 0.50 and increases to 1.00.
* **Annotation:** "16.1x" with a dashed line indicating the difference in x-axis values where the curves reach a certain cumulative mass.
* **Legend:** Located in the bottom-right of the combined image.
**Heads Plot:**
* **Title:** Heads
* **Y-axis:** Mean Cumulative Mass, ranging from 0.50 to 1.00 in increments of 0.25.
* **X-axis:** Sorted Index, ranging from 25 to 125 in increments of 25.
* **Data Series:**
* Non Sparse (Blue): Starts at approximately 0.1 and increases to 1.00.
* Sparse (Orange): Starts at approximately 0.4 and increases to 1.00.
* **Annotation:** "3.4x" with a dashed line indicating the difference in x-axis values where the curves reach a certain cumulative mass.
* **Legend:** Located in the bottom-right of the combined image.
**Legend:**
* Located in the bottom-right of the combined image.
* Non Sparse: Blue line
* Sparse: Orange line
### Detailed Analysis
**Edges Plot:**
* **Non Sparse (Blue):** The line starts at approximately 0.45 at x=10<sup>0</sup> (1), increases rapidly until approximately x=10<sup>1</sup> (10), and then gradually approaches 1.00.
* **Sparse (Orange):** The line starts at approximately 0.50 at x=10<sup>0</sup> (1), increases rapidly until approximately x=10<sup>1.5</sup> (31.6), and then approaches 1.00.
* **16.1x Annotation:** The dashed line spans from approximately x=6.2 for the Sparse line to approximately x=100 for the Non-Sparse line, indicating a 16.1x difference in the sorted index at a certain cumulative mass (approximately 0.9).
**Heads Plot:**
* **Non Sparse (Blue):** The line starts at approximately 0.1 at x=0, increases rapidly until approximately x=25, and then gradually approaches 1.00.
* **Sparse (Orange):** The line starts at approximately 0.4 at x=0, increases rapidly until approximately x=7, and then approaches 1.00.
* **3.4x Annotation:** The dashed line spans from approximately x=2 for the Sparse line to approximately x=7 for the Non-Sparse line, indicating a 3.4x difference in the sorted index at a certain cumulative mass (approximately 0.9).
### Key Observations
* In both plots, the "Sparse" data reaches a cumulative mass of 1.00 faster than the "Non Sparse" data.
* The "Edges" plot uses a logarithmic scale for the x-axis, indicating a wider range of sorted index values compared to the "Heads" plot.
* The "Edges" plot shows a 16.1x difference in sorted index values between the "Sparse" and "Non Sparse" data at a certain cumulative mass, while the "Heads" plot shows a 3.4x difference.
### Interpretation
The plots compare the cumulative mass distribution of "Sparse" and "Non Sparse" data for "Edges" and "Heads." The "Sparse" data achieves a higher cumulative mass at lower sorted index values, suggesting that the mass is concentrated in a smaller number of elements compared to the "Non Sparse" data. The annotations "16.1x" and "3.4x" quantify this difference, indicating how much larger the sorted index needs to be for the "Non Sparse" data to reach a similar cumulative mass as the "Sparse" data. The logarithmic scale in the "Edges" plot suggests that the differences in sorted index are more pronounced for "Edges" than for "Heads."
</details>
Figure 6: Mean cumulative distribution of the component scores that mediate an attribution graph edge. The components are key–query pairs within a head (left) and full attention heads (right).
In terms of circuit size, Figure 6 shows the mean cumulative distribution of component attribution scores for each edge in the attribution graph. We find that, to reach a cumulative attribution threshold of $90\%$, the sparse model on average requires $16.1\times$ fewer key–query pairs and $3.4\times$ fewer attention heads than the dense GPT-2 model, supporting our hypothesis that sparse attention patterns lead to simpler mediation circuits.
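The 90% cumulative-mass statistic can be computed as follows (the score distributions here are made up for illustration, not the measured ones from Figure 6):

```python
import numpy as np

def components_to_threshold(scores, threshold=0.9):
    """Number of top-ranked components needed to reach the given
    fraction of the total absolute attribution mass."""
    mass = np.sort(np.abs(np.asarray(scores, dtype=float)))[::-1]
    cum = np.cumsum(mass) / mass.sum()
    return int(np.searchsorted(cum, threshold) + 1)

# Made-up distributions: sparse mass concentrates in fewer components.
dense_scores = [4, 2, 1, 1, 1, 1]
sparse_scores = [7, 2, 1]
```

With these toy numbers, `components_to_threshold(dense_scores)` is 5 while `components_to_threshold(sparse_scores)` is 2, mirroring (at much smaller scale) the head-count reduction reported above.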
<details>
<summary>x7.png Details</summary>

### Visual Description
## Diagram: GPT-2 and Sparse GPT-2 Attention Patterns
### Overview
The image presents a comparison of attention patterns in GPT-2 and Sparse GPT-2 models. It visualizes how different heads in various layers attend to specific positions in the input sequence. The left side shows a grid-like representation of attention patterns in GPT-2, while the right side focuses on specific attention heads in Sparse GPT-2, highlighting how they map key position 5 to query position 8. The diagram also indicates which layers modulate the attention at 80% for specific words.
### Components/Axes
* **Left Side:**
* Label: "GPT-2" at the bottom.
* Grid: Represents attention patterns across different layers and heads. Blue indicates attention strength, with darker shades indicating stronger attention. Red squares indicate specific attention points.
* **Top-Right:**
* Label: "All heads map key pos 5 to query pos 8"
* Grids: Five small grids representing attention patterns for specific layer-head combinations (L11-H7, L10-H1, L9-H7, L9-H1, L8-H6).
* Axes: Labeled "K" (Key) horizontally and "Q" (Query) vertically with arrows indicating direction.
* **Bottom-Right:**
* Label: "Sparse GPT-2"
* Text Sequence: "The opposite of 'large' is 'brackets'" with numerical positions 1 to 8 below each word.
* Callouts: Stacked boxes indicating the words "opposite", "large", and "brackets" are modulated by specific layers:
* "opposite" (layer 0-1)
* "large" (layer 0-3)
* "brackets" (layer 0-10)
* Callout: "Modulated at 80% by" pointing to "small" (layer 12)
### Detailed Analysis
* **GPT-2 Attention Patterns:** The grid on the left shows a complex pattern of attention. Vertical blue lines suggest some heads attend strongly to specific positions across multiple layers. Red squares indicate focused attention at particular layer-head-position combinations.
* **Sparse GPT-2 Attention Heads:** The five grids at the top-right show specific attention patterns where key position 5 maps to query position 8.
* L11-H7: A red square at approximately (5,8) and a blue square at approximately (5,7).
* L10-H1: A red square at approximately (5,8).
* L9-H7: A red square at approximately (5,8).
* L9-H1: A red square at approximately (5,8) and a blue square at approximately (5,7).
* L8-H6: A red square at approximately (5,8).
* **Modulation:** The word "small" in layer 12 modulates the attention at 80%.
### Key Observations
* The Sparse GPT-2 model focuses attention on specific key-query relationships, as demonstrated by the highlighted heads mapping key position 5 to query position 8.
* Different layers are responsible for modulating different words in the sequence.
* The attention patterns in GPT-2 are more distributed compared to the focused attention in Sparse GPT-2.
### Interpretation
The image illustrates the difference in attention mechanisms between a standard GPT-2 model and a Sparse GPT-2 model. The Sparse GPT-2 model appears to have a more targeted attention mechanism, focusing on specific key-query relationships. The modulation of specific words by different layers suggests a hierarchical processing of the input sequence, where different layers are responsible for different aspects of understanding the context. The 80% modulation by "small" in layer 12 indicates that this layer plays a significant role in determining the relationship between the words in the sequence, specifically in the context of mapping key position 5 to query position 8.
</details>
Figure 7: Sketch of the attribution graph for the sentence "The opposite of 'large' is". The cluster of features associated with large at token position 5 maps directly to the final next-token prediction logit small. We show the attention patterns of all key–query pairs required to account for $80\%$ of the cumulative attribution score. In the sparse-attention setting, this corresponds to five attention heads, compared to more than forty heads in the dense-attention case. In the sparse model, these heads read from token position 5 and write directly to the last-token residual stream at token position 8. These heads thus compute in parallel and provide a clear picture of the internal computation.
Next, we present a qualitative case study to showcase the benefits of sparse attention patterns. For a given key–query pair, we compute the causal effect from all other features in the attribution graph to both the key and the query vectors. Figure 7 illustrates this analysis for the prompt "The opposite of 'large' is". The resulting attribution graph decomposes into four coherent clusters of features: features related to opposite, features related to large, features activating on bracketed tokens, and the final next-token logit corresponding to small (see Appendix H for examples of features and visualisations).
Here, the features in the large cluster are directly connected to the small logit. The key question is then to understand how this connection from large to the small logit comes about. To this end, we analyse their mediation structure. We find that $80\%$ of the cumulative attribution score of the edges connecting the large cluster to the small logit is mediated by the same five late-layer attention key–query pairs. These attention components map features from token position $5$ directly into the final-layer residual stream at position $8$, and thus operate in parallel.
For these five key–query pairs, we then compute the causal influence of all other features in the graph on their key and query vectors. The query vectors are primarily modulated by features associated with bracketed tokens at the last token position, while the key vectors are driven by strongly active features in both the opposite and large clusters, as shown in Figure 8. These results are in agreement with recent work on attention attribution and the "opposite of" attribution graph (Kamath et al., 2025). In stark contrast, Figure 7 (left) shows that a similar (and more computationally expensive) analysis on the dense model produces a much more complicated circuit. This case study illustrates the potential of sparse attention in the context of attribution graphs, as it enables a unified view of features and circuits. By jointly analyzing feature activations, attention components, and their mediating roles, we obtain a more faithful picture of the computational graph underlying the model's input–output behavior.
5 Conclusion
Achieving interpretability requires innovations in both interpretation techniques and model design. We investigate how large models can be trained to be intrinsically interpretable. We present a flexible post-training procedure that sparsifies transformer attention while preserving the original pretraining loss. By minimally adapting the architecture, we apply a sparsity penalty under a constrained-loss objective, allowing the pre-trained model to reorganise its connectivity into a much more selective and structured pattern.
Query:
1. large (pos 5)
2. large (pos 5)
3. quantities (pos 5)
4. comparison (pos 3)
5. opposite (pos 3)

Key:
1. bracket (pos 8)
2. bracket (pos 8)
3. bracket (pos 8)
4. bracket (pos 8)
5. bracket (pos 8)
Figure 8: Minimal description of the top-5 features activating the query and key vectors of attention head L8-H6 from Figure 7.
Mechanistically, this induced sparsity gives rise to substantially simpler circuits: task-relevant computation concentrates into a small number of attention heads and edges. Across a range of tasks and analyses, we show that sparsity improves interpretability at the circuit level by reducing the number of components involved in specific behaviours. In circuit discovery experiments, most of the model's behaviour can be explained by circuits that are orders of magnitude smaller than in dense models; in attribution graph analyses, the reduced number of mediating components renders attention attribution tractable. Together, these results position sparse post-training of attention as a practical and effective tool for enhancing the mechanistic interpretability of pre-trained models.
Limitations and Future Work.
One limitation of the present investigation is that, while we deliberately focus on sparsity as a post-training intervention, it remains an open question whether injecting a sparsity bias directly during training would yield qualitatively different or simpler circuit structures. A comprehensive exploration of the performance trade-offs for larger models and for tasks that require very dense or long-range attention patterns would also be beneficial, although it lies beyond the computational means currently at our disposal. Moreover, while our study is restricted to sparsifying attention patterns, the underlying principle of leveraging sparsity to promote interpretability naturally extends to other components of the transformer architecture. As such, combining the proposed method with complementary approaches for training intrinsically interpretable models, such as sparse Mixture-of-Experts (Yang et al., 2025), sparsifying model weights (Gao et al., 2024), or limiting superposition, offers a promising direction for future work. Another exciting avenue is to apply the sparsity regularisation framework developed here within alternative post-training paradigms, such as reinforcement learning (Ouyang et al., 2022; Zhou et al., 2024) or supervised fine-tuning (Pareja et al., 2025).
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
Acknowledgment
F. D. acknowledges support through a fellowship from the Hector Fellow Academy. A. L. is supported by an EPSRC Programme Grant (EP/V000748/1). I. P. holds concurrent appointments as a Professor of Applied AI at the University of Oxford and as an Amazon Scholar. This paper describes work performed at the University of Oxford and is not associated with Amazon.
References
- E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. Ben Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson (2025) Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread.
- I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
- L. Bereska and E. Gavves (2024) Mechanistic interpretability for AI safety – a review. arXiv preprint arXiv:2404.14082.
- R. Bommasani et al. (2021) On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- T. Conerly, H. Cunningham, A. Templeton, J. Lindsey, B. Hosmer, and A. Jermyn (2025) Circuits updates – January 2025. Transformer Circuits Thread.
- A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso (2023) Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems 36, pp. 16318–16352.
- T. Dao (2023) FlashAttention-2: faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
- DeepSeek-AI (2025) DeepSeek-V3.2: pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
- J. Dunefsky, P. Chlenski, and N. Nanda (2024) Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems 37, pp. 24375–24410.
- N. Elhage, N. Nanda, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html
- L. Gao, A. Rajaram, J. Coxon, S. V. Govande, B. Baker, and D. Mossing (2024) Weight-sparse transformers have interpretable circuits. Technical report, OpenAI.
- A. Gokaslan and V. Cohen (2019) OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus
- D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, N. Subramani, M. Wortsman, P. Dasigi, N. Lambert, K. Richardson, J. Dodge, K. Lo, L. Soldaini, N. A. Smith, and H. Hajishirzi (2024) OLMo: accelerating the science of language models. Preprint.
- Y. Gu, L. Dong, F. Wei, and M. Huang (2024) MiniLLM: knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations.
- A. Gupta, G. Dar, S. Goodman, D. Ciprut, and J. Berant (2021) Memory-efficient transformers via top-$k$ attention. arXiv preprint arXiv:2106.06899.
- M. Hanna, M. Piotrowski, J. Lindsey, and E. Ameisen (2025) Circuit-tracer. https://github.com/safety-research/circuit-tracer (The first two authors contributed equally and are listed alphabetically.)
- S. Heimersheim and J. Janiak (2023) A circuit for Python docstrings in a 4-layer attention-only transformer. Alignment Forum.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations.
- E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations.
- H. Kamath, E. Ameisen, I. Kauvar, R. Luger, W. Gurnee, A. Pearce, S. Zimmerman, J. Batson, T. Conerly, C. Olah, and J. Lindsey (2025) Tracing attention computation through feature interactions. Transformer Circuits Thread.
- A. Lei, B. Schölkopf, and I. Posner (2025) SPARTAN: a sparse transformer world model attending to what matters. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- J. Lindsey, E. Ameisen, N. Nanda, S. Shabalin, M. Piotrowski, T. McGrath, M. Hanna, O. Lewis, C. Tigges, J. Merullo, C. Watts, G. Paulo, J. Batson, L. Gorton, E. Simon, M. Loeffler, C. McDougall, and J. Lin (2025a) The circuits research landscape: results and perspectives. Neuronpedia.
- J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, N. L. Turner, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. B. Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson (2025b) On the biology of a large language model. Transformer Circuits Thread.
- N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt (2023) Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.
- C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. (2022) In-context learning and induction heads. arXiv preprint arXiv:2209.11895.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- A. Pareja, N. S. Nayak, H. Wang, K. Killamsetty, S. Sudalairaj, W. Zhao, S. Han, A. Bhandwaldar, G. Xu, K. Xu, L. Han, L. Inglis, and A. Srivastava (2025) Unveiling the secret recipe: a guide for supervised fine-tuning small LLMs. In The Thirteenth International Conference on Learning Representations.
- Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huang, D. Liu, J. Zhou, and J. Lin (2025) Gated attention for large language models: non-linearity, sparsity, and attention-sink-free. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9.
- D. J. Rezende and F. Viola (2018) Taming VAEs. arXiv preprint arXiv:1810.00597.
- L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar, L. Lucy, X. Lyu, N. Lambert, I. Magnusson, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, A. Ravichander, K. Richardson, Z. Shen, E. Strubell, N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer, N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld, J. Dodge, and K. Lo (2024) Dolma: an open corpus of three trillion tokens for language model pretraining research. arXiv preprint.
- A. Syed, C. Rager, and A. Conmy (2024) Attribution patching outperforms automated circuit discovery. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 407–416.
- X. Yang, C. Venhoff, A. Khakzar, C. S. de Witt, P. K. Dokania, A. Bibi, and P. Torr (2025) Mixture of experts made intrinsically interpretable. arXiv preprint arXiv:2503.07639.
- M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. (2020) Big Bird: transformers for longer sequences. Advances in Neural Information Processing Systems 33, pp. 17283–17297.
- F. Zhang and N. Nanda (2023) Towards best practices of activation patching in language models: metrics and methods. arXiv preprint arXiv:2309.16042.
- Y. Zhou, A. Zanette, J. Pan, S. Levine, and A. Kumar (2024) ArCHer: training language model agents via hierarchical multi-turn RL. In Proceedings of the 41st International Conference on Machine Learning.
Appendix A Two-Digit Addition Study
Figure 9: Simple example showing the attention patterns (shown in blue) of sparse and non-sparse transformers trained on a two-digit addition task. Both models correctly predict the sum, but their attention patterns differ markedly: the non-sparse model solves the task with highly dispersed information flow, while the sparse model uses a highly interpretable pattern. In Layer 0, the model first attends to the corresponding digits to be added; then, in Layer 1, it attends to the carry bit only if it is needed (see the middle and right columns, where the model has to carry once and twice, respectively).
In the introduction, we used a two-digit addition task to demonstrate how sparse attention patterns can lead to intrinsically interpretable circuits. The results presented there were gathered in a small-scale toy experiment, described below. We train 4-layer, single-head transformer models on a two-digit addition task, where the input is a sequence of digits and the model is trained to predict the sum. The task uses 13 tokens in total: the ten digits and the three symbols '+', '=' and '?'.
Within this setting, we train two models: a standard transformer model and a sparse transformer with a fixed sparsity regularisation strength. Figure 9 shows several examples of the learned attention patterns. In these examples, we can clearly see that the pressure of sparsity leads to the emergence of human-recognisable algorithmic patterns: in the first layer, each digit in the answer attends to the corresponding digits in the input, while the second layer computes the carry bit when necessary. By enforcing selective information flow through sparse message-passing, the sparse model is able to learn crisp and localised mechanisms that are immediately amenable to interpretation.
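As a concrete illustration, the data for this toy task could be generated as follows. The 13-token vocabulary is taken from the text above; the exact prompt layout (operand digits interleaved with '+', answer zero-padded as in Figure 9) and the helper names are our assumptions, not the paper's released code.

```python
# Hypothetical encoding for the two-digit addition toy task.
# Vocabulary: ten digits plus '+', '=' and '?' (13 tokens, as in Appendix A).
VOCAB = {str(d): d for d in range(10)}
VOCAB.update({"+": 10, "=": 11, "?": 12})

def make_example(a, b, answer_width=5):
    """Build (prompt, target) strings for a + b, e.g. 53 + 21 -> '00074'."""
    # Interleave the digits of both two-digit operands with '+',
    # loosely following the format shown in Figure 9.
    digits = f"{a:02d}" + f"{b:02d}"
    prompt = "+".join(digits) + "=" + "?" * answer_width
    target = f"{a + b:0{answer_width}d}"
    return prompt, target

def encode(text):
    """Map a string to token ids under the 13-token vocabulary."""
    return [VOCAB[ch] for ch in text]
```

For instance, `make_example(53, 21)` yields the prompt `"5+3+2+1=?????"` with target `"00074"`, matching the examples in Figure 9.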
Appendix B Sparse Attention Implementation
For the experiments, we implemented efficient GPU kernels for the sparse attention layers using the Helion domain-specific language (https://helionlang.com/). We refer to this implementation as Splash Attention (Sparse Flash Attention). Our implementation follows the same core algorithmic structure as FlashAttention-2 (Dao, 2023), including online softmax computation and tiling. Note that the sparse attention variant (Eq. 2) differs from standard attention only by a pointwise multiplication with the adjacency matrix, which can be easily integrated into FlashAttention by computing $A_{ij}$ on the fly. We additionally fuse the Gumbel-softmax computation, the straight-through gradient, and the computation of the expected number of edges (required for the penalty) into a single optimized kernel, whose implementation will be released together with the experiment code. Figure 10 compares our Splash Attention implementation against a naive baseline based on PyTorch-native operations.
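The pointwise gating can be sketched in its naive, unfused form (a single query position, pure Python rather than a GPU kernel). This corresponds to the naive baseline rather than the Helion kernel, and assumes the gating in Eq. (2) multiplies the softmax-normalised attention weights elementwise by the binary adjacency; the function names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sparse_attention_row(scores, adj, values):
    """Attention output for one query: softmax weights gated pointwise by adj.

    scores: pre-softmax attention logits for this query over all keys
    adj:    binary adjacency row (1 keeps the edge, 0 severs it)
    values: scalar value per key (vector-valued in a real model)
    """
    w = softmax(scores)
    gated = [wi * ai for wi, ai in zip(w, adj)]  # the only change vs. dense
    return sum(g * v for g, v in zip(gated, values))
```

With an all-ones adjacency this reduces to standard attention; zeroing an entry of `adj` removes that key's contribution entirely, which is what makes the fused FlashAttention-style integration a one-line change per tile.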
Figure 10: Performance comparison between our implementation (Splash) and a naive PyTorch baseline.
Appendix C Training Details
C.1 Hyperparameters and Compute Resources
| Hyperparameter | OLMo | GPT-2 |
| --- | --- | --- |
| Base Model | allenai/OLMo-7B-hf | gpt2 |
| Context window | 512 | 64 |
| Dataset | dolma-v1 | OpenWebText |
| Batch size | 16 | 256 |
| Gradient accumulation steps | 4 | 4 |
| Total steps | 400,000 | 1,200,000 |
| Learning rate | $1Ă 10^{-5}$ | $1Ă 10^{-5}$ |
| Minimum learning rate | $1Ă 10^{-6}$ | $1Ă 10^{-6}$ |
| Optimizer | Adam | Adam |
| Weight decay | 0.1 | 0.1 |
| Scheduler | Cosine (1 cycle) | Cosine (1 cycle) |
| Warmup steps | 1,000 | 1,000 |
| Finetuning strategy | LoRA | Full |
| LoRA rank ( $r$ ) | 400 | - |
| LoRA scaling ( $\alpha$ ) | 800 | - |
| LoRA dropout | 0 | - |
| LoRA target modules | q,k,v,o,fc_in,fc_out | - |
| Dual Optimisation LR | 0.01 | 0.1 |
| Target cross-entropy | 2.29 | 3.5 |
Table 2: Key hyperparameters used for sparse post-training experiments on OLMo-7B and GPT-2.
We provide the key hyperparameters for our experiments in Table 2. All training runs are performed on NVIDIA H100 GPUs: the GPT-2 model is trained on a single GPU, while the OLMo model is trained on a node of 8 GPUs. The total training time for both models is roughly 14 days. The main sparse-attention code will be made available as a Transformers library wrapper; the implementation code as well as the model weights will also be released.
C.2 Training Dynamics
Figure 11: Training curves for post-training OLMo-7B, tracking the model sparsity (left), regularisation strength (middle), and cross-entropy loss (right). The black dotted line on the cross-entropy plot indicates the pre-defined threshold, $\tau$.
A key feature of our post-training framework is that the strength of the sparsity regularisation is automatically controlled via a constrained optimisation scheme. By pre-specifying an accepted level for the cross-entropy target, $\tau$ , the training procedure can be written as the max-min objective:
$$
\max_{\lambda>0}\min_{\theta}\bigg[\sum_{l}\mathbb{E}\big[|A_{l}|\big]+\lambda(\mathrm{CE}-\tau)\bigg], \tag{7}
$$
which can be optimised by taking alternating gradient steps in the model weight space and in the $\lambda$ space. Under the resulting training dynamics, the sparsity regularisation strength increases when the model's cross-entropy is below the target and decreases when it is above the threshold. Figure 11 shows the training curves for the OLMo-7B model. Here, we observe that the strength of the sparsity regularisation keeps increasing slowly while the model's cross-entropy remains clipped at the desired level. Note that during a loss spike (at around 100K steps), the sparsity regularisation automatically decreases to let the model recover.
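The alternating scheme can be illustrated on a one-dimensional stand-in problem of our own choosing: minimise a sparsity proxy $|w|$ subject to a quadratic stand-in loss staying below $\tau$. None of the constants, loss functions, or helper names below come from the paper; the sketch only mirrors the structure of Eq. (7), with a gradient step on the weights followed by a projected ascent step on the multiplier.

```python
import math

def ce(w):
    # Stand-in for the cross-entropy loss; minimised at w = 2.
    return (w - 2.0) ** 2

def ce_grad(w):
    return 2.0 * (w - 2.0)

def sparsity_grad(w):
    # Subgradient of the sparsity proxy |w|.
    return math.copysign(1.0, w) if w != 0 else 0.0

def constrained_sparsify(tau=0.5, lr_w=0.02, lr_lam=0.05, steps=20000):
    """Alternating updates for max_lam min_w [ |w| + lam * (ce(w) - tau) ]."""
    w, lam = 0.0, 1.0
    for _ in range(steps):
        # weight step: descend the Lagrangian
        w -= lr_w * (sparsity_grad(w) + lam * ce_grad(w))
        # multiplier step: ascend, projected onto lam >= 0,
        # so the constraint ce(w) <= tau is enforced at convergence
        lam = max(0.0, lam + lr_lam * (ce(w) - tau))
    return w, lam
```

On this toy problem the iterates spiral into the point where the constraint is active, i.e. `ce(w)` settles at `tau` with the smallest-magnitude feasible `w`, mirroring how the cross-entropy in Figure 11 stays clipped at the threshold while sparsity keeps improving.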
Appendix D Extra Experiments for Circuit Discovery
In this section, we provide additional results for the activation patching circuit discovery experiment presented in the main text.
Figure 12 shows the attention patterns of the heads required to explain 90% of model behaviour on a copy task. To fully exercise the longer context window afforded by OLMo, we use a longer prompt than the one used for GPT-2 in the main text. The result is consistent with the GPT-2 experiment: the sparsified model facilitates the discovery of smaller circuits of induction heads that implement the copy task.
Figure 12: Attention patterns of the heads required to explain 90% of model behaviour on a longer copy task. Similar to the GPT-2 results in Figure 3, the sparse model requires substantially fewer attention heads.
Figures 13 and 14 show the fraction of explained model preference as a function of the number of model components kept un-ablated. The difference between these plots and Figures 4 and 5 lies in how individual model components are ranked. Here, the ranking is performed at the task level, meaning that the importance score for each component is pooled across different instances of the same task. Overall, the results are consistent with those presented in the main paper: the ranking strategy consistently discovers smaller circuits in sparse models. The only exception is the Greater Than task for GPT-2, where the number of attention heads required for the sparse model is larger than that of the base model. We hypothesise that this is because the sparse model chooses different circuits to implement different instances of the same task, rendering the task-level importance score less suitable for circuit discovery in this case. Finally, in Figure 15, we provide a qualitative visualisation of the edges required to complete the IOI task.
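The ranking-and-ablation analysis behind these curves can be sketched as follows. This is a minimal illustration with synthetic importance scores, and the function names are ours rather than the authors': components are sorted by a pooled task-level score, and we report the cumulative fraction of the total effect recovered as the top-k components are kept un-ablated.

```python
import numpy as np

def explained_effect_curve(scores, effects):
    """Sort components by importance score (pooled across instances of a
    task) and return the cumulative fraction of the total effect that the
    top-k components account for, for k = 1..n."""
    order = np.argsort(scores)[::-1]        # most important component first
    cumulative = np.cumsum(effects[order])
    return cumulative / effects.sum()

def components_needed(curve, threshold=0.9):
    """Smallest k such that the top-k components explain >= threshold."""
    return int(np.searchsorted(curve, threshold)) + 1

# Toy example: three heads with effects 3, 2, 1 (scores equal to effects).
effects = np.array([3.0, 2.0, 1.0])
curve = explained_effect_curve(effects, effects)
print(components_needed(curve))  # prints 3: all three heads needed for 90%
```

The threshold count read off the curve is what the dotted lines in Figures 13 and 14 annotate for each model.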
<details>
<summary>x11.png Details</summary>

### Visual Description
## Chart Type: Multiple Line Graphs
### Overview
The image presents four line graphs comparing the "Explained Effect" of different models (GPT-2, Sparse GPT-2, OLMo-7B, Sparse OLMo-7B) against the "Number of Heads Kept." Each graph represents a different task: "Greater Than," "IOI," "Docstring," and "IOI Long." The graphs show how the explained effect changes as the number of heads kept increases. Each graph also includes a horizontal dashed line indicating a specific performance level, with a label indicating a multiplicative factor (e.g., "0.6x", "2.0x").
### Components/Axes
* **X-axis (Horizontal):** "Number of Heads Kept." The scales vary across the graphs.
* "Greater Than" and "IOI": 0 to 100, with tick marks at 50 and 100.
* "Docstring" and "IOI Long": 0 to 1000, with tick marks at 250, 500, 750, and 1000.
* **Y-axis (Vertical):** "Explained Effect." The scale is consistent across all graphs, ranging from 0.0 to 1.0, with a tick mark at 0.5.
* **Legends (Bottom of the first and third graphs):**
* "Greater Than" and "IOI":
* Blue line: "GPT-2"
* Orange line: "Sparse GPT-2"
* "Docstring" and "IOI Long":
* Green line: "OLMo-7B"
* Pink line: "Sparse OLMo-7B"
* **Titles (Top of each graph):** "Greater Than," "IOI," "Docstring," "IOI Long."
* **Horizontal Dashed Lines:** Each graph has a horizontal dashed line indicating a specific performance level.
* "Greater Than": Dashed line at approximately y=0.9, labeled "0.6x" above the line.
* "IOI": Dashed line at approximately y=0.85, labeled "2.0x" above the line.
* "Docstring": Dashed line at approximately y=0.9, labeled "1.3x" above the line.
* "IOI Long": Dashed line at approximately y=0.9, labeled "1.6x" above the line.
### Detailed Analysis
**1. Greater Than**
* **GPT-2 (Blue):** The line starts at approximately 0.0 and increases rapidly to approximately 0.9 around x=50. It then plateaus and fluctuates slightly between 0.9 and 1.0.
* **Sparse GPT-2 (Orange):** The line starts at approximately 0.0 and increases more gradually than GPT-2, reaching approximately 0.9 around x=75. It then plateaus and fluctuates slightly between 0.9 and 1.0.
* **Dashed Line:** The dashed line is at approximately y=0.9, labeled "0.6x".
**2. IOI**
* **GPT-2 (Blue):** The line starts at approximately 0.0 and increases rapidly to approximately 0.8 around x=75. It then plateaus and fluctuates slightly between 0.8 and 0.9.
* **Sparse GPT-2 (Orange):** The line starts at approximately 0.0 and increases more rapidly than GPT-2, reaching approximately 0.8 around x=50. It then increases to approximately 1.0.
* **Dashed Line:** The dashed line is at approximately y=0.85, labeled "2.0x".
**3. Docstring**
* **OLMo-7B (Green):** The line starts at approximately 0.0 and increases rapidly to approximately 0.9 around x=250. It then plateaus and fluctuates slightly between 0.9 and 1.0.
* **Sparse OLMo-7B (Pink):** The line starts at approximately 0.0 and increases rapidly to approximately 0.9 around x=250. It then plateaus and fluctuates slightly between 0.9 and 1.0.
* **Dashed Line:** The dashed line is at approximately y=0.9, labeled "1.3x".
**4. IOI Long**
* **OLMo-7B (Green):** The line starts at approximately 0.0 and increases rapidly to approximately 0.9 around x=250. It then plateaus and fluctuates slightly between 0.9 and 1.0.
* **Sparse OLMo-7B (Pink):** The line starts at approximately 0.0 and increases rapidly to approximately 0.9 around x=250. It then plateaus and fluctuates slightly between 0.9 and 1.0.
* **Dashed Line:** The dashed line is at approximately y=0.9, labeled "1.6x".
### Key Observations
* In the "Greater Than" task, GPT-2 reaches a high explained effect faster than Sparse GPT-2.
* In the "IOI" task, Sparse GPT-2 reaches a higher explained effect than GPT-2.
* In the "Docstring" and "IOI Long" tasks, OLMo-7B and Sparse OLMo-7B perform similarly.
* The "Number of Heads Kept" required to reach a high explained effect varies significantly between the tasks. "Greater Than" and "IOI" require fewer heads than "Docstring" and "IOI Long."
* The dashed lines mark the 90% threshold, and the labels (e.g. "0.6x", "2.0x") give the ratio between the number of heads the base and sparse models need to reach it.
### Interpretation
The graphs compare the size of task circuits in the base and sparse-attention models. For most tasks, the sparse model reaches a given explained effect with fewer attention heads: in the "IOI" task, Sparse GPT-2 attains a high explained effect with roughly half the heads GPT-2 needs (the "2.0x" annotation). The "Greater Than" task is the exception, with the base GPT-2 requiring only 0.6x the heads of its sparse counterpart. For the OLMo-7B models on "Docstring" and "IOI Long", the curves are closer, but the sparse model still reaches the 90% threshold with fewer heads (1.3x and 1.6x). Overall, the multiplicative factors quantify, per task, how many times more heads the base model needs than the sparse model at the 90% threshold.
</details>
Figure 13: Logit attribution per sentence, keeping only the top-k attention heads based on a global ranking score. The dotted line annotates the number of attention heads needed to explain 90% of the logit difference. With the exception of the Greater Than task for GPT-2, the sparse models admit smaller circuits.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Chart: Explained Effect vs. Number of Edges Kept for Different Models and Tasks
### Overview
The image presents four line charts comparing the "Explained Effect" against the "Number of Edges Kept (log scale)" for different models (GPT-2 and OLMo-7B) and tasks (Greater Than, IOI, Docstring, and IOI Long). Each chart compares a dense model with its sparse counterpart. The x-axis is logarithmic, and the y-axis represents the explained effect, ranging from 0.0 to 1.0. Each chart also annotates a multiplicative factor: the ratio between the number of edges the base and sparse models need to reach the same explained effect.
### Components/Axes
* **X-axis:** Number of Edges Kept (log scale). The scale ranges from 10^0 to 10^4 or 10^5 depending on the chart.
* **Y-axis:** Explained Effect. The scale ranges from 0.0 to 1.0.
* **Titles:**
* Top-left: Greater Than
* Top-middle-left: IOI
* Top-middle-right: Docstring
* Top-right: IOI Long
* **Legends:**
* Greater Than and IOI charts:
* Blue line: GPT-2
* Orange line: Sparse GPT-2
* Docstring and IOI Long charts:
* Green line: OLMo-7B
* Pink line: Sparse OLMo-7B
* **Multiplicative Factors:**
* Greater Than: 41.9x
* IOI: 14.9x
* Docstring: 5.5x
* IOI Long: 3.1x
### Detailed Analysis
**1. Greater Than**
* **GPT-2 (Blue):** The explained effect increases sharply between 10^1 and 10^2 edges, reaching near 1.0 by 10^3 edges.
* At 10^0 edges, the explained effect is approximately 0.1.
* At 10^1 edges, the explained effect is approximately 0.2.
* At 10^2 edges, the explained effect is approximately 0.9.
* At 10^3 edges, the explained effect is approximately 1.0.
* **Sparse GPT-2 (Orange):** The explained effect increases gradually between 10^0 and 10^3 edges, reaching near 1.0 by 10^4 edges.
* At 10^0 edges, the explained effect is approximately 0.0.
* At 10^1 edges, the explained effect is approximately 0.2.
* At 10^2 edges, the explained effect is approximately 0.7.
* At 10^3 edges, the explained effect is approximately 0.95.
* At 10^4 edges, the explained effect is approximately 1.0.
* **Multiplicative Factor:** 41.9x. Consistent with the results in the main text, this indicates that the base model needs 41.9 times more edges than the sparse model to reach the 90% threshold.
**2. IOI**
* **GPT-2 (Blue):** The explained effect increases sharply between 10^1 and 10^2 edges, reaching near 1.0 by 10^3 edges.
* At 10^0 edges, the explained effect is approximately 0.05.
* At 10^1 edges, the explained effect is approximately 0.2.
* At 10^2 edges, the explained effect is approximately 0.8.
* At 10^3 edges, the explained effect is approximately 0.95.
* **Sparse GPT-2 (Orange):** The explained effect increases gradually between 10^0 and 10^3 edges, reaching near 1.0 by 10^4 edges.
* At 10^0 edges, the explained effect is approximately 0.15.
* At 10^1 edges, the explained effect is approximately 0.3.
* At 10^2 edges, the explained effect is approximately 0.85.
* At 10^3 edges, the explained effect is approximately 0.95.
* At 10^4 edges, the explained effect is approximately 1.0.
* **Multiplicative Factor:** 14.9x.
**3. Docstring**
* **OLMo-7B (Green):** The explained effect increases sharply between 10^2 and 10^4 edges, reaching near 1.0 by 10^5 edges.
* At 10^1 edges, the explained effect is approximately 0.0.
* At 10^2 edges, the explained effect is approximately 0.1.
* At 10^3 edges, the explained effect is approximately 0.5.
* At 10^4 edges, the explained effect is approximately 0.9.
* At 10^5 edges, the explained effect is approximately 1.0.
* **Sparse OLMo-7B (Pink):** The explained effect increases sharply between 10^2 and 10^4 edges, reaching near 1.0 by 10^5 edges.
* At 10^1 edges, the explained effect is approximately 0.0.
* At 10^2 edges, the explained effect is approximately 0.05.
* At 10^3 edges, the explained effect is approximately 0.3.
* At 10^4 edges, the explained effect is approximately 0.8.
* At 10^5 edges, the explained effect is approximately 1.0.
* **Multiplicative Factor:** 5.5x.
**4. IOI Long**
* **OLMo-7B (Green):** The explained effect increases sharply between 10^2 and 10^4 edges, reaching near 1.0 by 10^5 edges.
* At 10^1 edges, the explained effect is approximately 0.0.
* At 10^2 edges, the explained effect is approximately 0.1.
* At 10^3 edges, the explained effect is approximately 0.4.
* At 10^4 edges, the explained effect is approximately 0.8.
* At 10^5 edges, the explained effect is approximately 1.0.
* **Sparse OLMo-7B (Pink):** The explained effect increases sharply between 10^2 and 10^4 edges, reaching near 1.0 by 10^5 edges.
* At 10^1 edges, the explained effect is approximately 0.0.
* At 10^2 edges, the explained effect is approximately 0.05.
* At 10^3 edges, the explained effect is approximately 0.3.
* At 10^4 edges, the explained effect is approximately 0.7.
* At 10^5 edges, the explained effect is approximately 1.0.
* **Multiplicative Factor:** 3.1x.
### Key Observations
* The annotated factors indicate that the base models require substantially more edges than their sparse counterparts to reach the same explained effect.
* The "Greater Than" task shows the largest difference between the dense and sparse models (41.9x), while "IOI Long" shows the smallest difference (3.1x).
* The x-axis is logarithmic, indicating that the number of edges has a significant impact on the explained effect, especially in the lower range of edge counts.
* The explained effect generally plateaus as the number of edges increases, approaching 1.0 for all models and tasks.
### Interpretation
The charts quantify how much smaller the attention circuits of the sparse models are. The multiplicative factors (41.9x, 14.9x, 5.5x, and 3.1x) give the ratio of edges the base model needs relative to the sparse model at a matched explained effect, so the sparse circuits are between roughly 3x and 40x smaller depending on the task.
The variation in factors across tasks ("Greater Than," "IOI," "Docstring," and "IOI Long") suggests that the benefit of sparsity depends on the task: the "Greater Than" circuit shrinks the most, while the "IOI Long" circuit shrinks the least. This could reflect differences in task complexity or in the kinds of relationships the circuit has to capture.
The logarithmic x-axis highlights the importance of the first edges kept: adding edges at the low end of the scale raises the explained effect far more than adding edges at the high end, suggesting that a small core of connections carries most of the task behaviour.
</details>
Figure 14: Logit attribution per sentence, keeping only the top-k attention edges based on a global ranking score. The dotted line annotates the number of attention edges needed to explain 90% of the logit difference.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Diagram: Sparse GPT-2 vs. GPT-2 (baseline)
### Overview
The image presents two diagrams side-by-side, visually comparing the connectivity patterns of a "Sparse GPT-2" model and a "GPT-2 (baseline)" model. Each diagram represents the layers of the model as a grid of nodes, with connections between nodes indicated by lines. The "Sparse GPT-2" diagram shows fewer connections than the "GPT-2 (baseline)" diagram, illustrating the sparsity of the former.
### Components/Axes
* **Title (Left Diagram):** Sparse GPT-2
* **Title (Right Diagram):** GPT-2 (baseline)
* **Y-axis Label:** Layers (with a downward-pointing arrow indicating the direction of increasing layer depth)
* **Nodes:** Represented by circles. Some nodes are filled with light gray, while others are white with a black outline.
* **Connections:** Represented by thin gray lines connecting the nodes.
### Detailed Analysis
**Left Diagram: Sparse GPT-2**
* The diagram is an 8x8 grid of nodes.
* The Y-axis "Layers" has an arrow pointing downwards, implying the layers increase in depth from top to bottom.
* Many nodes are filled with light gray, while some are white with a black outline.
* The connections between nodes are sparse, with most nodes having only a few connections.
* There appears to be a higher concentration of connections towards the bottom layers.
* There are some vertical connections between nodes in adjacent layers.
**Right Diagram: GPT-2 (baseline)**
* The diagram is an 8x8 grid of nodes.
* The Y-axis "Layers" is implied to be the same as the left diagram.
* Many nodes are filled with light gray, while some are white with a black outline.
* The connections between nodes are dense, with most nodes having many connections.
* The connections appear to be more evenly distributed across the layers compared to the "Sparse GPT-2" diagram.
### Key Observations
* The "Sparse GPT-2" model has significantly fewer connections than the "GPT-2 (baseline)" model, as indicated by the sparser network of lines.
* The "Sparse GPT-2" model seems to have a higher concentration of connections in the lower layers.
* The "GPT-2 (baseline)" model has a more uniform distribution of connections across all layers.
* The nodes that are white with a black outline seem to be the active nodes, while the gray nodes are inactive.
### Interpretation
The diagrams visually demonstrate the difference in connectivity between a sparse GPT-2 model and the baseline GPT-2. The sparsity of the "Sparse GPT-2" model reflects the post-training regularisation described in the main text, which removes most attention edges while preserving capability; sparsity here serves as a structural prior for interpretability rather than as an efficiency optimisation. The concentration of connections in the lower layers of the sparse model might indicate that these layers are more crucial for the task's initial processing. The baseline, with its dense connections, shows the far less organised connectivity of the unregularised model.
</details>
Figure 15: An example of the attention-head edges required to reach a cumulative score of 0.9, based on the averaged scores for the IOI task.
Appendix E Circuit Discovery Tasks
In the following, we provide the details and prompts for the various tasks used in Section 4.2.
E.1 Greater-Than Task
Each example contains a clean prompt, a corrupt prompt, and two disjoint sets of candidate continuations, answers and wrong_answers. A typical entry is:
```json
{
  "clean": "The demonstrations lasted from the year 1363 to 13",
  "corrupt": "The demonstrations lasted from the year 1301 to 13",
  "answers": ["64", "65", ..., "99"],
  "wrong_answers": ["00", "01", ..., "63"]
}
```
For the clean prompt, any token in answers yields an end year strictly greater than the start year (e.g. "1364" to "1399"), whereas tokens in wrong_answers correspond to years that are less than or equal to the start year. The corrupt prompt changes only the starting year, shifting which continuations correspond to valid end years. We use the logit difference between the aggregated probability mass on answers and on wrong_answers, in clean vs. corrupt contexts, as our signal, in the spirit of prior mechanistic studies on simple algorithmic tasks (Elhage et al., 2021; Nanda et al., 2023).
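A minimal generator for such examples might look as follows. This is a sketch under our own naming, not the authors' code; it reproduces the partition of the 100 two-digit suffixes into valid and invalid end years.

```python
import random

def make_greater_than_example(rng):
    """Build one Greater-Than example: the clean prompt ends mid-year
    ("... 13"), and valid continuations are the two-digit suffixes strictly
    greater than the start year's suffix."""
    yy = rng.randint(2, 98)  # start-year suffix, e.g. 63 -> "1363"
    return {
        "clean": f"The demonstrations lasted from the year 13{yy:02d} to 13",
        "corrupt": "The demonstrations lasted from the year 1301 to 13",
        "answers": [f"{y:02d}" for y in range(yy + 1, 100)],
        "wrong_answers": [f"{y:02d}" for y in range(yy + 1)],
    }

ex = make_greater_than_example(random.Random(0))
# answers and wrong_answers partition the 100 two-digit suffixes
assert len(ex["answers"]) + len(ex["wrong_answers"]) == 100
assert set(ex["answers"]).isdisjoint(ex["wrong_answers"])
```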
E.2 Indirect Object Identification (IOI) Task
Our IOI setup follows the standard indirect object identification paradigm for mechanistic interpretability (Elhage et al., 2021; Conmy et al., 2023). Each example is generated by combining:
- a pair of names $(A,B)$ , e.g. (" Mary", " John");
- a natural-language template with placeholders [A], [B], and [S].
We instantiate templates such as:
- "Then, [B] and [A] went to the park. [S] gave a ball to"
- "When [B] and [A] got a snack at the cafe, [S] decided to give it to"
- "After the lunch, [B] and [A] went to the mall. [S] gave a gift to"
by sampling a name pair and substituting $[A]$ and $[B]$ , then choosing the subject $[S]$ (either one of the pair). The correct continuation is the indirect object, i.e. the other member of the pair.
For example, with $(A,B)=(\texttt{" John"},\texttt{" Mary"})$ and $S=B$ , one instance is:
Then, Mary and John went to the park. Mary gave a ball to
The correct continuation is " John", while " Mary" and any distractor names are treated as incorrect candidates.
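The sampling procedure can be sketched as follows — a hypothetical illustration with a small name pool of our own choosing (the leading space in each name mirrors GPT-2-style tokenization, where " John" is a single token):

```python
import random

TEMPLATES = [
    "Then, [B] and [A] went to the park. [S] gave a ball to",
    "When [B] and [A] got a snack at the cafe, [S] decided to give it to",
    "After the lunch, [B] and [A] went to the mall. [S] gave a gift to",
]
NAMES = [" Mary", " John", " Tom", " Anna"]

def make_ioi_example(rng):
    """Sample a name pair (A, B), pick the subject S from the pair, and
    instantiate a template; the correct continuation is the other name."""
    a, b = rng.sample(NAMES, 2)
    s = rng.choice([a, b])
    target = b if s == a else a  # the indirect object
    prompt = (rng.choice(TEMPLATES)
              .replace("[B]", b.strip())
              .replace("[A]", a.strip())
              .replace("[S]", s.strip()))
    return prompt, target

prompt, target = make_ioi_example(random.Random(0))
assert "[" not in prompt and target.strip() in prompt
```

The target retains its leading space so it can be compared directly against candidate next tokens.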
In the OLMo experiments, in order to further test the capability of our approach, we use a different set of IOI tasks with increased complexity and prompt length. Example templates include:
- "After several months without any contact due to conflicting schedules and unexpected personal obligations, [B] and [A] finally met again at the park, where they spent a long afternoon catching up on past events, sharing stories, and reflecting on how much had changed. As the day came to an end, [S] gave a ball to"
- "Although [B] and [A] had previously been involved in a long and emotionally charged argument that left several issues unresolved, they agreed to meet in order to clarify their misunderstandings. After a tense but honest conversation, [S] said to"
E.3 Docstring Task
We also test the OLMo models on a more complex Docstring task (Heimersheim and Janiak, 2023; Conmy et al., 2023), where the model needs to attend to a specific argument of a specified function in order to complete a docstring. As in the Greater Than task, each example contains a clean prompt, a corrupt prompt, and two disjoint sets of candidate continuations. A typical entry is:
```json
{
  "clean": "def model(self, results, old, option): \"\"\" stage agency security vision spot tone joy session river unit :param results: bone paper selection sky :param old: host action hell miss :param",
  "corrupt": "def model(self, command, output, state): \"\"\" stage agency security vision spot tone joy session river unit :param old: bone paper selection sky :param results: host action hell miss :param",
  "answers": [" option"],
  "wrong_answers": [" results", " old"]
}
```
Appendix F Cross-Layer-Transcoder
| Category | Setting |
| --- | --- |
| Model | GPT-2 (HookedTransformer) |
| Input dimension ( $d_{\text{in}}$ ) | 768 |
| Latent dimension ( $d_{\text{latent}}$ ) | 24 576 |
| Expansion factor | 32 |
| Context size | 64 |
| Batch size (tokens) | 1 024 |
| Precision | Mixed (FP32 / AMP) |
| Device | CUDA |
| Distributed training | DDP |
| Optimizer | Adam |
| Learning rate | $2\times 10^{-4}$ |
| Adam $\beta_{1}$ / $\beta_{2}$ | 0.9 / 0.999 |
| Learning rate warm-up | Cosine (1 000 steps) |
| Learning rate decay steps | 1 874 |
| Final LR scale | 0.1 |
| $L_{0}$ coefficient | 2 |
| Optimal $L_{0}$ | 3 |
| $L_{0}$ warm-up | Linear (18 749 steps) |
| Dead feature penalty | $10^{-5}$ |
| Dead feature window | 250 |
Table 3: Training configuration for the GPT-2 cross-layer-transcoders.
To implement a cross-layer transcoder, let $\mathbf{h}_{\ell}\in\mathbb{R}^{d_{\text{model}}}$ denote the input to the MLP at layer $\ell$ for a single token position. This representation is projected into a sparse feature space via an encoder,
$$
\mathbf{z}_{\ell}=\mathrm{ReLU}\!\left(\mathbf{W}_{\mathrm{enc}}^{\ell}\mathbf{h}_{\ell}+\mathbf{b}_{\mathrm{enc}}^{\ell}\right)\in\mathbb{R}^{d_{\text{features}}}, \tag{8}
$$
where $\mathbf{W}_{\mathrm{enc}}^{\ell}\in\mathbb{R}^{d_{\text{features}}\times d_{\text{model}}}$ and $\mathbf{b}_{\mathrm{enc}}^{\ell}\in\mathbb{R}^{d_{\text{features}}}$ are layer-specific encoder parameters.
The CLT reconstructs the MLP output at a target layer $\ell^{\prime}$ by linearly aggregating feature activations originating from all preceding layers,
$$
\hat{\mathbf{m}}_{\ell^{\prime}}=\sum_{\ell\leq\ell^{\prime}}\mathbf{W}_{\mathrm{dec}}^{\ell\rightarrow\ell^{\prime}}\mathbf{z}_{\ell}+\mathbf{b}_{\mathrm{dec}}^{\ell^{\prime}}, \tag{9}
$$
where $\mathbf{W}_{\mathrm{dec}}^{\ell\rightarrow\ell^{\prime}}\in\mathbb{R}^{d_{\text{model}}\times d_{\text{features}}}$ denotes the decoder mapping from layer $\ell$ to layer $\ell^{\prime}$ .
The summation over layers reflects the fact that a given semantic feature may manifest in different representations across multiple MLP layers. For example, a feature that emerges in the MLP at layer $\ell$ may reappear, potentially in a transformed form, in the outputs of subsequent MLPs. Without accounting for these layer-dependent variations, such duplicated representations would lead to redundant nodes in the attribution graph. By allowing features to be represented differently across layers while being linked through a shared latent space, the cross-layer transcoder avoids this duplication and yields a more compact and interpretable attribution structure. For a detailed comparison between cross-layer transcoders and standard transcoders, we refer the reader to Lindsey et al. (2025a).
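Equations (8)–(9) can be made concrete with a minimal numpy sketch. The class name, initialization scheme, and dimensions here are ours, chosen only for illustration; the trained model's parameterization follows Table 3.

```python
import numpy as np

class CrossLayerTranscoder:
    """Numpy sketch of Eqs. (8)-(9): layer-specific encoders map MLP inputs
    to sparse features, and the MLP output at layer l' is reconstructed from
    the features of all layers l <= l'."""

    def __init__(self, n_layers, d_model, d_features, rng):
        scale = 1.0 / np.sqrt(d_model)
        self.W_enc = rng.normal(0.0, scale, (n_layers, d_features, d_model))
        self.b_enc = np.zeros((n_layers, d_features))
        # One decoder matrix per (source l, target l') pair with l <= l'.
        self.W_dec = rng.normal(0.0, scale,
                                (n_layers, n_layers, d_model, d_features))
        self.b_dec = np.zeros((n_layers, d_model))

    def encode(self, h):
        """Eq. (8): h has shape (n_layers, d_model) for one token position."""
        pre = np.einsum("lfd,ld->lf", self.W_enc, h) + self.b_enc
        return np.maximum(pre, 0.0)  # ReLU

    def reconstruct(self, z, target_layer):
        """Eq. (9): sum decoder outputs over source layers l <= target_layer."""
        m_hat = self.b_dec[target_layer].copy()
        for l in range(target_layer + 1):
            m_hat += self.W_dec[l, target_layer] @ z[l]
        return m_hat
```

For the GPT-2 configuration of Table 3, `n_layers=12`, `d_model=768`, and `d_features=24576` (expansion factor 32).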
Following the training procedure proposed by Anthropic (Ameisen et al., 2025), the final objective combines reconstruction accuracy with sparsity and dead-feature regularization:
$$
\mathcal{L}=\underbrace{\sum_{\ell^{\prime}}\left\|\hat{\mathbf{m}}_{\ell^{\prime}}-\mathbf{m}_{\ell^{\prime}}\right\|_{2}^{2}}_{\text{MSE reconstruction}}+\lambda_{0}\underbrace{\sum_{\ell}\tanh\!\big(C\,(\mathbf{z}_{\ell}\odot\|\mathbf{W}_{\mathrm{dec}}^{\ell}\|)\big)}_{\text{$L_{0}$ sparsity}}+\lambda_{\mathrm{df}}\underbrace{\sum_{\ell}\mathrm{ReLU}\!\Big(\exp(\tau)-\mathbf{h}_{\ell}^{\mathrm{pre}}\Big)\,\|\mathbf{W}_{\mathrm{dec}}^{\ell}\|}_{\text{dead-feature penalty}}, \tag{10}
$$
where $\mathbf{W}_{\mathrm{dec}}^{\ell}$ denotes the concatenated decoder weights associated with layer $\ell$ , $\mathbf{h}_{\ell}^{\mathrm{pre}}$ are the corresponding pre-activation values, $\tau$ is a threshold parameter, and $C$ is a scaling constant. The hyperparameters $\lambda_{0}$ and $\lambda_{\mathrm{df}}$ control the strength of the sparsity and dead-feature regularization terms. We initialize the weights following the circuits updates of Conerly et al. (2025). The encoder bias is initialized so that a fixed proportion of the features is active at initialization. We provide in Figure 16 the training curves of the sparsity value, the sparsity coefficient, the explained variance, and the amount of dead features. We hope this can help the community in training their own cross-layer transcoders.
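The objective in Eq. (10) can be sketched in a few lines of numpy. The argument names are ours; `lam0` defaults to the $L_{0}$ coefficient of 2 from Table 3, and the other hyperparameters are placeholders.

```python
import numpy as np

def clt_loss(m_hat, m, z, dec_norms, pre, tau, C=1.0, lam0=2.0, lam_df=1e-5):
    """Numpy sketch of the objective in Eq. (10).
    m_hat, m:  (n_layers, d_model)    reconstructed / true MLP outputs
    z:         (n_layers, d_features) feature activations
    dec_norms: (n_layers, d_features) decoder weight norm per feature
    pre:       (n_layers, d_features) encoder pre-activations
    """
    mse = np.sum((m_hat - m) ** 2)                 # reconstruction error
    sparsity = np.sum(np.tanh(C * z * dec_norms))  # smooth L0 surrogate
    dead = np.sum(np.maximum(np.exp(tau) - pre, 0.0) * dec_norms)
    return mse + lam0 * sparsity + lam_df * dead
```

With perfect reconstruction, inactive features ($\mathbf{z}=0$), and pre-activations above $\exp(\tau)$, all three terms vanish.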
<details>
<summary>figures/l0_vs_steps.png Details</summary>

### Visual Description
## Line Chart: L0 over Training Steps
### Overview
The image is a line chart displaying the relationship between "L0" (on a logarithmic scale) and "Training steps (M)" on a linear scale. The chart shows a decreasing trend of L0 as the number of training steps increases.
### Components/Axes
* **Title:** L0 over Training Steps
* **X-axis:**
* Label: Training steps (M)
* Scale: Linear
* Markers: 0, 25, 50, 75, 100, 125, 150, 175, 200
* **Y-axis:**
* Label: L0
* Scale: Logarithmic (base 10)
* Markers: 10<sup>1</sup>, 10<sup>2</sup>, 10<sup>3</sup>, 10<sup>4</sup>
### Detailed Analysis
* **Data Series:** A single blue line represents the data.
* **Trend:** The line shows a steep decrease initially, followed by a gradual decline as the number of training steps increases.
* **Values:**
* At 0 Training steps, L0 is approximately 8000 - 9000.
* At 25 Training steps, L0 is approximately 90 - 100.
* At 50 Training steps, L0 is approximately 70 - 80.
* At 75 Training steps, L0 is approximately 30 - 40.
* At 100 Training steps, L0 is approximately 20 - 30.
* At 125 Training steps, L0 is approximately 15 - 20.
* At 150 Training steps, L0 is approximately 10 - 15.
* At 175 Training steps, L0 is approximately 7 - 10.
* At 200 Training steps, L0 is approximately 5 - 7.
### Key Observations
* The most significant decrease in L0 occurs within the first 25 training steps.
* The rate of decrease slows down considerably after 50 training steps.
* The y-axis is on a log scale, which means that equal distances represent multiplicative changes in L0.
### Interpretation
The chart tracks the transcoder's $L_{0}$, i.e. the average number of features active per token. The steep initial drop and subsequent gradual decline show the representation becoming sparser as the sparsity regularisation takes effect, moving from thousands of active features at initialization to fewer than ten by the end of training. The logarithmic scale emphasizes these relative changes, highlighting that most of the sparsification happens during the early stages of training.
</details>
(a) $L_{0}$ vs steps
<details>
<summary>figures/l0_coef_vs_steps.png Details</summary>

### Visual Description
## Line Chart: L0 Coefficient over Training Steps
### Overview
The image is a line chart showing the relationship between the L0 Coefficient and the number of Training Steps (M). The L0 Coefficient increases linearly with training steps until it plateaus at a value of 2.00.
### Components/Axes
* **Title:** L0 Coefficient over Training Steps
* **X-axis:** Training steps (M)
* Scale: 0 to 200, with tick marks at intervals of 25 (0, 25, 50, 75, 100, 125, 150, 175, 200)
* **Y-axis:** L0 Coefficient
* Scale: 0.00 to 2.00, with tick marks at intervals of 0.25 (0.00, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00)
* **Data Series:** A single blue line representing the L0 Coefficient.
### Detailed Analysis
* **Data Series Trend:** The blue line starts at (0, 0.00). It increases linearly until approximately (150, 2.00). After this point, the line remains constant at 2.00 until (200, 2.00).
* **Data Points:**
* (0, 0.00)
* (25, 0.33) approximately
* (50, 0.66) approximately
* (75, 1.00)
* (100, 1.33) approximately
* (125, 1.66) approximately
* (150, 2.00)
* (175, 2.00)
* (200, 2.00)
### Key Observations
* The L0 Coefficient increases linearly with the number of training steps up to 150.
* After 150 training steps, the L0 Coefficient plateaus at a value of 2.00.
### Interpretation
The chart shows the $L_{0}$ coefficient, i.e. the strength $\lambda_{0}$ of the sparsity regularisation, being warmed up linearly from 0 to its final value of 2.00 (the "$L_{0}$ coefficient" of Table 3) over roughly the first 150M training steps, after which it is held constant. The plateau therefore reflects the warm-up schedule rather than model convergence; the gradual ramp lets the transcoder fit the reconstruction before the sparsity pressure reaches full strength.
</details>
(b) $L_{0}$ coefficient vs steps
<details>
<summary>figures/dead_features_vs_steps.png Details</summary>

### Visual Description
## Line Chart: Dead Features over Training Steps
### Overview
The image is a line chart that plots the number of "Dead Features" against "Training steps (M)". The chart illustrates how the number of dead features changes as the training progresses. The line starts near zero, increases rapidly, plateaus, and then increases again towards the end.
### Components/Axes
* **Title:** Dead Features over Training Steps
* **X-axis:** Training steps (M)
* Scale: 0 to 200, with tick marks at intervals of 25 (0, 25, 50, 75, 100, 125, 150, 175, 200)
* **Y-axis:** Dead Features
* Scale: 0 to 3500, with tick marks at intervals of 500 (0, 500, 1000, 1500, 2000, 2500, 3000, 3500)
* **Data Series:** One data series represented by a blue line.
### Detailed Analysis
* **Blue Line (Dead Features):**
* **Trend:** The line initially starts at approximately 0. It then increases rapidly between 0 and 75 training steps. The rate of increase slows down between 75 and 125 training steps, forming a plateau. After 150 training steps, the line begins to increase again, reaching approximately 3800 at 200 training steps.
* **Data Points (Approximate):**
* 0 Training Steps: ~0 Dead Features
* 25 Training Steps: ~200 Dead Features
* 50 Training Steps: ~1200 Dead Features
* 75 Training Steps: ~2400 Dead Features
* 100 Training Steps: ~2900 Dead Features
* 125 Training Steps: ~3050 Dead Features
* 150 Training Steps: ~3080 Dead Features
* 175 Training Steps: ~3400 Dead Features
* 200 Training Steps: ~3800 Dead Features
### Key Observations
* The number of dead features increases significantly during the initial training phase.
* The increase in dead features slows down and plateaus around 125 training steps.
* The number of dead features increases again towards the end of the training process.
### Interpretation
The chart shows that as training progresses, an increasing number of features become "dead" (rarely or never active). The initial rapid increase accompanies the early sparsification phase, in which many features stop being used. The plateau suggests a period of stabilization, and the final increase coincides with the sparsity pressure reaching full strength late in training, when additional features become redundant. The dead-feature penalty in the objective is designed to counteract this trend by pushing the pre-activations of inactive features back above threshold.
</details>
(c) Dead features vs steps
<details>
<summary>figures/explained_variance_vs_steps.png Details</summary>

Line chart of explained variance vs. training steps (M), x-axis 0-200, y-axis 0.5-1.0: the explained variance rises rapidly from ~0.5 to a peak of ~0.91 around 50M steps, then declines gradually to ~0.78 by 200M steps.
</details>
(d) Explained variance vs steps
Figure 16: Training dynamics of the cross-layer transcoder, showing sparsity, regularization strength, dead features, and reconstruction quality over training.
Appendix G Attribution Graph
Following Ameisen et al. (2025), we define the attribution score between feature $n$ at layer $\ell$ and position $k$ , and feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$ , as
$$
a_{\ell,k,n}^{\ell^{\prime},k^{\prime},n^{\prime}}=\sum_{\ell\leq s\leq\ell^{\prime}}f_{k,n}^{\ell\rightarrow s}\;J_{s,k}^{\ell^{\prime},k^{\prime}}\;g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}, \tag{11}
$$
where $f_{k,n}^{\ell\rightarrow s}$ denotes the decoder vector associated with feature $n$ projecting from layer $\ell$ to layer $s$, $J_{s,k}^{\ell^{\prime},k^{\prime}}$ is the Jacobian mapping the MLP output at $(s,k)$ to the MLP input at $(\ell^{\prime},k^{\prime})$, and $g_{k^{\prime},n^{\prime}}^{\ell^{\prime}}$ is the encoder vector of feature $n^{\prime}$ at layer $\ell^{\prime}$ and position $k^{\prime}$. The sum over intermediate layers $s$ reflects the cross-layer mapping of the cross-layer transcoder.
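Concretely, Eq. (11) is a sum of bilinear forms, one per intermediate layer. A minimal NumPy sketch (function name, shapes, and toy values are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def attribution_score(decoders, jacobians, encoder):
    """Sketch of Eq. (11): a = sum_s  f^{l->s} . J_s . g.

    decoders:  {s: decoder vector f^{l->s} of the source feature, shape (d,)}
    jacobians: {s: Jacobian from the MLP output at (s, k) to the MLP
                   input at (l', k'), shape (d, d)}
    encoder:   encoder vector g of the target feature, shape (d,)
    """
    return float(sum(decoders[s] @ jacobians[s] @ encoder for s in decoders))

# Toy check: with identity Jacobians the score reduces to sum_s <f^{l->s}, g>.
d = 4
decoders = {s: np.ones(d) for s in (2, 3, 4)}   # layers l <= s <= l'
jacobians = {s: np.eye(d) for s in (2, 3, 4)}
encoder = np.ones(d)
score = attribution_score(decoders, jacobians, encoder)  # 3 layers * 4 = 12.0
```

In practice each Jacobian is obtained from the frozen forward pass described below rather than formed explicitly.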
The Jacobian is computed during a modified forward pass in which all nonlinear operations, including normalization layers, attention mechanisms, and MLPs, are frozen using stop-gradient operations. The resulting attribution graph is pruned by retaining only the features that cumulatively explain $80\%$ of the contribution to the final logit, and only the edges that account for $95\%$ of the total edge-level effect. All attribution computations are performed with the circuit-tracer library (Hanna et al., 2025). For a complete description of the attribution-graph computation and pruning, we refer the reader to Ameisen et al. (2025).
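Both pruning thresholds follow the same cumulative-mass rule. A minimal sketch (the helper name and toy scores are illustrative assumptions; the actual pruning is performed by circuit-tracer):

```python
import numpy as np

def cumulative_threshold_mask(scores, frac):
    """Keep the smallest set of items whose absolute scores cumulatively
    account for at least `frac` of the total (hypothetical helper)."""
    mag = np.abs(np.asarray(scores, dtype=float))
    order = np.argsort(mag)[::-1]                  # sort largest first
    cum = np.cumsum(mag[order]) / mag.sum()
    n_keep = int(np.searchsorted(cum, frac)) + 1   # first index reaching frac
    mask = np.zeros(mag.shape, dtype=bool)
    mask[order[:n_keep]] = True
    return mask

# Features: keep those explaining 80% of the contribution to the final logit.
feature_mask = cumulative_threshold_mask([5.0, 3.0, 1.0, 0.5, 0.5], 0.80)
# Edges: keep those accounting for 95% of the total edge-level effect.
edge_mask = cumulative_threshold_mask([5.0, 3.0, 1.0, 0.5, 0.5], 0.95)
```

Sorting by magnitude first ensures the retained set is the smallest one reaching the target mass fraction.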
For visualization and autointerp, we built our own pipeline. Figure 17 shows a screenshot of the interface for the attribution graph of the prompt "The opposite of 'large' is '". Features are colored according to their cluster.
<details>
<summary>x14.png Details</summary>

Screenshot of the circuit-tracing interface, with the prompt tokens at the bottom (layer L0) and the output token "small" (16.32%) at the top (layer L11). Feature nodes across layers L0-L11 are colored by cluster: "opposite" (green, a node at L1 above the token opposite), "large" (orange, nodes at L1, L5, and L7 above the token large), "brackets" (light blue, nodes spanning L1-L10 feeding into "small"), and "say small" (pink, the output). Edge thickness indicates attribution strength; the interface also provides Add Text, Save State, and Load State controls.
</details>
Figure 17: Circuit-tracing interface example for the prompt "The opposite of 'large' is '" with GPT2-sparse.
Appendix H Graph: "The opposite of 'large' is '"
We obtain a replacement score of 0.82, with 459 features identified before pruning and 82 features remaining after pruning. The majority of features in the resulting attribution graph fall into four dominant clusters:
- Opposition cluster: features associated with opposition and comparison, primarily localized at the token position corresponding to opposite.
- Magnitude cluster: features related to notions of size (e.g., large, big, full, medium), predominantly located in the residual stream at the large token position.
- Bracket cluster: features that activate on tokens enclosed in brackets.
- Final-logit cluster: mainly the final logit itself and a few features that activate before the token "small" or related terms.
Below, we present the top activations of representative features for each cluster.
Feature 1117 ("Opposite" cluster):
- in Washington has now adopted the wider measure of student debt outstanding. This new
- the situation in Syria, Iran and the wider region. "The
- recharged by the wider dense forests of Sanjay Van and its overflow drained
- public, with interesting accounts of Oswald's demeanor at this significant moment
- has a slightly wider range. Specifically, the Atom-powered NANO 56
- becoming part of the wider Seven Years' War in which Britain and France

Feature 1337 ("Opposite" cluster):
- opposite, piece of Mexico's cultural identity. I made the hour
- opposite shows, or something bigger, "where there's villains
- opposite sides of Mars in 2004 and used their instruments to discover geologic evidence
- opposite, but not anymore. Now everything he says to me is some kind
- opposite direction, and had little trouble finding space at the campsites.
- always seem to be just the opposite.
- show a growing trend to cast "no" votes, opposing
- how much salary and the occupation of the opposing forces was generally limited to mutual observation.
- work hand in hand for the purpose of opposing all movements of the thinking part
- the defense's inability to stop opposing run games. The Bills have
- ing opposing quarterbacks. The Seahawks not only had depth, they were versatile.
- to win more hand battles particularly when the opposing tackle neutralizes his initial

Feature 901 ("Large" cluster):
- Let's be honest: When someone advocates for large-scale Muslim
- robot provides a tragicomic reminder of why RWD needs to consider large
- as what kind of social safety nets should be in place to protect people from large
- advocates to limit the power of large, established corporations, analysts say.
- of large law firms is that they are so great that the only reason anyone
- that by scaling up tests, the method would be conducive for use on larger

Feature 933 ("Large" cluster):
- people healthy and anticipating health issues before they become a problem. Big Data is
- Big brown bucks with funny accents." Judy flinched at
- BIG UP UBUNTU: Ubuntu releases are named after
- industry.<|endoftext|>BIG LEAGUE: Barron's Says The
- they need to submit their content in the same way. Big enough apps
- and offering alternatives routes. Big data and optical fiber

Feature 1004 ("Large" cluster):
- guide said was ? full of drinking saloons, dime museums, small
- would have 2 mana sources next turn (unless his hand was full of fast
- 's house, it's full of adventure itself.?
- statement that all German Catholics had a right to full transparency"
- glimpsing a lobby full of construction debris. The front hallway was full of
- Jokubas had recently been reading a newspaper article which was full of

Feature 412 ("Brackets" cluster):
- group answered either "very" or "somewhat"
- attached " except some work colleagues. Wilcox said she found it "highly unlikely
- very rare, "very likely," "high risk," she says.
- ulent. Pentagon spokesman Peter Cook said the sample was "
- on of PopMatters called the album "brilliant" and said
- arlene Lowe, described him as being "one of my biggest supporters".

Feature 518 ("Brackets" cluster):
- Kerry said Washington and Hanoi will "continue to have differences in opinions
- the United States will "take care of it." He told reporters after
- the legislation would "provide new enforcement tools for protecting our citizens and will help
- Gary Ross, said in a statement that the Air Force is currently "short
- said, and Syrian President Bashar al-Assad would "have to go".
- introduces politics into consumer policies," said Palmor, adding that it would "