2412.11965
# Inferring Functionality of Attention Heads from their Parameters
**Authors**:
- Amit Elhelo, Mor Geva (Blavatnik School of Computer Science, Tel Aviv University)
Abstract
Attention heads are one of the building blocks of large language models (LLMs). Prior work on investigating their operation mostly focused on analyzing their behavior during inference for specific circuits or tasks. In this work, we seek a comprehensive mapping of the operations they implement in a model. We propose MAPS (Mapping Attention head ParameterS), an efficient framework that infers the functionality of attention heads from their parameters, without any model training or inference. We showcase the utility of MAPS for answering two types of questions: (a) given a predefined operation, mapping how strongly heads across the model implement it, and (b) given an attention head, inferring its salient functionality. Evaluating MAPS on 20 operations across 6 popular LLMs shows its estimations correlate with the head's outputs during inference and are causally linked to the model's predictions. Moreover, its mappings reveal attention heads of certain operations that were overlooked in previous studies, and valuable insights on function universality and architecture biases in LLMs. Next, we present an automatic pipeline and analysis that leverage MAPS to characterize the salient operations of a given head. Our pipeline produces plausible operation descriptions for most heads, as assessed by human judgment, while revealing diverse operations. We release our code and mappings at https://github.com/amitelhelo/MAPS.
Amit Elhelo, Mor Geva (Blavatnik School of Computer Science, Tel Aviv University) {amitelhelw@mail,morgeva@tauex}.tau.ac.il
1 Introduction
<details>
<summary>x1.png Details</summary>

Diagram of a multi-head attention layer whose per-head parameters (W<sup>i</sup><sub>VO</sub>, W<sup>i</sup><sub>QK</sub>) are projected to the vocabulary, yielding a |V| x |V| matrix M; functionality is then inferred by analyzing mappings between tokens. Panel (A) evaluates the head's implementation of a predefined operation via a country-to-capital heatmap (France-Paris, Germany-Berlin, Egypt-Cairo), scored 0.7. Panel (B) inspects the head's salient operations via a name-variations heatmap (Tomas and Donna mapped against tommi, Don, Tom), scored 0.9.
</details>
Figure 1: Illustration of MAPS, a framework for inferring the functionality of attention heads in LLMs from their parameters. MAPS casts the head as a matrix $M$ which assigns a score for every pair of tokens in the model's vocabulary. Then, it considers groups of token pairs (sub-matrices in $M$) to measure how strongly the head implements a given operation (A) and to inspect the head's salient operations (B).
Attention heads play a key role in modern large language models (LLMs) (Vaswani et al., 2017; Zhou et al., 2024; Olsson et al., 2022). Numerous studies (Zheng et al., 2024; Ferrando et al., 2024) have explored their functionality, typically by analyzing their attention patterns or outputs during inference for certain inputs or tasks.
However, relying on the model's behavior for certain inputs has drawbacks. First, this approach may overlook some of the functions implemented by the head, as heads can exhibit different behaviors for different inputs (Gould et al., 2024; Merullo et al., 2024a; Olsson et al., 2022; Kissane et al., 2024). Second, a comprehensive analysis of the head's operation would require executing the model over numerous inputs, potentially the whole training corpus, which involves a high computational cost and could be impossible when the data is unavailable. Last, analyzing the examples that activate the head is often non-trivial and could be misleading (Bolukbasi et al., 2021; Gao et al., 2024; Kissane et al., 2024).
In this work, we consider a different approach to this problem, where our goal is to infer the functionality of attention heads directly from their parameters and without executing the model. To this end, we leverage the approach of interpreting model parameters in the vocabulary space (Geva et al., 2021, 2022; Katz et al., 2024). Specifically, we build on the formulation by Elhage et al. (2021); Dar et al. (2023), who cast the attention head as a matrix $M$, where each entry is a mapping score between two tokens. While this approach has been shown effective in identifying heads with certain operations, so far its usage has been limited to studying specific heads in detected circuits (Wang et al., 2023; McDougall et al., 2024) or a single operation (Gould et al., 2024).
Here, we scale this interpretation approach into a general framework, called MAPS (Mapping Attention heads ParameterS), which enables answering two types of basic questions: (a) given a predefined operation, mapping how strongly different heads across the model implement it, and (b) given an attention head, inferring its prominent operations. This is done by considering patterns across groups of mappings in $M$, as illustrated in Figure 1. Predefined relations signify groups of mappings expressing a certain relation (e.g., the city of a country or pronoun resolution). Salient operations consist of subsets of mappings for which the head induces the most prominent effect. In addition, analyzing simple statistics of these mappings provides insights into how global or specific the head's operation is.
We evaluate our framework on 6 popular LLMs and 20 predefined relations of four categories: knowledge, language, algorithmic, and translation. Experiments show that estimations by MAPS strongly correlate with the head outputs during inference. Moreover, causally removing all the heads implementing a certain operation substantially impairs the model's ability to answer queries requiring this operation, compared to removing other heads.
Analysis of the obtained mappings shows that, across all models, MAPS detects relation heads mostly in the middle and upper layers, while revealing universality patterns for several relations. Moreover, it demonstrates how the modelâs architecture introduces biases in function encoding. Smaller models tend to encode higher numbers of relations on a single head, and in Llama-3.1 models, which use grouped-query attention, grouped attention heads often implement the same or similar relations. Notably, MAPS successfully detected previously identified heads of specific operations, while discovering additional heads of similar operations not reported before.
Next, we demonstrate the utility of MAPS for inferring the prominent operations of a given head. We consider the head's salient mappings in $M$ and use GPT-4o (Hurst et al., 2024) to automatically describe the functionality they exhibit. Applying this procedure to GPT-2 xl and Pythia 6.9B, we map the prominent operations of 62% of their heads and 60%-96% of those in the middle and upper layers. Qualitative analysis shows semantic, linguistic, and algorithmic operations and reveals novel operations, such as the extension of time periods (day -> month; month -> year). A human study shows that our automated pipeline performs reasonably well, and GPT-4o reliably detects observable operations.
To conclude, we introduce MAPS, an efficient framework for inferring attention heads' functionality from their parameters. We showcase the utility of MAPS in systematically mapping a certain functionality across the model and automatically characterizing the salient operations of a given head. Estimations by MAPS correlate with the head's outputs, are faithful to the model's behavior, and provide valuable insights on architecture biases and universality of head operations in LLMs.
2 Preliminaries and Notation
We assume a transformer-based LM with a hidden dimension $d$, $L$ layers, $H$ attention heads per layer, a vocabulary $\mathcal{V}$, an embedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, and an unembedding matrix $U \in \mathbb{R}^{d \times |\mathcal{V}|}$.
Attention heads as interaction matrices
We use the formulation by Elhage et al. (2021) and view an attention head as two "interaction" matrices $W_{QK}, W_{VO} \in \mathbb{R}^{d \times d}$. Given a sequence of $n$ hidden states $X \in \mathbb{R}^{n \times d}$, the matrix $W_{QK}$ computes the query-key scores to produce an attention weights matrix $A \in \mathbb{R}^{n \times n}$:
$$
A=\text{softmax}\Bigg{(}\frac{X(W_{QK})X^{T}}{\sqrt{d/H}}\Bigg{)}
$$
The matrix $W_{VO}$ operates on the contextualized hidden states according to $A$, namely $\tilde{X}=AX$, and produces the head's output $Y \in \mathbb{R}^{n \times d}$:
$$
Y=\tilde{X}W_{VO} \tag{1}
$$
The matrix $W_{QK}$ can be viewed as "reading" from the residual stream, and $W_{VO}$ can be viewed as the "writing" component. Notably, this formulation omits the bias terms of the head.
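As a concrete sketch, the two-matrix view above fits in a few lines of NumPy. Causal masking and bias terms are omitted (matching the formulation here), and random matrices stand in for trained weights:

```python
import numpy as np

def head_forward(X, W_QK, W_VO, d, H):
    """Run one attention head written as interaction matrices.

    X: (n, d) hidden states; W_QK, W_VO: (d, d).
    Returns the attention weights A and the head output Y (Eq. 1).
    """
    scores = X @ W_QK @ X.T / np.sqrt(d / H)
    # Row-wise softmax, numerically stabilized.
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    Y = (A @ X) @ W_VO  # contextualize, then "write": Eq. 1
    return A, Y

rng = np.random.default_rng(0)
n, d, H = 4, 8, 2
X = rng.normal(size=(n, d))
A, Y = head_forward(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)), d, H)
assert np.allclose(A.sum(axis=-1), 1.0) and Y.shape == (n, d)
```

In a real model $W_{QK} = W_Q W_K^{T}$ and $W_{VO} = W_V W_O$ would be assembled from the checkpoint's per-head projections.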
Interpreting attention heads in embedding space
Recent works have analyzed the operation of different components in transformers through projection to the model's vocabulary space (nostalgebraist, 2020; Geva et al., 2021, 2022; Dar et al., 2023; Katz et al., 2024). Specifically, Elhage et al. (2021); Dar et al. (2023) interpret each of the attention head matrices, $W_{QK}$ and $W_{VO}$, as a matrix that maps between pairs of tokens from the vocabulary. Considering $W_{VO}$, it is interpreted via multiplication from both sides with the model's embedding matrix: $\tilde{M}=E(W_{VO})E^{T} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$. Each entry in $\tilde{M}$ is viewed as a mapping score between source and target tokens $s,t \in \mathcal{V}$ based on $W_{VO}$, which signifies how strongly the head promotes the mapping in its outputs. Elhage et al. (2021) suggested that when the weights of $E$ and $U$ are not tied, a more faithful interpretation can be obtained by:
$$
M=E(W_{VO})U
$$
Other notable variations include applying the modelâs first MLP layer to the embedding matrix $E$ (Gould et al., 2024) and the final layer norm on rows of $E(W_{VO})$ (Wang et al., 2023).
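To make the projection concrete, here is a minimal NumPy sketch of deriving $M$ from a head's value and output projections. In a real model $E$, $U$, $W_V$, and $W_O$ would come from the checkpoint; random matrices stand in here:

```python
import numpy as np

def head_to_token_matrix(E, W_V, W_O, U):
    """M = E (W_V W_O) U: entry M[s, t] is the mapping score from
    source token s to target token t induced by the head's OV circuit."""
    return E @ (W_V @ W_O) @ U

rng = np.random.default_rng(0)
V, d, d_head = 50, 16, 4
E = rng.normal(size=(V, d))         # embedding matrix
U = rng.normal(size=(d, V))         # unembedding matrix (untied from E)
W_V = rng.normal(size=(d, d_head))  # per-head value projection
W_O = rng.normal(size=(d_head, d))  # per-head output projection
M = head_to_token_matrix(E, W_V, W_O, U)
assert M.shape == (V, V)
```

The variations mentioned above (applying the first MLP layer to $E$, or the final layer norm to rows of $E(W_{VO})$) would slot in as pre-/post-processing of the same product.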
3 MAPS
Based on the above view, we propose a general framework, called MAPS, for inferring the functionality of attention heads in LLMs directly from their parameters. We focus on analyzing the $W_{VO}$ component of the head, which produces the head's output to the residual stream, and make the following observations. First, the $i$-th row of $M$ provides the scores for mappings from the $i$-th token to any token in $\mathcal{V}$. Similarly, the $j$-th column of $M$ provides scores for mappings from any token in $\mathcal{V}$ to the $j$-th token. Therefore, considering the scores of certain submatrices of $M$ may reveal how the attention head operates on different sets of inputs. For example, analyzing the rows corresponding to tokens representing countries may reveal general knowledge-related operations implemented by the head, and attention heads that copy certain tokens should have diagonal-like submatrices in $M$.
An important question that arises is which parts of $M$ to consider in order to identify the head's functionality. In principle, there are $2^{|\mathcal{V}|}$ different subsets of rows that can be considered, which would be infeasible to traverse with $|\mathcal{V}|=\mathcal{O}(10K)$ in typical LLMs. Here, we propose two complementary ways to approach this, described next.
3.1 Predefined Relations
One intuitive approach is to define a set of possible operations that can be realized through pairs of tokens, and then measure the extent to which the head implements each operation. For example, the operation of mapping a country to its capital can be realized through a set of token pairs expressing that relation, e.g. (France, Paris) or (Egypt, Cairo). Similarly, mapping between synonyms can be realized via pairs such as (talk, speak) and (fast, quick). Such operations can be viewed as an implementation of relations between tokens.
Let $R$ be a predefined relation and $\mathcal{D}_{R}$ a dataset of token pairs expressing $R$. Also, denote by $\mathbf{m}_{i} \in \mathbb{R}^{|\mathcal{V}|}$ the $i$-th row of $M$ (corresponding to the mapping scores of the $i$-th token), and by $\texttt{topk}(\mathbf{m}_{i})$ the $k$ tokens with the highest scores in $\mathbf{m}_{i}$. The extent to which an attention head, interpreted as the matrix $M$, implements $R$ can be measured as the portion of pairs $(s,t) \in \mathcal{D}_{R}$ where $t$ is in the top-scoring tokens in $\mathbf{m}_{s}$:
$$
\phi_{R}(M):=\frac{1}{|\mathcal{D}_{R}|}\sum_{(s,t)\in\mathcal{D}_{R}}\mathds{1}[t\in\texttt{topk}(\mathbf{m}_{s})] \tag{2}
$$
For instance, the score for $R=$ "country to capital" reflects how often the head promotes the capital city of a country in its output when operating on an input representation of that country.
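Eq. 2 reduces to a few lines once $M$ and a relation dataset of token-id pairs are in hand; a minimal sketch (toy $M$, not the paper's released code):

```python
import numpy as np

def relation_score(M, pairs, k=10):
    """Eq. 2: fraction of (s, t) pairs for which target t is among
    the k highest-scoring entries of row s in M."""
    hits = sum(int(t in np.argsort(M[s])[-k:]) for s, t in pairs)
    return hits / len(pairs)

# Toy example: a head that maps token 0 -> 1 but not 2 -> 3.
M = np.zeros((5, 5))
M[0, 1] = 1.0  # strong mapping 0 -> 1: the pair (0, 1) is a hit
M[2, 4] = 1.0  # the head maps 2 -> 4, so the pair (2, 3) is a miss
print(relation_score(M, [(0, 1), (2, 3)], k=1))  # 0.5
```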
Notably, our formulation also supports suppression operations observed in previous work (Wang et al., 2023; Gould et al., 2024; McDougall et al., 2024), where certain attention heads suppress certain concepts or outputs during inference. Representing a suppressive relation is done by defining the pairs $(s,t)$ as before and considering the top-scoring tokens in $-\mathbf{m}_{s}$ instead of $\mathbf{m}_{s}$ .
3.2 Salient Operations
The main limitation of the above approach is that it could miss certain relations that heads implement. A complementary approach would be to characterize the head's functionality from prominent mappings appearing in $M$. Dar et al. (2023) tackled this by considering the top-scoring mappings in $M$. However, we recognize two drawbacks in this method: (a) the scores in $M$ are influenced by the token embedding norms, which could bias the top scores towards mappings of tokens with high embedding norms, and (b) the top entries in $M$ may cover mappings from a small number of tokens (e.g., from a single row), thus describing the head's functionality for only a few tokens.
Here, we propose a more holistic approach to identify salient mappings in $M$: first identify the tokens on which the head's operation is most prominent, and then consider the top-scoring mappings for these tokens. We measure how prominent the head's operation is on a token $t \in \mathcal{V}$ via the ratio of the token's embedding norm after multiplication by $W_{VO}$ to its norm before this transformation:
$$
\sigma_{t}(W_{VO}):=\frac{||\mathbf{e}_{t}W_{VO}||}{||\mathbf{e}_{t}||} \tag{3}
$$
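A sketch of Eq. 3 and of the two-step selection it feeds into (helper names are ours; the paper's exact selection procedure may differ in details such as how many tokens are kept):

```python
import numpy as np

def token_saliences(E, W_VO):
    """Eq. 3: sigma_t = ||e_t W_VO|| / ||e_t|| for every token t."""
    return np.linalg.norm(E @ W_VO, axis=1) / np.linalg.norm(E, axis=1)

def salient_mappings(E, W_VO, U, n_tokens=5, k=3):
    """Keep the n_tokens most amplified source tokens, then return
    their top-k target tokens from M = E W_VO U."""
    srcs = np.argsort(token_saliences(E, W_VO))[-n_tokens:]
    M = E @ W_VO @ U
    return {int(s): np.argsort(M[s])[-k:][::-1].tolist() for s in srcs}

# Toy check: with E = U = I and a diagonal W_VO, token 3 is the most
# amplified source, and its top target is itself.
out = salient_mappings(np.eye(4), np.diag([1.0, 1.0, 1.0, 5.0]), np.eye(4),
                       n_tokens=1, k=1)
print(out)  # {3: [3]}
```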
Comparing the sets of top versus salient mappings indeed shows substantial differences. The average Jaccard similarity of the sets obtained for heads in GPT-2 xl is 0.01. In the next sections, we experiment with both approaches, showing their effectiveness in inferring attention head functionality in multiple LLMs.
4 Mapping Predefined Relations
In this section, we utilize MAPS to map how strongly attention heads implement various operations in multiple LLMs (§ 4.1). We assess the correctness and generalization of these estimations via correlative and causal experiments (§ 4.2, § 4.3) and analyze prominent trends (§ 4.4).
4.1 Experimental Setup
Datasets
We construct datasets for 20 relations of four categories: algorithmic (e.g., word to first letter), knowledge (e.g., country to capital), linguistic (e.g., adjective to comparative), and translation (English to French/Spanish), covering 3 vocabularies of widely-used model families. For every relation, we collect pairs of strings expressing it. For instance, possible pairs for the relation word-to-compound are (hot, hotdog) and (wall, wallpaper). Data is obtained from previously published datasets and online sources and further augmented by querying ChatGPT to generate example pairs, which we (the authors) manually validated. Then, we tokenize the pairs with each of the tokenizers of Llama-3.1 (Dubey et al., 2024), Pythia (Biderman et al., 2023), GPT-2 (Radford et al., 2019), and Phi-2 (Javaheripi and Bubeck, 2023), keeping only cases where the resulting mapping is between single tokens. Experimenting with different tokenizers is important as MAPS leverages the model's vocabulary. Llama-3.1's vocabulary has $\sim$130k tokens compared to $\sim$50k tokens for GPT-2, Phi-2, and Pythia. For more details on the collection, dataset statistics, and examples, see § A.
Models
We analyze models of various sizes from different families: Llama-3.1 8B and 70B (Dubey et al., 2024), Pythia 6.9B and 12B (Biderman et al., 2023), Phi-2 (Javaheripi and Bubeck, 2023), and GPT-2 xl (Radford et al., 2019). These models have varying numbers of layers and attention heads, from 32 layers and 32 heads in Pythia 6.9B to 80 layers and 64 heads in Llama-3.1 70B. Additionally, Llama-3.1 uses grouped-query attention (Ainslie et al., 2023), whereas the other models use multi-head attention (Vaswani et al., 2017).
Measuring predefined relations
For every attention head and relation $R$ , we derive the matrix $M$ and calculate the relation score $\phi_{R}(M)$ (Eq. 2). We also compute the score for the suppressive variant $\bar{R}$ of every relation $R$ . For example, the suppressive variant of $R=\texttt{country to capital}$ corresponds to the operation of suppressing the capital of a given country.
We follow previous works (Dar et al., 2023; Geva et al., 2021, 2022) and set low $k$ values to reflect strong prioritization of the target token in the headâs output. For Pythia, Phi-2 and GPT-2, we use $k=1$ for the copying and name-copying relations and $k=10$ for other relations. For the Llama-3.1 models, we set $k=3$ for copying and name-copying and $k=25$ for other relations. The bigger values for Llama-3.1 are due to their large vocabulary, which allows expressing a concept with more tokens. The smaller values for the copying relations are for measuring them more strictly. For further discussion on this selection, see § A.
To classify whether a head "implements" a relation $R$, we apply a threshold $\tau$ to $\phi_{R}(M)$. Namely, if $t$ appears in the top-$k$ mappings of $s$ for at least $\tau$ percent of the pairs $(s,t) \in \mathcal{D}_{R}$, then we consider the head as implementing $R$. We choose a threshold of $\tau=15\%$ after experimenting with different thresholds and comparing against randomly initialized heads (see § A for details).
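Putting § 4.1 together, the model-wide mapping for one relation can be sketched as follows (toy matrices; `head_matrices`, a dict of per-head $M$ matrices keyed by (layer, head), is a hypothetical container, not an API from the released code):

```python
import numpy as np

def map_relation_heads(head_matrices, pairs, k=10, tau=0.15):
    """Return the heads classified as implementing the relation:
    those whose score phi_R(M) (Eq. 2) reaches the threshold tau."""
    classified = {}
    for (layer, head), M in head_matrices.items():
        hits = sum(int(t in np.argsort(M[s])[-k:]) for s, t in pairs)
        score = hits / len(pairs)
        if score >= tau:
            classified[(layer, head)] = score
    return classified

# Toy example: head (0, 0) maps 0 -> 1; head (0, 1) maps 0 -> 2.
M_good, M_bad = np.zeros((4, 4)), np.zeros((4, 4))
M_good[0, 1] = 1.0
M_bad[0, 2] = 1.0
heads = map_relation_heads({(0, 0): M_good, (0, 1): M_bad}, [(0, 1)], k=1)
print(heads)  # {(0, 0): 1.0}
```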
4.2 Evaluation of Functionality Estimation
We evaluate whether the functionality estimations by MAPS faithfully describe the operations of the heads during inference. Our experiments show that the estimated operation of a head strongly correlates with its outputs and demonstrates the expected causal effect on the modelâs generation.
Experiment 1: Correlation with head outputs
For every relation $R$ and source-target pair $(s,t) \in \mathcal{D}_{R}$, we evaluate the model using four prompt templates (provided in § B.1); we do not simply feed in $s$ as input, to avoid potential biases from the attention sink phenomenon (Xiao et al., 2024). One representative template is:
$$
\mathcal{P}_{s} := \text{``This is a document about } \langle s \rangle \text{''}
$$
where $\langle s \rangle$ is the string of the source token $s$. For example, for the pair (England, London), we get "This is a document about England". Next, we obtain the output $\mathbf{y}_{s} \in \mathbb{R}^{d}$ of every attention head at the last position (corresponding to $s$; here the head outputs include the bias term of $W_{V}$, see § B.1) and project it to the model's vocabulary space, i.e., $\mathbf{y}_{s}U \in \mathbb{R}^{|\mathcal{V}|}$. The top-scoring tokens in the resulting vector are those promoted by the head given the prompt $\mathcal{P}_{s}$ (Geva et al., 2022). To check whether the head implements the relation $R$, namely promotes $t$ when given $s$ in the input, we test for every pair $(s,t)$ whether $t$ appears in the top $k$ tokens in $\mathbf{y}_{s}U$. We use the same $k$ values specified in § 4.1. Concretely, for every head $h$ we compute the following score, which represents how strongly the head implements $R$ during inference:
$$
\phi^{*}_{R}(h):=\frac{1}{|\mathcal{D}_{R}|}\sum_{(s,t)\in\mathcal{D}_{R}}\mathds{1}[t\in\texttt{topk}(\mathbf{y}_{s}U)] \tag{4}
$$
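Eq. 4 mirrors Eq. 2 with $\mathbf{y}_{s}U$ in place of rows of $M$; a minimal sketch (toy vectors; in practice each $\mathbf{y}_{s}$ would be read off the model with a forward hook):

```python
import numpy as np

def inference_relation_score(head_outputs, U, pairs, k=10):
    """Eq. 4: fraction of pairs (s, t) where t is among the top-k
    vocabulary tokens promoted by the head output y_s for prompt P_s.

    head_outputs: dict mapping source token id s -> y_s, shape (d,).
    U: (d, |V|) unembedding matrix.
    """
    hits = 0
    for s, t in pairs:
        logits = head_outputs[s] @ U  # project y_s to the vocabulary
        hits += int(t in np.argsort(logits)[-k:])
    return hits / len(pairs)

# Toy check with d = |V| and U = identity: y_0 promotes token 1.
U = np.eye(3)
outputs = {0: np.array([0.0, 2.0, 0.0])}
print(inference_relation_score(outputs, U, [(0, 1)], k=1))  # 1.0
```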
We check the correlation between the static score $\phi_{R}(h)$ inferred by our method and the dynamic score $\phi^{*}_{R}(h)$ computed separately for each of the four templates. As a baseline, we compute $\phi^{*}_{R}(h)$ while restricting the attention in $h$ from $s$ to be only to itself. This emulates an operation of the head as if it fully attends to the representation of $s$ .
Results
Table 1 shows the results for Llama-3.1 8B. For the vast majority of relations, we observe a strong to very strong correlation of 0.71-0.95 (Schober et al., 2018) when the query's subject is not contextualized. This high correlation often remains or even increases when considering the head's outputs for contextualized inputs, showing that MAPS estimates the head's behavior well for task-related inputs. Still, for some relations (e.g., word to compound and word to last letter) the correlation is lower for contextualized inputs, demonstrating that in some cases the head may switch its operation depending on the context. This agrees with the observation that heads often implement multiple operations (§ 4.4). Results for other models are in § B.1, generally exhibiting similar trends, though with occasional larger drops in the contextualized setting for Pythia and GPT-2 xl.
| Category | Relation | Correlation w/o context. | Correlation w/ context. |
| --- | --- | --- | --- |
| Algorithmic | Copying | 0.76 | 0.73 |
| | Name copying | 0.95 | 0.95 |
| | Word to first letter | 0.90 | 0.78 |
| | Word to last letter | 0.67 | 0.36 |
| Knowledge | Country to capital | 0.85 | 0.85 |
| | Country to language | 0.76 | 0.62 |
| | Object to superclass | 0.74 | 0.73 |
| | Product by company | 0.46 | 0.49 |
| | Work to location | 0.44 | 0.45 |
| Linguistic | Word to antonym | 0.90 | 0.86 |
| | Adj to comparative | 0.85 | 0.86 |
| | Adj to superlative | 0.87 | 0.89 |
| | Noun to pronoun | 0.89 | 0.79 |
| | Verb to past tense | 0.91 | 0.86 |
| | Word to compound | 0.78 | 0.62 |
| | Word to homophone | 0.85 | 0.75 |
| | Word to synonym | 0.79 | 0.69 |
| Translation | English to French | 0.71 | 0.68 |
| | English to Spanish | 0.82 | 0.81 |
Table 1: Correlation between the relation score of a head and the head's outputs in Llama-3.1 8B, with and without head contextualization. Results are statistically significant with p-values $\leq$ 3.9e-128 (see § B.1).
| Relation | TR: Base | TR: -TR | TR: -RND | CTR: Base | CTR: -TR |
| --- | --- | --- | --- | --- | --- |
| Adj to comparative | 0.91 | 0.20 | 0.82 | 0.92 | 0.63 |
| Copying | 1.00 | 0.68 | 1.00 | 0.95 | 0.88 |
| Country to capital | 0.97 | 0.00 | 0.95 | 0.89 | 0.90 |
| Country to language | 1.00 | 0.08 | 0.96 | 0.89 | 0.89 |
| Name copying | 1.00 | 0.24 | 1.00 | 0.90 | 0.92 |
| Noun to pronoun | 0.88 | 0.46 | 0.86 | 0.90 | 0.88 |
| Object to superclass | 0.78 | 0.39 | 0.68 | 0.90 | 0.87 |
| Verb to past tense | 0.22 | 0.04 | 0.26 | 0.03 | 0.02 |
| Word to first letter | 0.91 | 0.34 | 0.87 | 0.91 | 0.74 |
| Year to following | 0.92 | 0.00 | 0.87 | 0.83 | 0.79 |
Table 2: Accuracy of Pythia 12B on tasks for a target relation (TR) versus on control (CTR) tasks, when removing heads implementing the relation compared to when removing random heads (RND). Results for RND heads are averaged over 5 experiments. We omit standard deviation for brevity and report it in § B.2.
Experiment 2: Causal effect on model outputs
For a given relation $R$, we evaluate the model's performance on queries that require applying $R$, when removing the heads classified by MAPS as implementing $R$ versus when removing random heads from the model. We choose a diverse set of 13 relations and construct a test set $\tilde{\mathcal{D}}_{R}$ for every relation $R$ as follows. First, we craft a task prompt that requires the model to apply $R$. For example, a prompt for the country to capital relation could be "The capital of $\langle s \rangle$ is", with $\langle s \rangle$ being a placeholder for a country. Then, for each pair $(s,t) \in \mathcal{D}_{R}$ we instantiate the prompt with $s$ to create an input $\tilde{\mathcal{P}}_{s}$ and a test example $(\tilde{\mathcal{P}}_{s},t) \in \tilde{\mathcal{D}}_{R}$.
Let $\mathcal{H}_{R}^{i}$ be the subset of $i$ attention heads with the highest scores for $\phi_{R}(M)$. We evaluate the models on $\tilde{\mathcal{D}}_{R}$ while running each input $n$ times, each time canceling (by setting to zero) the outputs of the attention heads $\mathcal{H}_{R}^{i}$ and obtaining the model's prediction with greedy decoding. We set $n$ as the minimum between the number of heads in the model with $\phi_{R}(M)>0$ and a fixed boundary: 150 for GPT-2 xl, Pythia 6.9B, Pythia 12B, and Llama-3.1 8B, and 250 for Llama-3.1 70B. In cases where the accuracy drops to 0 after ablating $i<n$ heads, we report results obtained up to $i$.
We compare the above intervention against a baseline where $i$ randomly sampled heads that are not in $\mathcal{H}_{R}^{i}$ are ablated, repeating this experiment 5 times and reporting the average accuracy. Additionally, to establish that relation heads are important specifically for tasks involving $R$, we remove the relation heads as above and measure the model's performance on up to five control tasks for other relations. We choose the relations such that $<15\%$ of the target relation heads are also control relation heads, and the absolute difference between the baseline accuracy on the control task and the target task is $\leq 20\%$.
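Since the attention layer's contribution to the residual stream decomposes into a sum of per-head outputs (the view of § 2), zero-ablation amounts to dropping the selected heads' terms. A minimal NumPy sketch of that arithmetic (real experiments would instead zero the head outputs via forward hooks on the model):

```python
import numpy as np

def layer_output_with_ablation(head_outputs, heads_to_zero):
    """Sum per-head outputs, canceling (zeroing) the chosen heads.

    head_outputs: (H, n, d) array, one (n, d) output per head.
    heads_to_zero: indices of the heads in H_R^i to remove.
    """
    keep = np.ones(head_outputs.shape[0], dtype=bool)
    keep[list(heads_to_zero)] = False
    return head_outputs[keep].sum(axis=0)

# Toy example: 3 heads, each contributing all-ones; ablate head 0.
outs = np.ones((3, 2, 4))
y = layer_output_with_ablation(outs, [0])
assert np.allclose(y, 2.0) and y.shape == (2, 4)
```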
Results
Results for Pythia 12B are presented in Table 2, excluding relations where the base accuracy was $<0.1$. For all relations, removing the relation heads identified by MAPS causes a major accuracy drop of $\geq 32\%$, compared to $\leq 13\%$ when removing random heads. Moreover, while the accuracy drop on the control tasks is considerable in some cases (at most 33%), it is significantly smaller than the relative drop on the target relation task. Results for the other models are generally similar (see § B.2). Notable differences are that the accuracy drops in Llama-3.1 are often smaller, but in 9 out of 11 relations they are larger than those obtained for the random and control baselines.
4.3 Generalization to Multi-Token Entities
A natural question that arises is how well the estimations by MAPS generalize to contextualized inputs representing multiple tokens. Namely, if we infer the headâs ability to perform country-to-capital mappings from country names tokenized as a single token, will we observe the same behavior for countries tokenized as multiple tokens?
To test this, we apply the data collection process from § 4.1 to create new datasets for 11 relations of source-target pairs $(s,t)$ where $s$ consists of multiple tokens. Then, we repeat the correlative experiment in § 4.2 for GPT-2 xl, Pythia 6.9B, and Pythia 12B using this data and the prompt template "This is a document about $\langle s \rangle$".
We observe that the estimated operations generalize to multi-token representations. For 53 out of the 64 model-relation combinations (with and without contextualization), the correlation between the relation score and the head's output in the multi-token setting is similar to ($\leq 0.05$ difference) or higher than in the single-token setting. In the remaining cases, there is a slightly bigger drop ($\leq 0.13$), but the correlations remain $\geq 0.63$. The full results are provided in § C.
4.4 Analysis
Function distribution
Figure 2 shows category-level classification results of all heads in GPT-2 xl, Phi-2, Pythia 12B, and Llama-3.1 70B. A head is assigned to a certain category if it implements at least one relation from it or its suppressive variant. Considering prominent trends across all models, we first observe that MAPS identified relations from all categories, with classified heads mostly located in the middle and upper layers. This may suggest that early layers perform operations that cannot be represented in the model's output vocabulary space. Interestingly, we observe a "side effect" of the grouped attention structure in Llama-3.1 models, where grouped heads often implement the same relations or their suppressive variants.
In addition, heads often implement multiple relations from the same or different categories. The portion of multi-category heads (out of all classified heads) generally decreases with model size: 38% in GPT-2 xl, 29% in Phi-2, 20% in Pythia 6.9B and Pythia 12B, and 11% in Llama-3.1 70B. An exception to this trend is Llama-3.1 8B, with 11% multi-category heads, which may be caused by its grouped-query attention structure. Also, 20%-36% of the classified heads implement at least one suppression relation.
<details>
<summary>x2.png Details</summary>

### Visual Description
Heatmaps of the category distribution across layers (x-axis) and heads (y-axis) for four models: GPT-2 xl, Phi-2, Pythia 12B, and Llama-3.1 70B. Cell colors denote the assigned category: gray (unclassified), blue (algorithmic), orange (knowledge), green (linguistic), red (translation), and purple, brown, and pink for heads assigned 2, 3, and 4 categories, respectively. Classified heads concentrate in the middle and upper layers in all models; "Knowledge" and "Linguistic" are the most common categories, and Llama-3.1 70B shows the sparsest distribution.
</details>
Figure 2: Functionality mapping by MAPS for 20 relations of 4 categories (algorithmic, knowledge, linguistic, translation) across all attention heads in GPT-2 xl, Phi-2, Pythia 12B, and Llama-3.1 70B. A head is marked with a specific category if it implements at least one relation from that category.
Function universality
Figure 3 presents the distributions of relation scores for several representative relations in multiple models, showing two interesting trends. First, despite architecture and training data differences, models encode relations in their heads to similar degrees, as observed by the similar highest scores per relation. This observation supports the "universality hypothesis" (Li et al., 2015) that different networks learn similar features and circuits, and extends recent findings about universality in LLMs (Gould et al., 2024; Arditi et al., 2024; Tigges et al., 2024). Second, the scores for a given relation are diverse, with different heads implementing the relation to varying degrees, as opposed to a small set of heads with high relation scores. This has implications for research concerning localization and editing: certain concepts or associations are encoded in a large number of model components at varying degrees.
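The universality comparison above amounts to comparing, per relation, the highest relation score attained by any head in each model. A minimal sketch, with made-up data structures:

```python
def top_scores_per_relation(model_scores):
    """model_scores: dict model -> dict relation -> list of per-head scores.
    Returns dict relation -> dict model -> highest score attained by any head."""
    top = {}
    for model, relations in model_scores.items():
        for relation, scores in relations.items():
            top.setdefault(relation, {})[model] = max(scores)
    return top
```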
Comparison with known head functionalities
Wang et al. (2023) identified "Name Mover" and "Anti Name Mover" heads in a circuit for indirect object identification in GPT-2 small, which copy or suppress copying specific names in the context, and Merullo et al. (2024a) identified "Mover" and "Capital" heads in GPT-2 medium. MAPS successfully identified all these heads as name copiers or country-to-capital mappers (which agrees with a similar analysis conducted by Wang et al., 2023). In addition, it discovered 25 heads in GPT-2 small and 46 in GPT-2 medium that implement similar operations but were not recognized in prior analyses. While the additional heads may not participate in the specific circuits discovered, they may be triggered in circuits of similar or related tasks that were overlooked in previous analyses.
Notably, for all the heads identified in previous works, MAPS reveals various additional functionalities. These observations extend the findings by Merullo et al. (2024a) of heads that implement multiple functionalities.
Taken together, these results demonstrate the effectiveness of MAPS in comprehensively mapping the implementation of a certain operation by attention heads across the model. A more detailed comparison is in § D.
<details>
<summary>x3.png Details</summary>

### Visual Description
Strip plot of per-head relation scores (x-axis, 0.0 to 1.0) for five relations (adj to antonym, word to homophone, word to synonym, work to location, country to capital) in four models: GPT-2 xl (blue), Pythia 6.9B (orange), Phi-2 (green), and Llama-3.1 70B (red). Scores for "word to homophone" are uniformly low, "country to capital" is bimodal for all models with clusters near 0.0 and near 1.0, and the highest scores per relation are similar across models.
</details>
Figure 3: Relation scores for all heads of Llama-3.1 70B, Pythia 6.9B, Phi-2, and GPT-2 xl for several relations. We observe that heads from all models implement these relations to similar degrees.
5 Inspecting Salient Operations
We saw that, given an operation realized as a relation between pairs of tokens, we can map how strongly it is implemented by attention heads across the model. Here, we use MAPS to tackle the complementary problem of inferring the prominent operations of a given attention head. We introduce an automatic pipeline for interpreting salient mappings in attention heads (§ 5.1) and use it to broadly infer the functionalities in Pythia 6.9B and GPT-2 xl (§ 5.2). In § F, we extend our analysis to show that the skewness of saliency scores can indicate how global or specific the head's functionality is.
5.1 Automatic Functionality Inference
We propose the following steps for inferring the functionality of an attention head:
1. Using the saliency score (Eq. 3) to identify the top $k$ tokens for which the head's transformation is most prominent.
2. For each salient token $s$, collecting the top $n$ tokens it is mapped to according to $M$, namely, the tokens corresponding to the top entries in $\mathbf{m}_{s}$. This could be extended to suppression for better coverage.
3. Inferring the head's salient operations by querying an LLM about prominent patterns in the list of salient tokens and their top mappings. Notably, we ask the model to indicate that there is no pattern when no clear pattern is observed across the mappings. For the exact prompt used, see § E.
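The three steps above can be sketched as follows, assuming `M` is the head's token-to-token mapping matrix, `saliency` holds per-token saliency scores (Eq. 3), and `vocab` maps indices to token strings; the actual LLM call is stubbed out as a prompt string, and all names are illustrative.

```python
import numpy as np

def salient_mappings(M, saliency, vocab, k=30, n=5):
    """Step 1: pick the k most salient source tokens.
    Step 2: for each, collect its n strongest target tokens under M."""
    top_sources = np.argsort(saliency)[::-1][:k]
    mappings = {}
    for s in top_sources:
        top_targets = np.argsort(M[s])[::-1][:n]
        mappings[vocab[s]] = [vocab[t] for t in top_targets]
    return mappings

def build_prompt(mappings):
    """Step 3: ask an LLM for a prominent pattern, or to say 'no pattern'."""
    lines = [f"{src} -> {', '.join(tgts)}" for src, tgts in mappings.items()]
    return ("Describe a prominent pattern across these token mappings, "
            "or answer 'no pattern' if none is apparent:\n" + "\n".join(lines))
```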
We run this pipeline on a total of 2,224 attention heads in GPT-2 xl and Pythia 6.9B, while setting $k=30$ (step 1) and $n=5$ (step 2) and using GPT-4o (Hurst et al., 2024) (step 3). We analyze how often GPT-4o was able to recognize a prominent functionality and measure the quality of its descriptions compared to human judgment.
5.2 Results
<details>
<summary>x4.png Details</summary>

### Visual Description
Two line graphs showing, per layer, the percentage of heads for which a pattern was detected (y-axis, 0 to 100) in Pythia 6.9b (layers 0 to 30) and GPT2 xl (layers 0 to 45). Both curves rise from roughly 20%-25% in the earliest layers; Pythia 6.9b peaks at about 70%-95% in the middle layers before dropping toward the final layers, while GPT2 xl climbs more gradually and stays around 70%-90% in the later layers.
</details>
Figure 4: Portion of heads where GPT-4o identified a prominent pattern across the head's salient mappings.
Figure 4 shows the percentage of heads per layer in GPT-2 xl and Pythia 6.9B where GPT-4o described a pattern. In both models, we observe a high rate of 60%-96% interpretable heads in the middle and upper layers, compared to a lower rate of 20%-60% in the early and last layers. These trends are consistent with those observed for predefined relations (§ 4), suggesting that early-layer heads are less interpretable in the vocabulary space. Qualitative analysis of 107 heads with identified patterns shows diverse operations: 38% semantic (e.g., extension of time periods: day->month; month->year; year->decade), 36% algorithmic (e.g., capitalization: water->Water), and 26% linguistic (e.g., completion of sub-words: inhib->inhibition; resil->resilience). Examples of salient mappings and their interpretations are provided in § E.
Interpretation quality
We conduct a human study to assess the plausibility of the generated descriptions, finding that GPT-4o correctly identifies the presence or absence of a pattern in 80% of the cases and reliably detects observable patterns. This shows that our automatic pipeline is reasonable and demonstrates promising trends in automatically interpreting attention heads with MAPS. For more details on this study and its results, see § E.
6 Related Work
Prior studies of attention heads in LLMs mostly focused on analyzing their attention patterns Voita et al. (2019); Clark et al. (2019); Vig and Belinkov (2019), training probes and sparse auto-encoders Kissane et al. (2024), studying head outputs, and performing causal interventions (see survey by Zheng et al., 2024). Unlike these methods, MAPS infers the functionality of attention heads from their parameters, without any training or inference.
Vocabulary projections of attention head parameters have been used for analyzing certain attention head operations in LLMs Wang et al. (2023); McDougall et al. (2024); Kim et al. (2024); García-Carrasco et al. (2024); Elhage et al. (2021). However, they have been used mostly as a validation tool for operations inferred by other methods and were applied to specific relations and heads, typically in the scope of specific circuits. Gould et al. (2024) studied a single relation across all heads of multiple LLMs. Our work proposes a general framework that uses vocabulary projections as its primary tool for inferring attention head functionality.
Millidge and Black (2022) utilized an LLM to interpret the vocabulary projections of singular vectors of attention heads and MLP matrices, but their approach does not consider input-output mappings, which are essential for estimating head functionality. More recently, Merullo et al. (2024b) used parameter similarities of heads at different layers to study their "communication channels". Lastly, Hernandez et al. (2024) showed that relation operations of attention heads can be well-approximated by linear functions. Our work further shows that some of these relations are implemented by mappings encoded in head parameters.
7 Conclusion
We present MAPS, an efficient framework for analyzing the functionality of attention heads from their parameters. The utility of MAPS is twofold: it allows mapping how strongly a given operation is implemented across the heads of a model, and inferring the salient operations of a given head. Experiments show that estimations by MAPS correlate with head outputs during inference and causally relate to the model's behavior. Moreover, strong LLMs can interpret them automatically, often aligning with human judgment. Our analysis provides insights into architecture biases on function encoding and function universality in LLMs.
Limitations
MAPS primarily focuses on analyzing the part of the head's computation that writes the output to the residual stream, i.e., the matrix $W_{VO}$. In other words, we use single-token mappings to analyze the operation of the output part of the head on contextualized representations $\tilde{X}$. While our experiments in § 4.3 show that these estimations generalize to multi-token inputs, it is still valuable to examine the head's computation responsible for contextualization and for creating $\tilde{X}$, i.e., the matrix $W_{QK}$.
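For concreteness, a minimal sketch of the part MAPS does analyze: projecting the head's $W_{VO} = W_V W_O$ matrix into the vocabulary space with the embedding and unembedding matrices to obtain a token-to-token mapping. The shapes and the absence of any normalization are simplifying assumptions, not the paper's exact recipe.

```python
import numpy as np

def token_mapping(E, W_V, W_O, U):
    """E: (vocab, d) embeddings, W_V: (d, d_head), W_O: (d_head, d),
    U: (d, vocab) unembedding. Returns M of shape (vocab, vocab), where
    M[s, t] scores how strongly reading token s promotes writing token t."""
    W_VO = W_V @ W_O  # the head's low-rank interaction matrix, shape (d, d)
    return E @ W_VO @ U
```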
Another limitation of MAPS is that its expressivity is bounded by the modelâs vocabulary. Namely, it can only map operations that can be expressed via pairs of tokens. While this formulation can effectively describe and capture various features, as demonstrated by our experiments in § 4 and § 5, there are likely to be operations that this framework would overlook, such as idioms and positional features. A related challenge is the lower coverage of MAPS in early layers, where the model may not yet operate in the output vocabulary space, but instead computes general-purpose features to be used by later layers. Extending MAPS to support other types of representations is a promising direction to overcome these limitations, as well as exploring methods such as linear mappings Yom Din et al. (2024) and patching Ghandeharioun et al. (2024) to improve the performance on early layers.
Lastly, MAPS relies on the formulation of attention heads as interaction matrices (§ 2), which ignores the bias terms of $W_{V}$ and $W_{O}$. While our experiments show a strong correlation between the estimations by MAPS and head outputs, these bias terms may still influence the outputs. Incorporating them into the analysis is an interesting direction, which we leave for future work.
Acknowledgments
We thank Guy Dar, Daniela Gottesman, Ohav Barbi, Ori Yoran, Yoav Gur-Arieh and Samuel Amouyal who helped with analysis and provided useful feedback. This research was supported in part by The Israel Science Foundation grant 1083/24.
References
- Ainslie et al. (2023) Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. 2023. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapore. Association for Computational Linguistics.
- Arditi et al. (2024) Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.
- Azaria and Mitchell (2023) Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430. PMLR.
- Bohnet et al. (2022) Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, et al. 2022. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037.
- Bolukbasi et al. (2021) Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. 2021. An interpretability illusion for bert. ArXiv preprint, abs/2104.07143.
- Clark et al. (2019) Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.
- Dar et al. (2023) Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. 2023. Analyzing transformers in embedding space. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16124–16170, Toronto, Canada. Association for Computational Linguistics.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. ArXiv preprint, abs/2407.21783.
- Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12.
- Ferrando et al. (2024) Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R. Costa-jussà. 2024. A primer on the inner workings of transformer-based language models. ArXiv preprint, abs/2405.00208.
- Gao et al. (2024) Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2024. Scaling and evaluating sparse autoencoders. ArXiv preprint, abs/2406.04093.
- García-Carrasco et al. (2024) Jorge García-Carrasco, Alejandro Maté, and Juan C. Trujillo. 2024. How does GPT-2 predict acronyms? Extracting and understanding a circuit via mechanistic interpretability. In International Conference on Artificial Intelligence and Statistics, 2-4 May 2024, Palau de Congressos, Valencia, Spain, volume 238 of Proceedings of Machine Learning Research, pages 3322–3330. PMLR.
- Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Ghandeharioun et al. (2024) Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. 2024. Patchscopes: A unifying framework for inspecting hidden representations of language models. In Forty-first International Conference on Machine Learning.
- Gould et al. (2024) Rhys Gould, Euan Ong, George Ogden, and Arthur Conmy. 2024. Successor heads: Recurring, interpretable attention heads in the wild. In The Twelfth International Conference on Learning Representations.
- Gur-Arieh et al. (2025) Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, and Mor Geva. 2025. Enhancing automated interpretability with output-centric feature descriptions. arXiv preprint arXiv:2501.08319.
- Hernandez et al. (2024) Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. 2024. Linearity of relation decoding in transformer language models. In The Twelfth International Conference on Learning Representations.
- Hurst et al. (2024) Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. ArXiv preprint, abs/2410.21276.
- Javaheripi and Bubeck (2023) Mojan Javaheripi and Sébastien Bubeck. 2023. Phi-2: The surprising power of small language models.
- Katz et al. (2024) Shahar Katz, Yonatan Belinkov, Mor Geva, and Lior Wolf. 2024. Backward lens: Projecting language model gradients into the vocabulary space. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2390–2422, Miami, Florida, USA. Association for Computational Linguistics.
- Kim et al. (2024) Geonhee Kim, Marco Valentino, and André Freitas. 2024. A mechanistic interpretation of syllogistic reasoning in auto-regressive language models. ArXiv preprint, abs/2408.08590.
- Kissane et al. (2024) Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, and Neel Nanda. 2024. Interpreting attention layer outputs with sparse autoencoders. In ICML 2024 Workshop on Mechanistic Interpretability.
- Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations.
- Li et al. (2015) Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. 2015. Convergent learning: Do different neural networks learn the same representations? In Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, volume 44 of Proceedings of Machine Learning Research, pages 196–212, Montreal, Canada. PMLR.
- Loper and Bird (2002) Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 63–70, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- McDougall et al. (2024) Callum Stuart McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda. 2024. Copy suppression: Comprehensively understanding a motif in language model attention heads. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 337–363, Miami, Florida, US. Association for Computational Linguistics.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
- Merullo et al. (2024a) Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2024a. Circuit component reuse across tasks in transformer language models. In The Twelfth International Conference on Learning Representations.
- Merullo et al. (2024b) Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2024b. Talking heads: Understanding inter-layer communication in transformer language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Millidge and Black (2022) Beren Millidge and Sid Black. 2022. The singular value decompositions of transformer weight matrices are highly interpretable.
- Nanda and Bloom (2022) Neel Nanda and Joseph Bloom. 2022. Transformerlens. https://github.com/TransformerLensOrg/TransformerLens.
- nostalgebraist (2020) nostalgebraist. 2020. Interpreting gpt: the logit lens.
- Olsson et al. (2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In-context learning and induction heads. ArXiv preprint, abs/2209.11895.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Schober et al. (2018) Patrick Schober, Christa Boer, and Lothar A. Schwarte. 2018. Correlation coefficients: Appropriate use and interpretation. Anesthesia & Analgesia, 126:1763–1768.
- Tigges et al. (2024) Curt Tigges, Michael Hanna, Qinan Yu, and Stella Biderman. 2024. LLM circuit analyses are consistent across training and scale. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
- Vig and Belinkov (2019) Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. Association for Computational Linguistics.
- Voita et al. (2019) Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
- Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85.
- Wang et al. (2023) Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Xiao et al. (2024) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations.
- Yom Din et al. (2024) Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. 2024. Jump to conclusions: Short-cutting transformers with linear transformations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9615–9625, Torino, Italia. ELRA and ICCL.
- Yu et al. (2024) Lei Yu, Meng Cao, Jackie CK Cheung, and Yue Dong. 2024. Mechanistic understanding and mitigation of language model non-factual hallucinations. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7943–7956, Miami, Florida, USA. Association for Computational Linguistics.
- Zheng et al. (2024) Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, and Zhiyu Li. 2024. Attention heads of large language models: A survey. ArXiv preprint, abs/2409.03752.
- Zhou et al. (2024) Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, and Yongbin Li. 2024. On the role of attention heads in large language model safety. ArXiv preprint, abs/2410.13708.
Appendix A Mapping Predefined Relations – Additional Details and Results
In § 4, we showed how MAPS can be utilized to map all heads that implement a predefined relation across a language model. Here we offer further details on the datasets and implementation, as well as supplementary results.
A.1 Datasets
| Category | Relation | Example mappings | Llama-3.1 | Pythia | GPT-2 / Phi-2 |
| --- | --- | --- | --- | --- | --- |
| Algorithmic | Copying | (ottawa, ottawa), (say, say) | 450 | 432 | 436 |
| | Name copying | (Mallory, Mallory), (Walt, Walt) | 134 | 113 | 132 |
| | Word to first letter | (bend, b), (past, p) | 238 | 237 | 238 |
| | Word to last letter | (bend, d), (past, t) | 238 | 237 | 238 |
| | Year to following | (1728, 1729), (1958, 1959) | | 147 | 133 |
| Knowledge | Country to capital | (Bulgaria, Sofia), (Chile, Santiago) | 45 | 32 | 43 |
| | Country to language | (Laos, Lao), (Denmark, Danish) | 51 | 37 | 48 |
| | Object to superclass | (tiger, animal), (carp, fish) | 62 | 46 | 65 |
| | Product by company | (Xbox, Microsoft), (Bravia, Sony) | 39 | | 40 |
| | Work to location | (farmer, farm), (chef, kitchen) | 48 | 34 | 45 |
| Linguistic | Adj to comparative | (big, bigger), (high, higher) | 47 | 44 | 48 |
| | Adj to superlative | (angry, angriest), (high, highest) | 39 | | 41 |
| | Noun to pronoun | (viewers, they), (Anna, she) | 257 | 238 | 253 |
| | Verb to past tense | (ask, asked), (eat, ate) | 110 | 112 | 112 |
| | Word to antonym | (love, hate), (right, wrong) | 91 | 88 | 92 |
| | Word to compound | (hot, hotdog), (wall, wallpaper) | 38 | | 36 |
| | Word to homophone | (steal, steel), (sea, see) | 103 | 88 | 91 |
| | Word to synonym | (vague, obscure), (ill, sick) | 154 | 142 | 154 |
| Translation | English to French | (cat, chat), (love, amour) | 32 | | |
| | English to Spanish | (cat, gato), (love, amor) | 34 | | |
Table 3: Datasets used for inspecting predefined operations in models with different tokenizers. Each model column gives the dataset sizes under that model's tokenizer; different tokenizers lead to differences between datasets. We discard datasets that were left with $\leq 30$ single-token mappings after tokenization.
Table 4: Sources for constructing per-relation datasets used in § 4.
We display the list of categories and relations used to map predefined relations (§ 4), alongside the sizes of the different datasets and examples of relation pairs, in Table 3.
Data collection
We obtained the relation pairs from the following sources: WikiData (Vrandečić and Krötzsch, 2014); the "English Word Frequency List" Kaggle dataset (https://www.kaggle.com/datasets/wheelercode/english-word-frequency-list), which is based on the Google Books Ngram Viewer Exports, version 3, exported on Feb 17, 2020 (https://storage.googleapis.com/books/ngrams/books/datasetsv3.html); the datasets used by Hernandez et al. (2024), which are based on CounterFact (Meng et al., 2022) and WikiData; and ChatGPT (https://chatgpt.com/). We also used the nltk package (Loper and Bird, 2002) to validate several relation datasets. Except for the translation and year-to-following datasets, all datasets are in English. Table 4 details which source was used to compose each relation dataset.
In the datasets for the relations work to location, verb to past tense, product by company, object to superclass, adj to superlative, adj to comparative, and word to antonym, we filtered out pairs whose source token also appears as the source token of another pair. Incorrect relation pairs were filtered out of the different datasets to ensure their correctness.
Data processing
For every model, we tokenized the various datasets using the model's tokenizer. To maximize the number of words mapped to single tokens, we added a leading space before every word. For example, if the relation source word was "Don", we tokenized the string " Don" instead. Finally, we filtered out relation pairs where at least one of the words was mapped to more than one token.
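This filtering step can be sketched as follows. The `toy_tokenize` function below is a hypothetical stand-in for an actual model tokenizer (e.g., one loaded via the `transformers` library), used here only to make the example self-contained:

```python
def filter_single_token_pairs(pairs, tokenize):
    """Keep only relation pairs (src, tgt) where both ' src' and ' tgt'
    map to exactly one token under the given tokenizer."""
    kept = []
    for src, tgt in pairs:
        # Prepend a leading space so the mid-sentence word form is used.
        src_ids = tokenize(" " + src)
        tgt_ids = tokenize(" " + tgt)
        if len(src_ids) == 1 and len(tgt_ids) == 1:
            kept.append((src, tgt))
    return kept

# Toy vocabulary for illustration: known strings get a single id,
# anything else is split into two ids (i.e., not a single token).
vocab = {" Bulgaria": 0, " Sofia": 1, " Chile": 2}
toy_tokenize = lambda s: [vocab[s]] if s in vocab else [hash(s) % 100, 99]

pairs = [("Bulgaria", "Sofia"), ("Chile", "Santiago")]
print(filter_single_token_pairs(pairs, toy_tokenize))
# → [('Bulgaria', 'Sofia')]  (" Santiago" is not a single token here)
```

With a real tokenizer, `tokenize` would be e.g. `lambda s: tok(s)["input_ids"]` (without special tokens).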
A.2 Implementation Details
Applying the first MLP
For every model except Llama-3.1 70B, and similarly to Wang et al. (2023); Gould et al. (2024), we first applied the model's first MLP to the token embeddings. Notably, we did not apply the first MLP when analyzing heads from the models' first layers (layer 0), since the first attention layer precedes the first MLP in the computation. To adjust the embeddings to the first MLP's input distribution, we also applied the layer norm that precedes it. For Llama-3.1 70B, we observed better results without applying the first MLP.
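A minimal numpy sketch of this pre-processing step, with placeholder layer-norm and MLP parameters (the actual models use their trained first-layer weights, and architectures such as GPT-2 use a GeLU rather than the ReLU used here):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer norm over the last dimension."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def apply_first_mlp(emb, ln_gamma, ln_beta, w_in, w_out):
    """Adjust token embeddings by the pre-MLP layer norm, then add the
    first MLP's output back to the residual stream (sketch)."""
    h = layer_norm(emb, ln_gamma, ln_beta)
    mlp_out = np.maximum(h @ w_in, 0.0) @ w_out  # ReLU in place of GeLU
    return emb + mlp_out

# Toy dimensions; real models use their trained parameters here.
rng = np.random.default_rng(0)
d, d_mlp, n_tok = 8, 32, 5
emb = rng.normal(size=(n_tok, d))
out = apply_first_mlp(emb, np.ones(d), np.zeros(d),
                      rng.normal(size=(d, d_mlp)), rng.normal(size=(d_mlp, d)))
print(out.shape)  # → (5, 8)
```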
Selection of $k$
To calculate a head's relation score $\phi_{R}(M)$, we obtain the top-$k$ tokens in $\mathbf{m}_{s}$ for every source token $s$. For Pythia, GPT-2, and Phi-2 we set $k=1$ for the copying and name-copying relations and $k=10$ for all other relations. For the Llama-3.1 models we set $k=3$ for copying and name-copying and $k=25$ for other relations. Table 5, which presents the tokenizations applied to several base words by the tokenizers of Llama-3.1, GPT-2, and Pythia, demonstrates the need for larger $k$ values for Llama-3.1: its larger vocabulary allows the tokenizer to express the same concept with more tokens.
| Word | Llama-3.1 | Pythia | GPT-2 |
| --- | --- | --- | --- |
| Hello | >Hello, Hello, _hello, Ġhello, hello, ĠHello, Hallo, Bonjour, Hola | Hello, Ġhello, hello, ĠHello | hello, ĠHello, Ġhello, Hello |
| Please | Please, Ġplease, please, ĠPLEASE, ĠPlease, .Please, PLEASE, >Please, Bitte, ĠBITTE, ĠBitte, Ġbitte | Please, please, Ġplease, ĠPlease | Please, Ġplease, ĠPlease, ĠPLEASE, please |
| Love | ĠLOVE, love, loven, Ġlove, Love, ĠLove, ĠLiebe, Ġliebe, Ġamour, Ġamore, Ġamor | love, ĠLOVE, Love, Ġlove, ĠLove | Ġlove, love, ĠLove, Love, ĠLOVE |
| Water | -water, _WATER, ĠWater, _water, water, Ġwater, Water, ĠWATER, .water, ĠWasser, ’eau, agua, Ġagua | Water, Ġwater, water, ĠWater, agua | Water, water, Ġwater, ewater, ĠWater |
| School | ĠSCHOOL, -school, schools, Ġschool, _school, school, ĠSchool, .school, School | School, Ġschool, school, ĠSchool | ĠSchool, Ġschool, school, ĠSCHOOL, School |
Table 5: Different tokenizations for base words by the tokenizers of Llama-3.1, Pythia, and GPT-2. The "Ġ" symbol represents a leading space. We observe that Llama-3.1's larger vocabulary allows expressing every base word with more tokens.
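Given these $k$ values, the relation score reduces to a top-$k$ membership test over the rows $\mathbf{m}_{s}$. A sketch, assuming the rows have already been projected to vocabulary logits (the toy numbers below are illustrative, not from any model):

```python
import numpy as np

def relation_score(logits, targets, k):
    """Fraction of relation pairs whose target token id appears in the
    top-k tokens of the head's vocabulary projection for the source.
    logits: (num_pairs, vocab_size) array of rows m_s.
    targets: list of target token ids, one per pair."""
    # Indices of the k largest logits in each row (order within top-k is arbitrary).
    topk = np.argpartition(-logits, k - 1, axis=-1)[:, :k]
    hits = [t in row for t, row in zip(targets, topk)]
    return float(np.mean(hits))

# Toy example: 3 pairs, a vocabulary of 6 tokens, k = 2.
logits = np.array([
    [0.1, 0.9, 0.8, 0.0, 0.0, 0.0],  # target 1 in top-2  -> hit
    [0.5, 0.0, 0.0, 0.6, 0.4, 0.0],  # target 4 not in top-2 -> miss
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.9],  # target 5 in top-2  -> hit
])
print(relation_score(logits, targets=[1, 4, 5], k=2))  # → 0.666...
```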
A.3 Random Baselines
A concern that may arise from choosing a relatively small relation score threshold is that the results obtained by MAPS may capture the similarity of token embeddings, rather than a functionality implemented by the attention heads' weights. To study this, we applied MAPS to randomly initialized matrices sampled from the empirical distribution of the model. Concretely, for every layer in the original model, we sampled $H$ random matrices (with the same shape as $W_{VO}$) from a normal distribution whose mean and standard deviation are those of the $W_{VO}$ matrices in the original layer. We applied our predefined relation analysis (described in § 4.1) to those matrices and measured the number of "functional attention heads" classified among them.
For Phi-2, Pythia 6.9B, Pythia 12B, Llama-3.1 8B, and Llama-3.1 70B, no random matrices were classified as relation heads. For GPT-2 xl, 5 random matrices were classified as such, compared to 250 relation heads among the 1,200 heads of the trained model. This demonstrates that the choice of $\tau=15\%$ is meaningful for separating the functionalities of trained attention heads from random ones. While smaller thresholds could also have been justified by this experiment, we chose $\tau=15\%$ to ensure that the heads encode a substantial fraction of the relation pairs.
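The baseline construction can be sketched as follows, assuming the trained $W_{VO}$ matrices of a layer are stacked into an array of shape `(H, d, d)` (the array here is a toy stand-in for real model weights):

```python
import numpy as np

def random_baseline_matrices(wvo_per_layer, rng):
    """Sample H random matrices matching the empirical mean and standard
    deviation of one layer's trained W_VO matrices, as a null model."""
    mu = wvo_per_layer.mean()
    sigma = wvo_per_layer.std()
    return rng.normal(mu, sigma, size=wvo_per_layer.shape)

rng = np.random.default_rng(0)
# Toy "trained" layer: H=12 heads with d=64; real values come from the model.
trained = rng.normal(0.3, 0.02, size=(12, 64, 64))
fake = random_baseline_matrices(trained, rng)
print(fake.shape)  # → (12, 64, 64)
```

The same relation-score analysis is then run on `fake` in place of the trained matrices; any "relation heads" found there would indicate the score reflects embedding similarity rather than head functionality.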
A.4 Additional Results
In Figure 5 we display all heads classified in Llama-3.1 70B, Llama-3.1 8B, Pythia 12B, Pythia 6.9B, Phi-2, and GPT-2 xl, divided into four categories. In Tables 6 and 7 we present the number of relation heads (and suppression relation heads) discovered in the same models, broken down by relation. We observe that several relations (name copying, adj to comparative, word to first letter) are implemented by a relatively large number of heads in at least five of the six models. On the other hand, several relations (e.g., word to homophone, word to last letter) are implemented by a small number of heads across all models.
| Category | Relation | GPT-2 xl | Phi-2 | Pythia 6.9B | Pythia 12B | Llama-3.1 8B | Llama-3.1 70B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Algorithmic | Copying | 35 | 15 | 11 | 9 | 2 | 1 |
| | Name copying | 71 | 25 | 27 | 23 | 3 | 14 |
| | Word to first letter | 4 | 5 | 13 | 13 | 15 | 19 |
| | Word to last letter | 0 | 1 | 2 | 1 | 2 | 2 |
| | Year to following | 47 | 16 | 14 | 22 | | |
| Knowledge | Country to capital | 60 | 17 | 26 | 31 | 5 | 26 |
| | Country to language | 50 | 23 | 24 | 30 | 5 | 28 |
| | Object to superclass | 17 | 12 | 11 | 19 | 0 | 13 |
| | Product by company | 24 | 4 | | | 1 | 3 |
| | Work to location | 10 | 6 | 6 | 8 | 0 | 5 |
| Linguistic | Adj to comparative | 45 | 47 | 27 | 28 | 8 | 25 |
| | Adj to superlative | 23 | 23 | | | 10 | 21 |
| | Noun to pronoun | 14 | 13 | 13 | 16 | 8 | 12 |
| | Verb to past tense | 15 | 27 | 17 | 28 | 8 | 18 |
| | Word to antonym | 12 | 15 | 11 | 15 | 5 | 11 |
| | Word to compound | 1 | 1 | | | 2 | 5 |
| | Word to homophone | 0 | 0 | 0 | 0 | 0 | 2 |
| | Word to synonym | 7 | 7 | 3 | 7 | 1 | 2 |
| Translation | English to French | | | | | 0 | 2 |
| | English to Spanish | | | | | 3 | 10 |
Table 6: Number of heads implementing each of the relations across different models.
| Category | Relation | GPT-2 xl | Phi-2 | Pythia 6.9B | Pythia 12B | Llama-3.1 8B | Llama-3.1 70B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Algorithmic | Copying | 8 | 7 | 5 | 7 | 0 | 2 |
| | Name copying | 23 | 9 | 9 | 7 | 3 | 8 |
| | Word to first letter | 0 | 2 | 2 | 0 | 9 | 11 |
| | Word to last letter | 0 | 0 | 2 | 2 | 1 | 3 |
| | Year to following | 5 | 2 | 1 | 0 | | |
| Knowledge | Country to capital | 19 | 8 | 5 | 5 | 1 | 10 |
| | Country to language | 26 | 12 | 9 | 11 | 3 | 9 |
| | Object to superclass | 2 | 5 | 3 | 6 | 0 | 4 |
| | Product by company | 7 | 0 | | | 0 | 3 |
| | Work to location | 2 | 3 | 1 | 1 | 0 | 2 |
| Linguistic | Adj to comparative | 11 | 29 | 15 | 19 | 5 | 13 |
| | Adj to superlative | 6 | 13 | | | 5 | 10 |
| | Noun to pronoun | 1 | 2 | 2 | 4 | 4 | 7 |
| | Verb to past tense | 2 | 21 | 8 | 7 | 5 | 10 |
| | Word to antonym | 0 | 4 | 3 | 4 | 2 | 3 |
| | Word to compound | 0 | 1 | | | 2 | 3 |
| | Word to homophone | 0 | 0 | 0 | 0 | 1 | 1 |
| | Word to synonym | 0 | 2 | 0 | 1 | 0 | 1 |
| Translation | English to French | | | | | 0 | 0 |
| | English to Spanish | | | | | 2 | 7 |
Table 7: Number of suppression heads implementing each of the relations across different models.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Heatmap: Attention Head Categories Across Layers
### Overview
The image presents a series of heatmaps visualizing the distribution of attention head categories across different layers of a model. Each heatmap represents a specific category or an aggregate of categories. The x-axis represents the layer number (0-80), and the y-axis represents the attention head (0-60). The color of each data point indicates the category the attention head belongs to, as defined by the legend.
### Components/Axes
* **Titles:** The heatmaps are titled as follows: "All Categories", "Algorithmic", "Knowledge", "Linguistic", and "Translation".
* **X-axis:** Labeled "layer", with ticks at 0, 16, 32, 48, 64, and 80.
* **Y-axis:** Labeled "head", with ticks at 0, 12, 24, 36, 48, and 60.
* **Legend (located to the right of the "All Categories" heatmap):**
* Pink: "4 categories"
* Brown: "3 categories"
* Purple: "2 categories"
* Red: "Translation"
* Green: "Linguistic"
* Orange: "Knowledge"
* Blue: "Algorithmic"
* Light Gray: "Unclassified"
### Detailed Analysis
**1. All Categories:**
* This heatmap shows the distribution of all categories.
* The distribution is sparse, with most heads belonging to one category.
* There is a concentration of "Knowledge" (orange) and "Algorithmic" (blue) heads in the earlier layers (approximately layers 16-48).
* "Linguistic" (green) heads are more prevalent in the later layers (approximately layers 48-80).
* "Translation" (red) heads are sparsely distributed.
* The "4 categories" (pink), "3 categories" (brown), and "2 categories" (purple) are very sparse.
**2. Algorithmic:**
* This heatmap isolates the "Algorithmic" category (blue).
* The "Algorithmic" heads are primarily concentrated in the earlier layers (approximately layers 16-48).
* There are a few "Algorithmic" heads in the later layers, but they are less frequent.
**3. Knowledge:**
* This heatmap isolates the "Knowledge" category (orange).
* The "Knowledge" heads are also concentrated in the earlier layers (approximately layers 16-48).
* The distribution is more spread out compared to "Algorithmic".
**4. Linguistic:**
* This heatmap isolates the "Linguistic" category (green).
* The "Linguistic" heads are more prevalent in the later layers (approximately layers 48-80).
* There are fewer "Linguistic" heads in the earlier layers.
**5. Translation:**
* This heatmap isolates the "Translation" category (red).
* The "Translation" heads are sparsely distributed across all layers.
* There appears to be a slight concentration in the later layers (approximately layers 64-80).
### Key Observations
* "Algorithmic" and "Knowledge" categories are more active in the earlier layers.
* "Linguistic" category is more active in the later layers.
* "Translation" category is sparsely distributed.
* The "All Categories" heatmap shows a mix of all categories, with a clear separation of "Algorithmic/Knowledge" and "Linguistic" across layers.
* The "Unclassified" category is not explicitly visualized in the individual category heatmaps, but its presence can be inferred from the "All Categories" heatmap.
### Interpretation
The heatmaps suggest that different layers of the model specialize in different types of tasks. The earlier layers (16-48) seem to focus on "Algorithmic" and "Knowledge" related tasks, while the later layers (48-80) focus on "Linguistic" tasks. The "Translation" category appears to be more distributed, suggesting that it might be integrated across different layers.
The distribution of attention heads across layers could reflect the hierarchical nature of the model, where earlier layers learn lower-level features and later layers learn higher-level features. The concentration of "Algorithmic" and "Knowledge" heads in earlier layers might indicate that these tasks require more fundamental processing, while "Linguistic" tasks require more complex processing in later layers.
The sparsity of the "Translation" category could indicate that translation-related information is integrated across different layers, or that it is less prominent compared to other categories. The "Unclassified" category might represent attention heads that do not fall into any of the defined categories, or that are involved in more general tasks.
</details>
(a) Functionality mapping by MAPS for relations of four categories (algorithmic, knowledge, linguistic, translation) across all attention heads in Llama-3.1 70B. A head is marked for a specific category if it implements (including in a suppression variant) at least one relation from this category.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Heatmap: Category Distribution Across Layers and Heads
### Overview
The image presents a series of heatmaps visualizing the distribution of different categories (Algorithmic, Knowledge, Linguistic, Translation, and Unclassified) across various layers and heads of a model. The heatmaps are arranged side-by-side, with one showing the distribution of all categories combined and the others showing the distribution of each individual category. The x-axis represents the layer number, and the y-axis represents the head number.
### Components/Axes
* **X-axis (Layer):** Represents the layer number, ranging from 0 to 30 with increments of 6.
* **Y-axis (Head):** Represents the head number, ranging from 0 to 30 with increments of 6.
* **Heatmap Cells:** Each cell represents a specific layer and head combination. The color of the cell indicates the category or combination of categories present at that location.
* **Legend (All Categories Plot):** Located to the right of the "All Categories" heatmap.
* **Unclassified:** Not explicitly represented by a color, but implied to be the background color (light gray).
* **Algorithmic:** Blue
* **Knowledge:** Orange
* **Linguistic:** Green
* **Translation:** Red
* **2 categories:** Purple
* **3 categories:** Brown
* **4 categories:** Pink
### Detailed Analysis
**1. All Categories**
* This heatmap shows the combined distribution of all categories.
* The distribution is sparse, with most cells being unclassified (light gray).
* Several cells contain multiple categories, indicated by the purple, brown, and pink colors.
* **Specific Data Points:**
* Layer 18, Head 18: Linguistic (Green)
* Layer 24, Head 12: Translation (Red)
* Layer 24, Head 18: 3 categories (Brown)
* Layer 24, Head 24: 2 categories (Purple)
* Layer 24, Head 30: Linguistic (Green)
* Layer 30, Head 0: Algorithmic (Blue)
* Layer 30, Head 18: Linguistic (Green)
* Layer 30, Head 24: Algorithmic (Blue)
* Layer 30, Head 30: Linguistic (Green)
**2. Algorithmic**
* This heatmap shows the distribution of the "Algorithmic" category (Blue).
* The distribution is sparse, with most cells being unclassified.
* **Specific Data Points:**
* Layer 0, Head 0: Algorithmic (Blue)
* Layer 18, Head 12: Algorithmic (Blue)
* Layer 18, Head 18: Algorithmic (Blue)
* Layer 18, Head 24: Algorithmic (Blue)
* Layer 18, Head 30: Algorithmic (Blue)
* Layer 24, Head 12: Algorithmic (Blue)
* Layer 24, Head 18: Algorithmic (Blue)
* Layer 24, Head 24: Algorithmic (Blue)
* Layer 30, Head 0: Algorithmic (Blue)
* Layer 30, Head 24: Algorithmic (Blue)
**3. Knowledge**
* This heatmap shows the distribution of the "Knowledge" category (Orange).
* The distribution is sparse, with most cells being unclassified.
* **Specific Data Points:**
* Layer 6, Head 18: Knowledge (Orange)
* Layer 18, Head 0: Knowledge (Orange)
* Layer 18, Head 18: Knowledge (Orange)
* Layer 18, Head 24: Knowledge (Orange)
* Layer 24, Head 18: Knowledge (Orange)
**4. Linguistic**
* This heatmap shows the distribution of the "Linguistic" category (Green).
* The distribution is relatively more dense compared to other categories.
* **Specific Data Points:**
* Layer 0, Head 18: Linguistic (Green)
* Layer 0, Head 24: Linguistic (Green)
* Layer 6, Head 18: Linguistic (Green)
* Layer 12, Head 18: Linguistic (Green)
* Layer 12, Head 24: Linguistic (Green)
* Layer 12, Head 30: Linguistic (Green)
* Layer 18, Head 0: Linguistic (Green)
* Layer 18, Head 12: Linguistic (Green)
* Layer 18, Head 18: Linguistic (Green)
* Layer 18, Head 24: Linguistic (Green)
* Layer 18, Head 30: Linguistic (Green)
* Layer 24, Head 0: Linguistic (Green)
* Layer 24, Head 18: Linguistic (Green)
* Layer 24, Head 30: Linguistic (Green)
* Layer 30, Head 18: Linguistic (Green)
* Layer 30, Head 30: Linguistic (Green)
**5. Translation**
* This heatmap shows the distribution of the "Translation" category (Red).
* The distribution is very sparse, with only a few cells classified.
* **Specific Data Points:**
* Layer 12, Head 24: Translation (Red)
* Layer 18, Head 30: Translation (Red)
* Layer 24, Head 30: Translation (Red)
### Key Observations
* The "Linguistic" category appears to be the most prevalent, with a relatively dense distribution across layers and heads.
* The "Translation" category is the least prevalent, with only a few occurrences.
* Several layer-head combinations contain multiple categories, suggesting that these locations are involved in processing multiple types of information.
* The distributions of individual categories are sparse, indicating that each category is primarily associated with specific layers and heads.
### Interpretation
The heatmaps provide insights into how different categories of information are processed within the model. The varying distributions suggest that different layers and heads specialize in processing specific types of information. The presence of multiple categories in some layer-head combinations indicates that these locations may be involved in integrating information from different categories. The relative prevalence of the "Linguistic" category suggests that the model is heavily focused on processing linguistic information. The sparsity of the "Translation" category may indicate that the model relies on other categories to perform translation tasks or that translation-specific processing is concentrated in a few specific locations.
</details>
(b) Functionality mapping by MAPS for Llama-3.1 8B.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Heatmap: Category Distribution Across Layers and Heads
### Overview
The image consists of four heatmaps arranged horizontally. Each heatmap visualizes the distribution of categories across different layers and heads of a model. The first heatmap, "All Categories," shows the combined distribution of all categories, while the subsequent heatmaps ("Algorithmic," "Knowledge," and "Linguistic") display the distribution of individual categories. The heatmaps share the same axes: "layer" on the x-axis and "head" on the y-axis. A legend is provided next to the "All Categories" heatmap to indicate the color-coding for each category.
### Components/Axes
* **X-axis (Layer):** Represents the layer number, ranging from 0 to 35, with tick marks at intervals of 7.
* **Y-axis (Head):** Represents the head number, ranging from 0 to 40, with tick marks at intervals of 8.
* **Heatmaps:** Each heatmap is a grid of cells, where each cell's color indicates the category or combination of categories present at a specific layer and head.
* **Legend (Located to the right of the "All Categories" heatmap):**
* **Brown:** "3 categories"
* **Purple:** "2 categories"
* **Green:** "Linguistic"
* **Orange:** "Knowledge"
* **Blue:** "Algorithmic"
* **Light Gray:** "Unclassified"
### Detailed Analysis
**1. All Categories Heatmap:**
* This heatmap shows a mix of all categories.
* There are regions with single categories, combinations of two categories (purple), and combinations of three categories (brown).
* The distribution appears relatively uniform across layers and heads, with some concentrations of specific categories in certain areas.
**2. Algorithmic Heatmap:**
* This heatmap shows the distribution of the "Algorithmic" category (blue).
* The "Algorithmic" category is sparsely distributed across layers and heads.
* There are no clear patterns or concentrations of the "Algorithmic" category.
**3. Knowledge Heatmap:**
* This heatmap shows the distribution of the "Knowledge" category (orange).
* The "Knowledge" category is more concentrated in the middle layers (around layer 16 to 32) and heads (around head 8 to 24).
* There are fewer instances of the "Knowledge" category in the earlier and later layers.
**4. Linguistic Heatmap:**
* This heatmap shows the distribution of the "Linguistic" category (green).
* The "Linguistic" category is distributed across layers and heads, with some concentrations in the earlier layers (around layer 0 to 16).
* There are fewer instances of the "Linguistic" category in the later layers.
### Key Observations
* The "All Categories" heatmap provides an overview of the combined distribution of all categories.
* The "Algorithmic" category is sparsely distributed.
* The "Knowledge" category is concentrated in the middle layers and heads.
* The "Linguistic" category is concentrated in the earlier layers.
* The "Unclassified" category is not explicitly shown in its own heatmap, but its presence can be inferred from the "All Categories" heatmap in areas where no other categories are present.
### Interpretation
The heatmaps visualize the distribution of different categories across the layers and heads of a model. The distribution patterns suggest that different layers and heads may be specialized for processing different types of information. For example, the concentration of the "Knowledge" category in the middle layers and heads may indicate that these layers are responsible for processing knowledge-related information. Similarly, the concentration of the "Linguistic" category in the earlier layers may indicate that these layers are responsible for processing linguistic information. The sparse distribution of the "Algorithmic" category may suggest that this category is less important for the model's overall performance. The presence of combinations of categories in the "All Categories" heatmap indicates that some layers and heads may be involved in processing multiple types of information.
</details>
(c) Functionality mapping by MAPS for Pythia 12B.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Heatmap: Category Distribution Across Layers and Heads
### Overview
The image presents four heatmaps displaying the distribution of different categories across layers and heads of a model. The first heatmap shows "All Categories," while the subsequent heatmaps focus on "Algorithmic," "Knowledge," and "Linguistic" categories individually. The heatmaps use color to indicate the presence of a category at a specific layer and head combination.
### Components/Axes
* **Titles:** "All Categories," "Algorithmic," "Knowledge," "Linguistic"
* **Y-axis:** "head" with tick marks at 0, 6, 12, 18, 24, and 30.
* **X-axis:** "layer" with tick marks at 0, 6, 12, 18, 24, and 30.
* **Legend (located to the right of the "All Categories" heatmap):**
* Brown: "3 categories"
* Purple: "2 categories"
* Green: "Linguistic"
* Orange: "Knowledge"
* Blue: "Algorithmic"
* Gray: "Unclassified" (This is the background color of the heatmaps)
### Detailed Analysis
**1. All Categories Heatmap:**
This heatmap shows a mix of all categories.
* Brown squares (3 categories) are sparsely distributed.
* Purple squares (2 categories) are also sparsely distributed.
* Green squares (Linguistic) are scattered throughout the heatmap.
* Orange squares (Knowledge) are scattered throughout the heatmap.
* Blue squares (Algorithmic) are scattered throughout the heatmap.
**2. Algorithmic Heatmap:**
This heatmap shows the distribution of the "Algorithmic" category (blue squares).
* Blue squares are present across all layers and heads, but are not densely packed.
* There appears to be a slightly higher concentration of blue squares in the lower layers (layer 18-30).
**3. Knowledge Heatmap:**
This heatmap shows the distribution of the "Knowledge" category (orange squares).
* Orange squares are present across all layers and heads, but are not densely packed.
* There appears to be a slightly higher concentration of orange squares in the middle layers (layer 6-18).
**4. Linguistic Heatmap:**
This heatmap shows the distribution of the "Linguistic" category (green squares).
* Green squares are present across all layers and heads, but are not densely packed.
* The distribution appears relatively uniform across layers and heads.
### Key Observations
* The "All Categories" heatmap confirms that the other three categories ("Algorithmic," "Knowledge," and "Linguistic") are present in the combined view.
* The individual heatmaps show the specific distribution of each category.
* The heatmaps are sparse, indicating that most layer/head combinations are not strongly associated with a single category.
### Interpretation
The heatmaps visualize the distribution of different categories across the layers and heads of a model. The sparsity of the heatmaps suggests that individual layer/head combinations are not strongly specialized for a single category. The "All Categories" heatmap provides a combined view, while the individual heatmaps allow for a more detailed analysis of each category's distribution. The slight variations in concentration across layers for "Algorithmic" and "Knowledge" might indicate some degree of specialization at different depths of the model. The presence of "2 categories" and "3 categories" indicates overlap between the categories.
</details>
(d) Functionality mapping by MAPS for Pythia 6.9B.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Heatmap: Category Distribution Across Layers and Heads
### Overview
The image presents four heatmaps displaying the distribution of different categories across layers and heads. The first heatmap, "All Categories," shows the combined distribution of all categories, while the subsequent heatmaps ("Algorithmic," "Knowledge," and "Linguistic") show the individual distributions of each category. The heatmaps are arranged horizontally, sharing the same axes.
### Components/Axes
* **Titles:** "All Categories," "Algorithmic," "Knowledge," "Linguistic"
* **Y-axis:** "head," with ticks at 0, 6, 12, 18, 24, and 30.
* **X-axis:** "layer," with ticks at 0, 6, 12, 18, 24, and 30.
* **Legend (located to the right of the "All Categories" heatmap):**
* Brown: 3 categories
* Purple: 2 categories
* Green: Linguistic
* Orange: Knowledge
* Blue: Algorithmic
* Gray: Unclassified (This is the background color)
### Detailed Analysis
**1. All Categories:**
* This heatmap shows a mix of all categories.
* There are instances where 2 or 3 categories overlap, indicated by purple and brown squares, respectively.
* The distribution appears relatively even across layers and heads, with some concentrations in specific areas.
**2. Algorithmic:**
* The "Algorithmic" category (blue) is sparsely distributed.
* There are a few clusters of "Algorithmic" instances, particularly around layer 24 and head 18.
* Most of the heatmap is gray, indicating "Unclassified."
**3. Knowledge:**
* The "Knowledge" category (orange) is also sparsely distributed.
* There are a few clusters of "Knowledge" instances, particularly around layer 18 and head 6.
* Most of the heatmap is gray, indicating "Unclassified."
**4. Linguistic:**
* The "Linguistic" category (green) is more densely distributed compared to "Algorithmic" and "Knowledge."
* There are several clusters of "Linguistic" instances, particularly in the upper-right quadrant (higher layers and lower heads).
* Most of the heatmap is gray, indicating "Unclassified."
### Key Observations
* The "Linguistic" category appears to be the most prevalent among the three categories shown.
* The "Algorithmic" and "Knowledge" categories are sparsely distributed.
* There are instances where multiple categories overlap, as indicated in the "All Categories" heatmap.
* The majority of the heatmap area is "Unclassified," suggesting that these categories do not dominate the overall distribution.
### Interpretation
The heatmaps provide a visual representation of how different categories are distributed across layers and heads. The "Linguistic" category seems to be more prominent, while "Algorithmic" and "Knowledge" are less frequent. The overlapping categories in the "All Categories" heatmap suggest that some layers and heads may be responsible for processing multiple types of information. The "Unclassified" areas indicate that there are other categories or types of information not represented in these heatmaps. This analysis could be used to understand how different types of information are processed within a model or system, and how the processing is distributed across different layers and heads.
</details>
(e) Functionality mapping by MAPS for Phi-2.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Heatmap: Category Distribution Across Layers and Heads
### Overview
The image presents four heatmaps visualizing the distribution of different categories (Algorithmic, Knowledge, Linguistic, and combinations thereof) across different layers and heads of a model. The first heatmap, "All Categories," shows the combined distribution, while the subsequent heatmaps isolate each individual category. The x-axis represents the layer (from 0 to 45), and the y-axis represents the head (from 0 to 25).
### Components/Axes
* **X-axis (Layer):** Represents the layer number, ranging from 0 to 45, with tick marks at approximately 0, 9, 18, 27, 36, and 45.
* **Y-axis (Head):** Represents the head number, ranging from 0 to 25, with tick marks at approximately 0, 5, 10, 15, 20, and 25.
* **Heatmaps:** Each heatmap is a 2D grid where each cell's color indicates the presence of a specific category.
* **Legend (for "All Categories" heatmap):** Located to the right of the "All Categories" heatmap.
* Brown: "3 categories"
* Purple: "2 categories"
* Green: "Linguistic"
* Orange: "Knowledge"
* Blue: "Algorithmic"
* Gray: "Unclassified" (background color)
### Detailed Analysis
**1. All Categories Heatmap:**
* This heatmap shows a mix of all categories.
* There appears to be a higher concentration of categories in the middle layers (around layer 18-36) and across all heads.
* The distribution seems relatively uniform, with no clear patterns except for the concentration in the middle layers.
* Specific data points are difficult to extract due to the mixed categories, but the overall density is visually apparent.
**2. Algorithmic Heatmap:**
* The "Algorithmic" category (blue) is distributed across all layers and heads, but appears to be more concentrated in the middle layers (18-36) and towards the lower heads (15-25).
* The distribution is somewhat sparse, with many unclassified (gray) cells.
* There is a slight upward trend in density from layer 0 to layer 36, then a slight decrease towards layer 45.
* Example: At layer 9, heads 0, 1, 4, 6, 7, 11, 12, 16, 17, 20, 21, 22, 25 are active.
* Example: At layer 36, heads 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 are active.
**3. Knowledge Heatmap:**
* The "Knowledge" category (orange) is also distributed across all layers and heads, but appears to be more concentrated in the middle layers (18-36).
* The distribution is sparse, with many unclassified (gray) cells.
* There is a slight upward trend in density from layer 0 to layer 36, then a slight decrease towards layer 45.
* Example: At layer 9, heads 7, 13, 17 are active.
* Example: At layer 36, heads 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 are active.
**4. Linguistic Heatmap:**
* The "Linguistic" category (green) is distributed across all layers and heads, but appears to be more concentrated in the middle layers (18-36).
* The distribution is sparse, with many unclassified (gray) cells.
* There is a slight upward trend in density from layer 0 to layer 36, then a slight decrease towards layer 45.
* Example: At layer 9, heads 5, 8, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 are active.
* Example: At layer 36, heads 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 are active.
### Key Observations
* All three categories (Algorithmic, Knowledge, and Linguistic) are present across all layers and heads, but with varying densities.
* The middle layers (around 18-36) tend to have a higher concentration of all categories.
* The distributions are generally sparse, indicating that many layer-head combinations are not strongly associated with any of these categories.
* The "All Categories" heatmap shows a mix of categories, making it difficult to discern individual patterns without the isolated heatmaps.
### Interpretation
The heatmaps provide insights into how different types of information (Algorithmic, Knowledge, and Linguistic) are processed within the layers and heads of a model. The concentration of all categories in the middle layers suggests that these layers might be crucial for integrating different types of information. The sparse distributions indicate that individual heads are specialized to some extent, with only a subset of heads being strongly associated with each category.
The presence of "2 categories" and "3 categories" combinations in the "All Categories" heatmap suggests that some layer-head combinations are involved in processing multiple types of information simultaneously. This could indicate that these heads are responsible for integrating different aspects of the input.
The data suggests a hierarchical processing structure, where early and late layers may handle specific aspects of the input, while the middle layers integrate these aspects to form a more comprehensive representation. The specific roles of each layer and head would require further investigation, but these heatmaps provide a valuable starting point for understanding the model's internal workings.
</details>
(f) Functionality mapping by MAPS for GPT-2 xl.
Figure 5: Functionality mapping by MAPS.
Appendix B Additional Details on Evaluation Experiment
B.1 Correlative Experiment
In § 4.2 we conducted an experiment that calculates the correlation between MAPS's estimations and heads' outputs during inference.
Implementation details
Recall that the attention head formulation we used, $Y=\tilde{X}W_{VO}$, omits the bias terms of $W_{V},W_{O}$ (§ 2). To account for the bias term of $W_{V}$ in the correlative experiment, where we compute the attention head's output dynamically, we use both the original attention head definition of Vaswani et al. (2017) and the formulation suggested by Elhage et al. (2021), which we have followed so far. First, following Vaswani et al. (2017), we obtain the head's intermediate output $\hat{y}\in\mathbb{R}^{n\times d_{\text{head}}}$, where $d_{\text{head}}$ is the inner dimension of the head, often fixed to $\frac{d}{H}$. Notably, this output already accounts for the bias term of $W_{V}$. In Vaswani et al. (2017), $\hat{y}$ is viewed as the head's final output. Then, following Elhage et al. (2021), we multiply this intermediate output by $W_{O}\in\mathbb{R}^{d_{\text{head}}\times d}$ to obtain the head's final output.
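The two-step computation above can be sketched in a few lines of NumPy. This is a minimal illustration with synthetic tensors and toy dimensions, not the released MAPS code; the shapes and variable names are assumptions for exposition.

```python
import numpy as np

# Toy dimensions: n positions, model width d, head width d_head.
n, d, d_head = 3, 16, 4
rng = np.random.default_rng(0)

X_tilde = rng.normal(size=(n, d))    # attention-weighted inputs (X~)
W_V = rng.normal(size=(d, d_head))
b_V = rng.normal(size=(d_head,))     # bias of W_V, kept in this variant
W_O = rng.normal(size=(d_head, d))

# Step 1 (Vaswani et al., 2017): intermediate output, including b_V.
y_hat = X_tilde @ W_V + b_V          # shape (n, d_head)

# Step 2 (Elhage et al., 2021): project back to the residual stream.
Y = y_hat @ W_O                      # shape (n, d)

# Without the bias, this reduces to the combined map X~ W_VO.
Y_no_bias = X_tilde @ (W_V @ W_O)
assert np.allclose(Y - b_V @ W_O, Y_no_bias)
```

Since $Y=(\tilde{X}W_V+b_V)W_O=\tilde{X}W_{VO}+b_V W_O$, the bias contributes only a constant offset $b_V W_O$ to every position.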
We use the following templates: "This is a document about $\langle s\rangle$", "No $\langle s\rangle$ means no", "The story of $\langle s\rangle$ contains", "When I think about $\langle s\rangle$ I think about".
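Given per-head static scores and their dynamically observed counterparts, the reported correlation is a straightforward per-relation computation. A minimal sketch with synthetic stand-in scores (assuming a Pearson coefficient; the array contents are illustrative, not the paper's data):

```python
import numpy as np

# Synthetic stand-ins: one score per attention head in the model.
rng = np.random.default_rng(1)
static_scores = rng.uniform(size=100)        # phi_R(h), inferred from parameters
dynamic_scores = 0.8 * static_scores + 0.2 * rng.uniform(size=100)  # phi*_R(h)

# Pearson correlation between the static and dynamic scores.
r = np.corrcoef(static_scores, dynamic_scores)[0, 1]
assert -1.0 <= r <= 1.0
```

In the actual experiment, `dynamic_scores` would come from running the model on the templates above and measuring each head's output at inference time.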
Additional results
Tables 8, 9, 10, 11, 12 present the correlation results between the static score $\phi_{R}(h)$ inferred by our method and the score $\phi^{*}_{R}(h)$ observed dynamically (both with and without contextualization), obtained for Llama-3.1 70B, Llama-3.1 8B, Pythia 12B, Pythia 6.9B, and GPT-2 xl. We also present the p-values and the maximum relation score obtained by any head in the model for the required relation. Notably, some of the lower correlations occur for relations that are not fully implemented by the model's attention heads, as indicated by the small maximum relation scores. Tables 13, 14, 15, 16, 17 present the results (following the same format) for the suppression relation scores.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.84 | 0.81 | 0.22 |
| | Name copying | 0.94 | 0.89 | 0.83 |
| | Word to first letter | 0.88 | 0.78 | 0.95 |
| | Word to last letter | 0.66 | 0.39 | 0.16 |
| Knowledge | Country to capital | 0.93 | 0.88 | 0.87 |
| | Country to language | 0.94 | 0.88 | 0.67 |
| | Object to superclass | 0.75 | 0.76 | 0.52 |
| | Product by company | 0.69 | 0.65 | 0.36 |
| | Work to location | 0.58 | 0.58 | 0.31 |
| Linguistic | Adj to comparative | 0.90 | 0.88 | 0.57 |
| | Adj to superlative | 0.90 | 0.84 | 0.67 |
| | Noun to pronoun | 0.57 | 0.41 | 0.33 |
| | Verb to past tense | 0.90 | 0.80 | 0.81 |
| | Word to antonym | 0.93 | 0.91 | 0.62 |
| | Word to compound | 0.85 | 0.82 | 0.39 |
| | Word to homophone | 0.87 | 0.80 | 0.16 |
| | Word to synonym | 0.84 | 0.79 | 0.27 |
| Translation | English to French | 0.71 | 0.68 | 0.22 |
| | English to Spanish | 0.85 | 0.83 | 0.47 |
Table 8: Correlation between the relation score of a head and the head's output in Llama-3.1 70B, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are 0.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.76 | 0.73 | 0.18 |
| | Name copying | 0.95 | 0.95 | 0.71 |
| | Word to first letter | 0.90 | 0.78 | 0.89 |
| | Word to last letter | 0.67 | 0.36 | 0.27 |
| Knowledge | Country to capital | 0.85 | 0.85 | 0.49 |
| | Country to language | 0.76 | 0.62 | 0.31 |
| | Object to superclass | 0.74 | 0.73 | 0.15 |
| | Product by company | 0.46 | 0.49 | 0.18 |
| | Work to location | 0.44 | 0.45 | 0.10 |
| Linguistic | Adj to comparative | 0.85 | 0.86 | 0.60 |
| | Adj to superlative | 0.87 | 0.89 | 0.59 |
| | Noun to pronoun | 0.89 | 0.79 | 0.57 |
| | Verb to past tense | 0.91 | 0.86 | 0.73 |
| | Word to antonym | 0.90 | 0.86 | 0.37 |
| | Word to compound | 0.78 | 0.62 | 0.21 |
| | Word to homophone | 0.85 | 0.75 | 0.08 |
| | Word to synonym | 0.79 | 0.69 | 0.17 |
| Translation | English to French | 0.71 | 0.68 | 0.12 |
| | English to Spanish | 0.82 | 0.81 | 0.29 |
Table 9: Correlation between the relation score of a head and the head's output in Llama-3.1 8B, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are $\leq$ 3.9e-128.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.89 | 0.60 | 0.42 |
| | Name copying | 0.86 | 0.57 | 0.65 |
| | Word to first letter | 0.84 | 0.62 | 0.75 |
| | Word to last letter | 0.36 | 0.17 | 0.16 |
| | Year to following | 0.90 | 0.78 | 1.00 |
| Knowledge | Country to capital | 0.93 | 0.89 | 0.97 |
| | Country to language | 0.94 | 0.89 | 0.86 |
| | Object to superclass | 0.88 | 0.87 | 0.74 |
| | Work to location | 0.75 | 0.64 | 0.29 |
| Linguistic | Adj to comparative | 0.92 | 0.80 | 0.95 |
| | Noun to pronoun | 0.85 | 0.74 | 0.50 |
| | Verb to past tense | 0.89 | 0.71 | 0.54 |
| | Word to antonym | 0.92 | 0.85 | 0.60 |
| | Word to homophone | 0.67 | 0.43 | 0.07 |
| | Word to synonym | 0.90 | 0.67 | 0.35 |
Table 10: Correlation between the relation score of a head and the head's output in Pythia 12B, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are $\leq$ 5.7e-40.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.88 | 0.45 | 0.53 |
| | Name copying | 0.94 | 0.62 | 0.96 |
| | Word to first letter | 0.87 | 0.64 | 0.67 |
| | Word to last letter | 0.44 | 0.43 | 0.27 |
| | Year to following | 0.94 | 0.79 | 0.99 |
| Knowledge | Country to capital | 0.95 | 0.91 | 0.97 |
| | Country to language | 0.91 | 0.86 | 0.84 |
| | Object to superclass | 0.88 | 0.88 | 0.72 |
| | Work to location | 0.76 | 0.68 | 0.29 |
| Linguistic | Adj to comparative | 0.91 | 0.76 | 0.77 |
| | Noun to pronoun | 0.89 | 0.67 | 0.63 |
| | Verb to past tense | 0.91 | 0.70 | 0.81 |
| | Word to antonym | 0.93 | 0.87 | 0.64 |
| | Word to homophone | 0.70 | 0.38 | 0.05 |
| | Word to synonym | 0.93 | 0.64 | 0.36 |
Table 11: Correlation between the relation score of a head and the head's output in Pythia 6.9B, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are $\leq$ 1.7e-139.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.95 | 0.65 | 0.52 |
| | Name copying | 0.97 | 0.70 | 0.92 |
| | Word to first letter | 0.91 | 0.69 | 0.32 |
| | Word to last letter | 0.61 | 0.20 | 0.05 |
| | Year to following | 0.94 | 0.74 | 0.95 |
| Knowledge | Country to capital | 0.98 | 0.88 | 0.98 |
| | Country to language | 0.96 | 0.84 | 0.75 |
| | Object to superclass | 0.94 | 0.81 | 0.43 |
| | Product by company | 0.96 | 0.91 | 0.65 |
| | Work to location | 0.88 | 0.73 | 0.31 |
| Linguistic | Adj to comparative | 0.95 | 0.78 | 0.88 |
| | Adj to superlative | 0.94 | 0.73 | 0.54 |
| | Noun to pronoun | 0.96 | 0.68 | 0.58 |
| | Verb to past tense | 0.93 | 0.76 | 0.28 |
| | Word to antonym | 0.96 | 0.85 | 0.38 |
| | Word to compound | 0.80 | 0.65 | 0.17 |
| | Word to homophone | 0.46 | 0.38 | 0.02 |
| | Word to synonym | 0.95 | 0.79 | 0.21 |
Table 12: Correlation between the relation score of a head and the head's output in GPT-2 xl, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are $\leq$ 1.1e-45.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.88 | 0.85 | 0.18 |
| | Name copying | 0.95 | 0.83 | 0.66 |
| | Word to first letter | 0.86 | 0.72 | 0.56 |
| | Word to last letter | 0.56 | 0.42 | 0.33 |
| Knowledge | Country to capital | 0.91 | 0.90 | 0.84 |
| | Country to language | 0.89 | 0.89 | 0.49 |
| | Object to superclass | 0.81 | 0.83 | 0.39 |
| | Product by company | 0.81 | 0.78 | 0.31 |
| | Work to location | 0.70 | 0.70 | 0.21 |
| Linguistic | Adj to comparative | 0.91 | 0.88 | 0.72 |
| | Adj to superlative | 0.90 | 0.87 | 0.56 |
| | Noun to pronoun | 0.33 | 0.30 | 0.46 |
| | Verb to past tense | 0.91 | 0.80 | 0.54 |
| | Word to antonym | 0.91 | 0.80 | 0.35 |
| | Word to compound | 0.86 | 0.82 | 0.24 |
| | Word to homophone | 0.91 | 0.81 | 0.31 |
| | Word to synonym | 0.83 | 0.77 | 0.21 |
| Translation | English to French | 0.61 | 0.59 | 0.09 |
| | English to Spanish | 0.86 | 0.83 | 0.35 |
Table 13: Correlation between the suppression relation score of a head and the head's output in Llama-3.1 70B, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are 0.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.77 | 0.74 | 0.11 |
| | Name copying | 0.99 | 0.95 | 0.72 |
| | Word to first letter | 0.78 | 0.41 | 0.61 |
| | Word to last letter | 0.77 | 0.31 | 0.25 |
| Knowledge | Country to capital | 0.90 | 0.87 | 0.18 |
| | Country to language | 0.76 | 0.74 | 0.20 |
| | Object to superclass | 0.61 | 0.63 | 0.08 |
| | Product by company | 0.44 | 0.38 | 0.08 |
| | Work to location | 0.40 | 0.32 | 0.12 |
| Linguistic | Adj to comparative | 0.81 | 0.91 | 0.81 |
| | Adj to superlative | 0.87 | 0.93 | 0.62 |
| | Noun to pronoun | 0.80 | 0.57 | 0.40 |
| | Verb to past tense | 0.90 | 0.85 | 0.46 |
| | Word to antonym | 0.81 | 0.70 | 0.29 |
| | Word to compound | 0.84 | 0.76 | 0.24 |
| | Word to homophone | 0.89 | 0.61 | 0.17 |
| | Word to synonym | 0.75 | 0.65 | 0.09 |
| Translation | English to French | 0.74 | 0.65 | 0.06 |
| | English to Spanish | 0.84 | 0.81 | 0.26 |
Table 14: Correlation between the suppression relation score of a head and the head's output in Llama-3.1 8B, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are $\leq$ 2.6e-89.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.91 | 0.78 | 0.31 |
| | Name copying | 0.99 | 0.72 | 1.00 |
| | Word to first letter | 0.48 | 0.18 | 0.11 |
| | Word to last letter | 0.59 | 0.23 | 0.19 |
| | Year to following | 0.39 | 0.59 | 0.12 |
| Knowledge | Country to capital | 0.63 | 0.62 | 0.56 |
| | Country to language | 0.84 | 0.70 | 0.46 |
| | Object to superclass | 0.79 | 0.77 | 0.41 |
| | Work to location | 0.61 | 0.64 | 0.24 |
| Linguistic | Adj to comparative | 0.93 | 0.74 | 0.73 |
| | Noun to pronoun | 0.68 | 0.29 | 0.28 |
| | Verb to past tense | 0.96 | 0.75 | 0.73 |
| | Word to antonym | 0.90 | 0.77 | 0.32 |
| | Word to homophone | 0.61 | 0.39 | 0.03 |
| | Word to synonym | 0.82 | 0.63 | 0.16 |
Table 15: Correlation between the suppression relation score of a head and the head's output in Pythia 12B, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are $\leq$ 2.2e-45.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.88 | 0.81 | 0.41 |
| | Name copying | 0.98 | 0.79 | 0.96 |
| | Word to first letter | 0.81 | 0.37 | 0.31 |
| | Word to last letter | 0.30 | 0.08 | 0.24 |
| | Year to following | 0.45 | 0.80 | 0.33 |
| Knowledge | Country to capital | 0.92 | 0.91 | 0.66 |
| | Country to language | 0.89 | 0.81 | 0.51 |
| | Object to superclass | 0.86 | 0.78 | 0.33 |
| | Work to location | 0.73 | 0.58 | 0.21 |
| Linguistic | Adj to comparative | 0.95 | 0.83 | 0.59 |
| | Noun to pronoun | 0.86 | 0.51 | 0.56 |
| | Verb to past tense | 0.94 | 0.80 | 0.82 |
| | Word to antonym | 0.91 | 0.78 | 0.30 |
| | Word to homophone | 0.49 | 0.31 | 0.02 |
| | Word to synonym | 0.87 | 0.73 | 0.13 |
Table 16: Correlation between the suppression relation score of a head and the head's output in Pythia 6.9B, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are $\leq$ 3.6e-7.
| Category | Relation | Correlation w/o context | Correlation w/ context | Max relation score (over heads) |
| --- | --- | --- | --- | --- |
| Algorithmic | Copying | 0.97 | 0.71 | 0.29 |
| | Name copying | 0.99 | 0.72 | 0.97 |
| | Word to first letter | 0.78 | 0.52 | 0.04 |
| | Word to last letter | 0.78 | 0.54 | 0.06 |
| | Year to following | 0.75 | 0.52 | 0.32 |
| Knowledge | Country to capital | 0.94 | 0.80 | 0.72 |
| | Country to language | 0.96 | 0.78 | 0.50 |
| | Object to superclass | 0.89 | 0.82 | 0.23 |
| | Product by company | 0.88 | 0.77 | 0.33 |
| | Work to location | 0.83 | 0.62 | 0.18 |
| Linguistic | Adj to comparative | 0.86 | 0.60 | 0.38 |
| | Adj to superlative | 0.81 | 0.59 | 0.27 |
| | Noun to pronoun | 0.92 | 0.34 | 0.40 |
| | Verb to past tense | 0.84 | 0.64 | 0.17 |
| | Word to antonym | 0.53 | 0.37 | 0.05 |
| | Word to compound | 0.80 | 0.58 | 0.14 |
| | Word to homophone | 0.10 | 0.04 | 0.01 |
| | Word to synonym | 0.81 | 0.59 | 0.08 |
Table 17: Correlation between the suppression relation score of a head and the head's output in GPT-2 xl, with and without head contextualization. The "max relation score" is the highest relation score achieved by a head in the model. All p-values observed are $\leq$ 2.3e-3.
| Relation | Prompt |
| --- | --- |
| Adj to comparative | lovely-> lovelier; edgy-> edgier; <s>-> |
| Copying | walk-> walk; cat-> cat; water-> water; <s>-> |
| Country to capital | The capital of <s> is |
| Country to language | The official language of <s> is |
| English to Spanish | apartment-> departamento; computer-> computadora; tribe-> tribu; <s>-> |
| Name copying | John-> John; Donna-> Donna; <s>-> |
| Noun to pronoun | mother-> she; father-> he; tribe-> they; actress-> she; apartment-> it; <s>-> |
| Object to superclass | A <s> is a kind of |
| Product by company | Nesquik is made by Nestlé; Mustang is made by Ford; <s> is made by |
| Verb to past tense | hike->hiked; purchase-> purchased; <s>-> |
| Word to first letter | word-> w, o, r, d; cat-> c, a, t; <s>-> |
| Word to last letter | word-> d, r, o, w; cat-> t, a, c; <s>-> |
| Year to following | 1300-> 1301; 1000-> 1001; <s>-> |
Table 18: Relations and prompts used in the causal experiment. The $\langle s\rangle$ string is replaced with the relation's source tokens.
B.2 Causal Experiment
In § 4.2 we measured the causal effect of removing the heads that implement a specific operation on the model's performance in handling queries that depend on that operation.
Implementation details
We evaluate models on tasks for 13 relations. For each model, we filter out relations where (a) the base accuracy is very low ($<$ 0.1) or (b) there is no dataset for the relation (see § A). The task prompts used for the different relations are presented in Table 18. Notably, when ablating an attention head, we remove its output only from the last position of the prompt.
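The last-position ablation can be sketched in NumPy. This is an illustrative stand-in with synthetic tensors; in practice the subtraction would be applied inside the model via hooks rather than on detached arrays.

```python
import numpy as np

# Toy shapes: n prompt positions, residual-stream width d.
n, d = 5, 16
rng = np.random.default_rng(2)
residual = rng.normal(size=(n, d))   # residual stream, one row per position
head_out = rng.normal(size=(n, d))   # output of the head being ablated

# Remove the head's contribution only at the last prompt position.
ablated = residual.copy()
ablated[-1] -= head_out[-1]

# Earlier positions are left untouched by the ablation.
assert np.allclose(ablated[:-1], residual[:-1])
```

This isolates the head's effect on the final next-token prediction while leaving its contributions to earlier positions (and thus to downstream contextualization) intact.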
Additional results
In Tables 19, 20, 21, 22, 23 we present the extended experiment results for Llama-3.1 70B, Llama-3.1 8B, Pythia 12B, Pythia 6.9B, GPT-2 xl.
| Relation name | # heads removed | Base (TR) | -TR | -RND | # CTR tasks | Base (CTR) | -TR (CTR) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adj to comparative | 175 | 0.98 | $\downarrow$ 13% 0.85 | $\downarrow$ 0% 0.98 $±$ 0.00 | 5 | 0.94 $±$ 0.05 | $\downarrow$ 3% 0.92 $±$ 0.08 |
| Copying | 250 | 0.97 | $\downarrow$ 30% 0.68 | $\downarrow$ 0% 0.97 $±$ 0.01 | 3 | 0.97 $±$ 0.03 | $\downarrow$ 23% 0.75 $±$ 0.34 |
| Country to capital | 118 | 0.84 | $\downarrow$ 66% 0.29 | $\uparrow$ 1% 0.85 $±$ 0.09 | 5 | 0.93 $±$ 0.08 | $\uparrow$ 0% 0.94 $±$ 0.09 |
| Country to language | 133 | 0.96 | $\downarrow$ 6% 0.90 | $\downarrow$ 0% 0.96 $±$ 0.00 | 4 | 0.92 $±$ 0.08 | $\downarrow$ 1% 0.92 $±$ 0.10 |
| English to Spanish | 175 | 0.91 | $\downarrow$ 6% 0.85 | $\uparrow$ 0% 0.91 $±$ 0.00 | 4 | 0.97 $±$ 0.03 | $\uparrow$ 0% 0.97 $±$ 0.03 |
| Name copying | 205 | 0.99 | $\downarrow$ 95% 0.05 | $\uparrow$ 1% 1.00 $±$ 0.00 | 3 | 0.97 $±$ 0.03 | $\downarrow$ 15% 0.83 $±$ 0.23 |
| Noun to pronoun | 154 | 0.98 | $\uparrow$ 0% 0.98 | $\uparrow$ 0% 0.98 $±$ 0.00 | 5 | 0.93 $±$ 0.08 | $\downarrow$ 1% 0.92 $±$ 0.09 |
| Object to superclass | 119 | 0.79 | $\downarrow$ 4% 0.76 | $\downarrow$ 2% 0.77 $±$ 0.02 | 5 | 0.88 $±$ 0.11 | $\downarrow$ 3% 0.85 $±$ 0.15 |
| Product by company | 59 | 0.67 | $\downarrow$ 4% 0.64 | $\downarrow$ 0% 0.67 $±$ 0.00 | 1 | 0.79 $±$ 0.00 | $\downarrow$ 2% 0.77 $±$ 0.00 |
| Word to first letter | 250 | 1.00 | $\downarrow$ 8% 0.92 | $\downarrow$ 0% 1.00 $±$ 0.00 | 5 | 0.94 $±$ 0.05 | $\downarrow$ 5% 0.89 $±$ 0.14 |
| Word to last letter | 250 | 0.92 | $\downarrow$ 18% 0.76 | $\uparrow$ 1% 0.93 $±$ 0.01 | 5 | 0.94 $±$ 0.05 | $\uparrow$ 1% 0.95 $±$ 0.04 |
Table 19: Accuracy of Llama-3.1 70B on tasks for a target relation (TR) versus on control (CTR) tasks, when removing heads implementing the relation compared to when removing random heads (RND). Results for RND heads are averaged over 5 experiments.
| Relation name | # heads removed | Base (TR) | -TR | -RND | # CTR tasks | Base (CTR) | -TR (CTR) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adj to comparative | 69 | 0.98 | $\downarrow$ 7% 0.91 | $\downarrow$ 3% 0.95 $±$ 0.05 | 4 | 0.96 $±$ 0.04 | $\uparrow$ 0% 0.96 $±$ 0.04 |
| Copying | 150 | 1.00 | $\downarrow$ 94% 0.06 | $\downarrow$ 0% 1.00 $±$ 0.00 | 3 | 0.95 $±$ 0.04 | $\downarrow$ 5% 0.91 $±$ 0.05 |
| Country to capital | 19 | 0.89 | $\downarrow$ 75% 0.22 | $\uparrow$ 2% 0.91 $±$ 0.03 | 5 | 0.87 $±$ 0.12 | $\uparrow$ 1% 0.87 $±$ 0.12 |
| Country to language | 30 | 0.98 | $\downarrow$ 50% 0.49 | $\uparrow$ 1% 0.99 $±$ 0.01 | 5 | 0.98 $±$ 0.02 | $\downarrow$ 0% 0.98 $±$ 0.02 |
| English to Spanish | 54 | 0.94 | $\uparrow$ 3% 0.97 | $\downarrow$ 1% 0.93 $±$ 0.01 | 3 | 0.95 $±$ 0.04 | $\uparrow$ 2% 0.97 $±$ 0.02 |
| Name copying | 70 | 1.00 | $\downarrow$ 87% 0.13 | $\downarrow$ 0% 1.00 $±$ 0.00 | 2 | 0.94 $±$ 0.05 | $\downarrow$ 4% 0.90 $±$ 0.08 |
| Noun to pronoun | 35 | 0.98 | $\downarrow$ 0% 0.98 | $\uparrow$ 0% 0.99 $±$ 0.00 | 5 | 0.97 $±$ 0.04 | $\uparrow$ 1% 0.98 $±$ 0.03 |
| Object to superclass | 34 | 0.74 | $\downarrow$ 11% 0.66 | $\uparrow$ 1% 0.75 $±$ 0.01 | 2 | 0.79 $±$ 0.09 | $\downarrow$ 3% 0.77 $±$ 0.07 |
| Product by company | 12 | 0.54 | $\downarrow$ 5% 0.51 | $\uparrow$ 4% 0.56 $±$ 0.01 | 1 | 0.70 $±$ 0.00 | $\downarrow$ 1% 0.69 $±$ 0.00 |
| Verb to past tense | 113 | 0.70 | $\downarrow$ 61% 0.27 | $\downarrow$ 7% 0.65 $±$ 0.10 | 2 | 0.71 $±$ 0.18 | $\downarrow$ 1% 0.70 $±$ 0.14 |
| Word to first letter | 150 | 1.00 | $\downarrow$ 98% 0.02 | $\downarrow$ 0% 1.00 $±$ 0.00 | 5 | 0.96 $±$ 0.04 | $\downarrow$ 30% 0.67 $±$ 0.33 |
Table 20: Accuracy of Llama-3.1 8B on tasks for a target relation (TR) versus on control (CTR) tasks, when removing heads implementing the relation compared to when removing random heads (RND). Results for RND heads are averaged over 5 experiments.
| Relation name | # heads removed | Base (TR) | -TR | -RND | # CTR tasks | Base (CTR) | -TR (CTR) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adj to comparative | 150 | 0.91 | $\downarrow$ 77% 0.20 | $\downarrow$ 10% 0.82 $±$ 0.07 | 3 | 0.92 $±$ 0.04 | $\downarrow$ 32% 0.63 $±$ 0.18 |
| Copying | 150 | 1.00 | $\downarrow$ 32% 0.68 | $\downarrow$ 0% 1.00 $±$ 0.00 | 3 | 0.95 $±$ 0.05 | $\downarrow$ 7% 0.88 $±$ 0.11 |
| Country to capital | 75 | 0.97 | $\downarrow$ 100% 0.00 | $\downarrow$ 2% 0.95 $±$ 0.02 | 2 | 0.89 $±$ 0.02 | $\uparrow$ 0% 0.90 $±$ 0.01 |
| Country to language | 94 | 1.00 | $\downarrow$ 92% 0.08 | $\downarrow$ 4% 0.96 $±$ 0.01 | 2 | 0.89 $±$ 0.01 | $\downarrow$ 0% 0.89 $±$ 0.01 |
| Name copying | 150 | 1.00 | $\downarrow$ 76% 0.24 | $\downarrow$ 0% 1.00 $±$ 0.00 | 2 | 0.90 $±$ 0.02 | $\uparrow$ 2% 0.92 $±$ 0.05 |
| Noun to pronoun | 105 | 0.88 | $\downarrow$ 48% 0.46 | $\downarrow$ 2% 0.86 $±$ 0.03 | 5 | 0.90 $±$ 0.07 | $\downarrow$ 3% 0.88 $±$ 0.08 |
| Object to superclass | 75 | 0.78 | $\downarrow$ 50% 0.39 | $\downarrow$ 13% 0.68 $±$ 0.03 | 2 | 0.90 $±$ 0.02 | $\downarrow$ 3% 0.87 $±$ 0.09 |
| Verb to past tense | 150 | 0.22 | $\downarrow$ 84% 0.04 | $\uparrow$ 17% 0.26 $±$ 0.11 | 1 | 0.03 $±$ 0.00 | $\downarrow$ 33% 0.02 $±$ 0.00 |
| Word to first letter | 150 | 0.91 | $\downarrow$ 63% 0.34 | $\downarrow$ 4% 0.87 $±$ 0.04 | 5 | 0.91 $±$ 0.08 | $\downarrow$ 19% 0.74 $±$ 0.30 |
| Year to following | 56 | 0.92 | $\downarrow$ 100% 0.00 | $\downarrow$ 5% 0.87 $±$ 0.07 | 2 | 0.83 $±$ 0.05 | $\downarrow$ 5% 0.79 $±$ 0.03 |
Table 21: Accuracy of Pythia 12B on tasks for a target relation (TR) versus its accuracy on control (CTR) tasks, when removing heads implementing the relation compared to when removing random heads (RND). Results for RND heads are averaged over 5 experiments.
| Relation name | # heads removed | Base (TR) | -TR | -RND | # CTR tasks | Base (CTR) | -TR (CTR) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adj to comparative | 124 | 0.52 | $\downarrow$ 100% 0.00 | $\downarrow$ 51% 0.25 $±$ 0.18 | 1 | 0.68 $±$ 0.00 | $\downarrow$ 25% 0.51 $±$ 0.00 |
| Copying | 150 | 1.00 | $\downarrow$ 93% 0.07 | $\downarrow$ 1% 0.99 $±$ 0.01 | 0 | | |
| Country to capital | 45 | 0.97 | $\downarrow$ 100% 0.00 | $\downarrow$ 1% 0.96 $±$ 0.02 | 1 | 1.00 $±$ 0.00 | $\downarrow$ 0% 1.00 $±$ 0.00 |
| Country to language | 74 | 0.97 | $\downarrow$ 92% 0.08 | $\uparrow$ 1% 0.98 $±$ 0.01 | 0 | | |
| Name copying | 143 | 1.00 | $\downarrow$ 97% 0.03 | $\downarrow$ 1% 0.99 $±$ 0.01 | 0 | | |
| Noun to pronoun | 102 | 0.68 | $\downarrow$ 46% 0.37 | $\uparrow$ 13% 0.77 $±$ 0.09 | 3 | 0.68 $±$ 0.11 | $\downarrow$ 25% 0.51 $±$ 0.22 |
| Object to superclass | 67 | 0.78 | $\downarrow$ 53% 0.37 | $\downarrow$ 4% 0.75 $±$ 0.02 | 2 | 0.71 $±$ 0.03 | $\uparrow$ 1% 0.71 $±$ 0.18 |
| Verb to past tense | 150 | 0.43 | $\downarrow$ 94% 0.03 | $\downarrow$ 16% 0.36 $±$ 0.07 | 0 | | |
| Word to first letter | 66 | 1.00 | $\downarrow$ 100% 0.00 | $\downarrow$ 0% 1.00 $±$ 0.00 | 2 | 0.97 $±$ 0.00 | $\downarrow$ 13% 0.85 $±$ 0.13 |
| Year to following | 52 | 0.73 | $\downarrow$ 100% 0.00 | $\uparrow$ 5% 0.77 $±$ 0.07 | 2 | 0.73 $±$ 0.05 | $\downarrow$ 2% 0.71 $±$ 0.05 |
Table 22: Accuracy of Pythia 6.9B on tasks for a target relation (TR) versus its accuracy on control (CTR) tasks, when removing heads implementing the relation compared to when removing random heads (RND). Results for RND heads are averaged over 5 experiments.
| Relation name | # heads removed | Base (TR) | -TR | -RND | # CTR tasks | Base (CTR) | -TR (CTR) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Copying | 150 | 0.99 | $\downarrow$ 30% 0.69 | $\downarrow$ 0% 0.99 $±$ 0.00 | 0 | | |
| Country to capital | 38 | 0.88 | $\downarrow$ 100% 0.00 | $\downarrow$ 3% 0.86 $±$ 0.05 | 1 | 0.71 $±$ 0.00 | $\uparrow$ 2% 0.72 $±$ 0.00 |
| Country to language | 148 | 0.96 | $\downarrow$ 91% 0.08 | $\downarrow$ 2% 0.94 $±$ 0.01 | 0 | | |
| Name copying | 133 | 0.76 | $\downarrow$ 100% 0.00 | $\downarrow$ 15% 0.65 $±$ 0.08 | 1 | 0.71 $±$ 0.00 | $\downarrow$ 15% 0.60 $±$ 0.00 |
| Noun to pronoun | 27 | 0.71 | $\downarrow$ 26% 0.53 | $\downarrow$ 2% 0.69 $±$ 0.04 | 4 | 0.72 $±$ 0.13 | $\downarrow$ 3% 0.69 $±$ 0.16 |
| Object to superclass | 99 | 0.71 | $\downarrow$ 54% 0.32 | $\downarrow$ 1% 0.70 $±$ 0.02 | 1 | 0.71 $±$ 0.00 | $\downarrow$ 42% 0.41 $±$ 0.00 |
| Product by company | 73 | 0.40 | $\downarrow$ 81% 0.08 | $\downarrow$ 0% 0.40 $±$ 0.00 | 1 | 0.40 $±$ 0.00 | $\uparrow$ 2% 0.41 $±$ 0.00 |
| Verb to past tense | 150 | 0.40 | $\downarrow$ 56% 0.18 | $\downarrow$ 4% 0.38 $±$ 0.18 | 0 | | |
| Word to first letter | 62 | 0.18 | $\downarrow$ 16% 0.16 | $\downarrow$ 1% 0.18 $±$ 0.02 | 1 | 0.04 $±$ 0.00 | $\uparrow$ 250% 0.15 $±$ 0.00 |
| Year to following | 54 | 0.53 | $\downarrow$ 100% 0.00 | $\downarrow$ 5% 0.50 $±$ 0.03 | 1 | 0.71 $±$ 0.00 | $\downarrow$ 36% 0.45 $±$ 0.00 |
Table 23: Accuracy of GPT-2 xl on tasks for a target relation (TR) versus its accuracy on control (CTR) tasks, when removing heads implementing the relation compared to when removing random heads (RND). Results for RND heads are averaged over 5 experiments.
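The arrow notation in Tables 22 and 23 reports the relative change from the base accuracy, rounded to whole percent. A minimal helper reproducing that notation (the function name is ours, not from the released code):

```python
def relative_change(base: float, ablated: float) -> str:
    """Render an accuracy change as in Tables 22-23, e.g. base 1.00 and
    ablated 0.07 gives 'down 93%' (shown as a down-arrow in the tables)."""
    if base == 0:
        return "n/a"
    pct = round(100 * (ablated - base) / base)
    arrow = "up" if pct > 0 else "down"
    return f"{arrow} {abs(pct)}%"

print(relative_change(1.00, 0.07))  # down 93%
```

For instance, the "year to following" row of Table 22 (base 0.73, with-context 0.77) yields "up 5%", matching the reported $\uparrow$ 5%.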
Appendix C Generalization to Multi-Token Entities: Additional Results
In § 4.3 we conducted an experiment that evaluates how well the classifications by MAPS generalize to contextualized inputs. Table 24 shows the full results of this experiment. We omit the correlations for GPT-2 xl and the relation word to last letter, as all static scores are very small ($\leq 0.05$).
| Model | Relation | # samples | W/o context (single-token) | W/o context (multi-token) | W/ context (single-token) | W/ context (multi-token) |
| --- | --- | --- | --- | --- | --- | --- |
| Pythia 12B | Copying | 283 | 0.91 | 0.85 | 0.48 | 0.44 |
| | Country to capital | 30 | 0.94 | 0.93 | 0.85 | 0.87 |
| | Country to language | 70 | 0.94 | 0.90 | 0.88 | 0.83 |
| | Name copying | 83 | 0.87 | 0.76 | 0.38 | 0.33 |
| | Noun to pronoun | 174 | 0.84 | 0.85 | 0.78 | 0.79 |
| | Object to superclass | 91 | 0.88 | 0.89 | 0.84 | 0.86 |
| | Word to first letter | 77 | 0.83 | 0.73 | 0.56 | 0.64 |
| | Word to last letter | 77 | 0.34 | 0.50 | 0.11 | 0.09 |
| | Word to synonym | 71 | 0.92 | 0.86 | 0.61 | 0.58 |
| | Work to location | 65 | 0.77 | 0.72 | 0.74 | 0.70 |
| | Year to following | 65 | 0.90 | 0.84 | 0.64 | 0.60 |
| Pythia 6.9B | Copying | 283 | 0.90 | 0.87 | 0.34 | 0.32 |
| | Country to capital | 30 | 0.95 | 0.93 | 0.89 | 0.89 |
| | Country to language | 70 | 0.92 | 0.88 | 0.85 | 0.83 |
| | Name copying | 83 | 0.94 | 0.92 | 0.47 | 0.47 |
| | Noun to pronoun | 174 | 0.89 | 0.85 | 0.69 | 0.70 |
| | Object to superclass | 91 | 0.88 | 0.90 | 0.86 | 0.82 |
| | Word to first letter | 77 | 0.89 | 0.79 | 0.59 | 0.66 |
| | Word to last letter | 77 | 0.45 | 0.70 | 0.44 | 0.44 |
| | Word to synonym | 71 | 0.94 | 0.91 | 0.62 | 0.62 |
| | Work to location | 65 | 0.79 | 0.76 | 0.71 | 0.75 |
| | Year to following | 65 | 0.94 | 0.87 | 0.72 | 0.67 |
| GPT-2 xl | Copying | 301 | 0.95 | 0.88 | 0.68 | 0.64 |
| | Country to capital | 34 | 0.98 | 0.97 | 0.87 | 0.86 |
| | Country to language | 70 | 0.96 | 0.91 | 0.82 | 0.80 |
| | Name copying | 91 | 0.97 | 0.93 | 0.60 | 0.58 |
| | Noun to pronoun | 154 | 0.97 | 0.95 | 0.47 | 0.56 |
| | Object to superclass | 97 | 0.93 | 0.89 | 0.83 | 0.82 |
| | Word to first letter | 78 | 0.92 | 0.89 | 0.53 | 0.72 |
| | Word to synonym | 79 | 0.95 | 0.89 | 0.79 | 0.76 |
| | Work to location | 67 | 0.89 | 0.80 | 0.74 | 0.76 |
| | Year to following | 90 | 0.95 | 0.82 | 0.74 | 0.63 |
Table 24: Extended results for the multi-token experiment, presented in Section 4.3. All observed p-values are $\leq$ 9.3e-4.
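The entries in Table 24 are Pearson correlations between static and dynamic relation scores. A self-contained sketch of the coefficient, applied to toy score lists (the values below are illustrative, not taken from the paper):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy static vs. dynamic relation scores for four hypothetical heads.
static = [0.9, 0.1, 0.5, 0.8]
dynamic = [0.85, 0.05, 0.6, 0.7]
print(round(pearson(static, dynamic), 2))  # ≈ 0.97
```

In practice one would also report a p-value (e.g. via `scipy.stats.pearsonr`); the pure-Python version above only computes the coefficient.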
Appendix D Comparison to Head Operations Identified in Prior Works
Name-mover heads in GPT-2 small
Wang et al. (2023) studied the Indirect Object Identification circuit in GPT-2 small. Analyzing the operations of the circuit's heads, they defined heads that copy names as Name-Mover heads and heads that suppress names as Negative Name-Mover heads. They also classified heads that contribute to these tasks when the original mover heads are ablated as "backup" mover heads.
Using MAPS, we classified all three name-mover heads as implementing the name copying relation, and the two negative name-mover heads as implementing the suppression variant of name copying. We note that Wang et al. (2023) performed a similar analysis. However, by applying MAPS to all heads in the model, and not just those in the discovered circuit, we identified 21 additional name-copying heads, 6 of which Wang et al. (2023) labeled as "backup" heads. One backup mover head and one backup negative mover head identified by Wang et al. (2023) were not identified by MAPS. Moreover, we find that each of the five identified name-mover heads implements a myriad of other relations. In Figure 6(a) we present the name copying relation scores for all heads in GPT-2 small, alongside the heads classified by Wang et al. (2023).
We further examined the name copying heads not classified by Wang et al. (2023), to study whether their omission was mostly due to limited involvement in the specific task studied in that work, or instead a consequence of inaccurate estimations by MAPS. These heads show a strong correlation (0.94, p-value of $2.5e{-7}$) between their name copying static and dynamic relation scores (for the prompt "This is a document about $\langle$ s $\rangle$", see § 4.2), when attention is restricted to the name position, suggesting that they indeed copy names when they attend to them. However, the attention weight assigned to the name token may change depending on the context. For example, head 8.11 in GPT-2 small has a static relation score of 0.88. Its dynamic relation score is 0.23 for the prompt "This is a document about $\langle$ s $\rangle$", but it increases substantially to 0.92 for the prompt "John->John; Donna->Donna; $\langle$ s $\rangle$ ->". We anticipate that other relation heads will demonstrate the name-copying functionality for other prompts or interventions. Crafting prompts that steer heads to demonstrate a specific functionality over another (for example, by adapting MAPS to the $W_{QK}$ matrix) is an interesting direction for future work.
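The static relation score discussed above can be illustrated schematically. The sketch below is a simplification, not the actual MAPS implementation: it treats the score as the fraction of relation pairs whose target token appears among a head's top-k parameter-based vocabulary mappings of the source token. The `top_mappings` dictionary and its contents are hypothetical:

```python
def static_relation_score(pairs, top_mappings, k=5):
    """Fraction of (source, target) pairs whose target appears among the
    head's top-k mapped tokens for the source token (schematic version)."""
    hits = sum(1 for s, t in pairs if t in top_mappings.get(s, [])[:k])
    return hits / len(pairs)

# Hypothetical head mappings and country-to-capital pairs.
top_mappings = {
    "France": ["Paris", "French", "Europe"],
    "Japan": ["Tokyo", "Japanese", "Asia"],
    "Peru": ["Spanish", "Andes", "Inca"],
}
pairs = [("France", "Paris"), ("Japan", "Tokyo"), ("Peru", "Lima")]
print(round(static_relation_score(pairs, top_mappings), 2))  # 0.67
```

The dynamic score is computed analogously from the head's outputs during inference, which is why context-dependent attention can drive the two apart, as in the head 8.11 example.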
Mover heads in GPT-2 medium
Merullo et al. (2024a) studied the Indirect Object Identification (IOI) and Colored Objects circuits in GPT-2 medium. They discovered two sets of attention heads implementing certain functions, both called "Mover" heads. Heads from the first set copy names (in IOI), and heads from the second set copy colors (in the Colored Objects task). The authors also point out a significant overlap between the two sets.
Using MAPS, we classified all mover heads as implementing the name copying relation. We find that many of these heads also implement the relations year to following, country to language, country to capital, and copying. Lastly, we identify 31 other name-copying heads. Notably, our count omits heads 14.5, 17.10, 16.0, 18.12, and 21.7, which are labeled as Mover heads in Figure 2 of Merullo et al. (2024a); to the best of our understanding, the paper does not explain why they are classified as such, while describing other heads as more important.
Capital heads in GPT-2 medium
Merullo et al. (2024a) also studied a circuit for resolving the capital city of a country (in their Appendix I). MAPS identified all attention heads classified in that study, along with 15 others. In Figure 6(b) we present the name copying and country to capital relation scores for all heads in GPT-2 medium, alongside the heads classified by Merullo et al. (2024a).
<details>
<summary>x11.png Details</summary>

### Visual Description
## Heatmaps: GPT-2 Name-Copying Heads and (Suppression) Name-Copying Heads
### Overview
The image presents two heatmaps side-by-side, visualizing the "Name Copying score" for different heads and layers in a GPT-2 model. The left heatmap shows "Name-Copying heads," while the right heatmap shows "(Suppression) Name-Copying heads." Both heatmaps use a color gradient to represent the Name Copying score, ranging from dark purple (0.0) to bright yellow (1.0). The heatmaps also overlay markers indicating 'Name-Mover Heads' and 'Backup Name-Mover Heads' as classified by 'Interp. in the Wild'.
### Components/Axes
**Left Heatmap (GPT-2: Name-Copying heads):**
* **Title:** GPT-2: Name-Copying heads
* **Y-axis:** Head, with labels from 0 to 11.
* **X-axis:** Layer, with labels from 0 to 11.
* **Color Bar (Right Side):** Name Copying score, ranging from 0.0 to 1.0 in increments of 0.2.
* **Legend (Top-Left):**
* 'Interp. in the Wild' classifications
* 'X' represents Name-Mover Heads
* Gray dot represents Backup Name-Mover Heads
**Right Heatmap (GPT-2: (Suppression) Name-Copying heads):**
* **Title:** GPT-2: (Suppression) Name-Copying heads
* **Y-axis:** Head, with labels from 0 to 11.
* **X-axis:** Layer, with labels from 0 to 11.
* **Color Bar (Right Side):** (Suppression) Name Copying score, ranging from 0.0 to 1.0 in increments of 0.2.
* **Legend (Top-Left):**
* 'Interp. in the Wild' classifications
* 'X' represents (Negative) Name-Mover Heads
* Gray dot represents Backup (Negative) Name-Mover Heads
### Detailed Analysis
**Left Heatmap (GPT-2: Name-Copying heads):**
* **Head 0:**
* Layer 9: Yellow, approximately 0.8-1.0.
* Layer 10: Yellow, approximately 0.8-1.0.
* Layer 11: Green, approximately 0.6-0.8.
* **Head 1:**
* Layer 10: Green, approximately 0.6-0.8.
* Layer 11: Green, approximately 0.6-0.8.
* **Head 2:**
* Layer 9: Yellow, approximately 0.8-1.0.
* **Head 3:**
* Layer 4: Blue, approximately 0.2-0.4.
* **Head 4:**
* Layer 1: Blue, approximately 0.2-0.4.
* **Head 6:**
* Layer 0: Blue, approximately 0.2-0.4.
* **Head 9:**
* Layer 6: Yellow, approximately 0.8-1.0.
* Layer 7: Yellow, approximately 0.8-1.0.
* **Head 10:**
* Layer 8: Yellow, approximately 0.8-1.0.
* Layer 9: Green, approximately 0.6-0.8.
* **Head 11:**
* Layer 9: Green, approximately 0.6-0.8.
**Name-Mover Heads (X markers):**
* Head 0, Layer 10
* Head 6, Layer 9
* Head 9, Layer 6
**Backup Name-Mover Heads (Gray dots):**
* Head 0, Layer 9
* Head 1, Layer 10
* Head 6, Layer 10
* Head 9, Layer 7
**Right Heatmap (GPT-2: (Suppression) Name-Copying heads):**
* **Head 4:**
* Layer 6: Blue, approximately 0.2-0.4.
* **Head 6:**
* Layer 0: Blue, approximately 0.2-0.4.
* **Head 9:**
* Layer 6: Green, approximately 0.4-0.6.
* **Head 11:**
* Layer 5: Blue, approximately 0.0-0.2.
**(Negative) Name-Mover Heads (X markers):**
* Head 9, Layer 10
* Head 11, Layer 11
**Backup (Negative) Name-Mover Heads (Gray dots):**
* Head 6, Layer 9
### Key Observations
* The left heatmap shows more intense "Name Copying" activity (yellow and green) compared to the right heatmap, which is predominantly dark purple and blue.
* Name-Mover Heads and Backup Name-Mover Heads are concentrated in specific head-layer combinations.
* The suppression heatmap shows very little activity, with most values close to 0.
### Interpretation
The heatmaps visualize the extent to which different heads and layers in GPT-2 small implement name copying. The left heatmap shows scores for the name copying relation, while the right heatmap shows scores for its suppression variant (corresponding to the Negative Name-Mover heads). The markers for Name-Mover Heads and Backup Name-Mover Heads highlight the heads classified by Wang et al. (2023), allowing a direct comparison with the MAPS scores. The data suggests that specific heads and layers are more responsible for name-copying behavior than others.
</details>
(a) Comparison between "Name-Mover" heads discovered by Wang et al. (2023) and heads which implement the name copying relation, discovered by MAPS.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Heatmap: GPT-2 Medium Head Analysis
### Overview
The image presents two heatmaps, each analyzing the heads of a GPT-2 medium model. The left heatmap focuses on "Name Copying" heads, while the right heatmap focuses on "Country to Capital" heads. Both heatmaps share a similar structure, displaying the interaction between layers (x-axis) and heads (y-axis). The color intensity represents a score, with yellow indicating higher scores and dark purple indicating lower scores. 'Circuits Components Reused' classifications are marked with an "X" on each heatmap.
### Components/Axes
* **Titles:**
* Left: "GPT-2 medium: Name Copying heads"
* Right: "GPT-2 medium: Country to capital heads"
* **Axes:**
* X-axis (both heatmaps): "Layer" with ticks from 0 to 22 in increments of 2.
* Y-axis (both heatmaps): "Head" with ticks from 0 to 15.
* **Color Scales:**
* Left: "Name Copying score" ranging from 0.0 to 1.0.
* Right: "Country to capital score" ranging from 0.0 to 1.0.
* **Legend (both heatmaps):** Located in the lower-left corner of each heatmap.
* Text: "'Circuits Components Reused' classifications"
* Symbol: "X"
* Left: "Mover Heads"
* Right: "Capital heads"
### Detailed Analysis
#### Left Heatmap: Name Copying Heads
* **General Trend:** The heatmap is mostly dark purple, indicating low "Name Copying" scores across most layer-head combinations. There are a few scattered areas with higher scores (green and yellow).
* **Mover Heads (marked with "X"):**
* Layer 1, Head 0: Not a mover head.
* Layer 8, Head 5: Mover Head.
* Layer 15, Head 13: Mover Head.
* Layer 14, Head 15: Mover Head.
* Layer 16, Head 15: Mover Head.
* Layer 18, Head 15: Mover Head.
* Layer 20, Head 15: Mover Head.
* Layer 22, Head 15: Mover Head.
* Layer 14, Head 14: Mover Head.
* Layer 16, Head 14: Mover Head.
* **Specific Data Points:**
* Layer 6, Head 0: Score ~0.8
* Layer 10, Head 8: Score ~0.6
* Layer 14, Head 15: Score ~0.8
* Layer 16, Head 15: Score ~0.8
* Layer 18, Head 15: Score ~0.6
* Layer 20, Head 15: Score ~0.6
* Layer 22, Head 15: Score ~0.6
#### Right Heatmap: Country to Capital Heads
* **General Trend:** Similar to the left heatmap, this one is also predominantly dark purple, indicating low "Country to Capital" scores. There are a few scattered areas with higher scores.
* **Capital Heads (marked with "X"):**
* Layer 12, Head 5: Capital Head.
* Layer 14, Head 13: Capital Head.
* Layer 18, Head 1: Capital Head.
* Layer 20, Head 13: Capital Head.
* Layer 20, Head 14: Capital Head.
* Layer 8, Head 0: Capital Head.
* **Specific Data Points:**
* Layer 12, Head 5: Score ~0.6
* Layer 14, Head 13: Score ~0.6
* Layer 18, Head 1: Score ~0.8
* Layer 20, Head 13: Score ~0.2
* Layer 20, Head 14: Score ~0.2
* Layer 8, Head 0: Score ~0.6
* Layer 22, Head 0: Score ~0.8
### Key Observations
* Both heatmaps show sparse activation patterns, with most layer-head combinations having low scores.
* The "Mover Heads" and "Capital Heads" are concentrated in specific layers and heads, suggesting specialized roles within the GPT-2 model.
* The "Name Copying" task seems to have more active heads in the later layers (14-22) compared to the "Country to Capital" task.
### Interpretation
The heatmaps provide insights into how different heads within the GPT-2 medium model contribute to specific tasks. The sparse activation patterns suggest that only a subset of heads are actively involved in "Name Copying" and "Country to Capital" tasks. The concentration of "Mover Heads" and "Capital Heads" in specific layers and heads indicates that these heads may have learned specialized functions related to these tasks. The difference in activation patterns between the two tasks suggests that different sets of heads are utilized for different types of knowledge processing. The data suggests that the model has learned to distribute the workload across its heads, with some heads specializing in specific tasks or sub-tasks.
</details>
(b) Comparison between "Name-Mover" and "Capital" heads discovered by Merullo et al. (2024a) and heads which implement the name copying and the country to capital relations discovered in our work.
Figure 6: Comparison between relation heads discovered by MAPS and heads classified in prior works.
Appendix E Automatic Mapping of Salient Head Operations
E.1 Automatic Functionality Inference
In § 5.1 we showed that GPT-4o can be utilized to interpret attention headsâ salient operations. Here, we provide additional implementation details and present an evaluation of the interpretation quality.
Implementation details
We found that GPT-4o sometimes describes in words that the pattern is unclear, rather than just outputting the word "Unclear" as requested. To handle these cases, we classify every head for which GPT-4o's response contained the string "clear" as a head where a pattern was not detected. We view this as an upper bound on the true ratio of heads with undetected patterns. Also, for some heads, GPT-4o would stop generating descriptions mid-generation; we hypothesize this is caused by strings in the salient mappings that are treated as special GPT-4o tokens. We resolved this issue by querying GPT-4o again with other random seeds. We note that in several mappings the salient tokens were decoded as an unreadable character; this could be resolved by alternating between the decoding functions of the Transformers package (Wolf et al., 2020).
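The response-handling heuristics above can be sketched as a small routine. This is an illustrative reconstruction, not the released code; `parse_head_description` is a hypothetical name:

```python
import json

def parse_head_description(response_text):
    """Apply the handling heuristics to a GPT-4o response:
    a truncated (non-JSON) generation triggers a retry with a new seed,
    and any pattern description containing the string 'clear' (e.g.
    'Unclear', 'no clear pattern') is treated as an undetected pattern,
    which upper-bounds the true rate of undetected patterns."""
    try:
        pattern = json.loads(response_text).get("Observed pattern", "")
    except json.JSONDecodeError:
        return None  # mid-generation stop: re-query with another random seed
    if "clear" in pattern.lower():
        return None  # pattern not detected
    return pattern
```

Note that the substring check deliberately over-counts: a valid description that happens to contain "clear" would also be discarded, which is consistent with treating the result as an upper bound.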
Prompt format
We present the prompt used to query GPT-4o in Table 26.
| Head | Salient mappings | GPT-4o description |
| --- | --- | --- |
| Pythia 6.9B 15.3 | osevelt: 1943, 1941, 1940, 1930, 1936 Roosevelt: 1943, 1941, 1936, 1940, 1930 FDR: 1943, 1942, 1941, 1938, 1936 Napole: 1800, 1800, 18,18, 1840 oslov: 1968, 1970, 1960, 1964, 1965 Napoleon: 1800, 1800,18, 18, Napoleon taire: 1840, 1850,1800, Pruss, 1830 afka: 1908, 1912, 1916, 1903, 1911 lantern: 1870, 1880, 1930, Depression, railroad Edison: 1920,1920,1900, 1908, 1880 Confederate: 1863, 1864, 1861, 1862, 1870 1861: 1861, 1863, 1860, 1864, 1870 | The input strings are partial or full names of historical figures as well as years and terms relating to historical events. The mappings associate each input with years or terms relevant to their historical significance, reflecting events or periods related to the input entity. |
| Pythia 6.9B 16.1 | inhib: inhibition, inhib, Inhib, inhibiting, inhibit resil: resilience, resistance,Resp, res,resistance toler: toler, tolerance, tolerate, tolerated, tolerant aggrav: aggrav, exacerb, help, assistance : response, responses, responding, inhibiting destructive: destructive, destruction, destroying salvage: saving, save,saving,save, saves reluct: reluctance, resistance, resisting, resist prophyl: protection, protective, Protection Relief: relief, Relief, relie, relieved, relieve surv: survival, Survival, protection, surviv | The input strings are truncated forms of words, often found in contexts related to protection, resistance, or functionality. The mappings primarily expand truncated forms into semantically related words, often the full form of the input string or related words. |
| Pythia 6.9B 16.11 | weeks: months, month, summer, season, year months: year,year, Year,Year, yearly month: year, Year,year,Year, yearly Month: year, Year,year,Year, years weeks: month, months,month,months, summer months: year, Year,year,Year, yearly Week: months, month,months,month, Month week: month, months,month,months, season month: year, Year,year,Year, yearly overnight: month, week, weeks,acci, months years: decade, decades, aging, century, life | The input strings are related to time periods such as weeks, months, and years. Mappings are connecting input strings to related or hierarchical time concepts, often extending them into longer periods like months to years and weeks to months. |
| Pythia 6.9B 22.13 | periodontal: dental, Dental, dentist, dent, periodontal mandibular: dental, Dental, mandibular, teeth, dentist odontic: dental, Dental, dentist, teeth, tooth psori: skin, Skin,skin, dermat, skins retinal: eye, ophthal, retinal, ocular, eyes echocardiography: cardiac, Card, hearts,Card, Cardi scalp: brain, Brain,brain, brains, scalp hippocampal: hippocampal, Brain, brain,brain, hippocampus ocardi: cardiac, Card, hearts, Heart, heart ACL: knee, knees, thigh, Hip, ankle caries: dental, Dental, dentist, dent, Dent | The input strings seem to relate to various medical and anatomical terms, including parts of the body, diseases, and medical procedures. The mappings primarily associate anatomical or medical terms (input strings) with related medical terminology, such as conditions, associated body parts, or broader medical categories. |
| GPT-2 xl 26.2 | Jedi: lightsaber, Jedi, Kenobi, droid, Skywalker lightsaber: lightsaber, Jedi, Kenobi, Skywalker, Sith galactic: Galactic, galactic, starship, galaxy, droid Starfleet: galactic, Starfleet, starship, Galactic, interstellar Klingon: starship, Starfleet, Klingon, Trek, Starship starship: starship, Galactic, galactic, interstellar, Planetary Skyrim: Skyrim, Magicka, Bethesda, Elven, Hearth Darth: Jedi, lightsaber, Kenobi, Darth, Sith galaxy: Galactic, galactic, starship, galaxy, droid | The input strings are terms related to popular science fiction and fantasy franchises such as Star Wars, Star Trek, Pokémon, Elder Scrolls, Harry Potter, and general fantastical terms. The pattern observed is that each mapping takes an input term from a science fiction or fantasy context and maps it to other terms that are often from the same or related fictional universe. |
Table 25: Example salient operations of attention heads in Pythia 6.9B and GPT-2 xl and their corresponding descriptions by GPT-4o.
| Below you are given a list of input strings, and a list of mappings: each mapping is between an input string and a list of 5 strings. |
| --- |
| Mappings are provided in the format "s: t1, t2, t3, t4, t5" where each of s, t1, t2, t3, t4, t5 is a short string, typically corresponding to a single word or a sub-word. |
| Your goal is to describe shortly and simply the inputs and the function that produces these mappings. To perform the task, look for semantic and textual patterns. |
| For example, input tokens "water", "ice", "freeze" are water-related, and a mapping ("fire": "f") is from a word to its first letter. |
| As a final response, suggest the most clear patterns observed or indicate that no clear pattern is visible (write only the word "Unclear"). |
| Your response should be a valid json, with the following keys: |
| "Reasoning": your reasoning. |
| "Input strings": One sentence describing the input strings (or "Unclear"). |
| "Observed pattern": One sentence describing the most clear patterns observed (or "Unclear"). |
| The input strings are: |
| <input strings> |
| The mappings are: |
| <mapping strings> |
Table 26: The prompt used to query GPT-4o. The salient tokens and mappings (§ 3.2), which are unique for every head, are plugged instead of <input strings> and <mapping strings>.
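Plugging a head's salient tokens and mappings into the template can be sketched as follows. `TEMPLATE` below abbreviates only the tail of the Table 26 prompt (the instruction text precedes it), and `build_prompt` is a hypothetical helper:

```python
# Abbreviated tail of the Table 26 template; the full instructions come first.
TEMPLATE = (
    "The input strings are:\n<input strings>\n"
    "The mappings are:\n<mapping strings>"
)

def build_prompt(template, input_strings, mappings):
    """Fill the template placeholders. `mappings` is a list of
    (source, [t1..t5]) pairs, rendered as 's: t1, t2, t3, t4, t5'."""
    mapping_lines = "\n".join(f"{s}: {', '.join(ts)}" for s, ts in mappings)
    return (template
            .replace("<input strings>", "\n".join(input_strings))
            .replace("<mapping strings>", mapping_lines))

prompt = build_prompt(TEMPLATE, ["fire"],
                      [("fire", ["f", "F", "fi", "fir", "fire"])])
print(prompt)
```

Each head yields its own `<input strings>` and `<mapping strings>` (§ 3.2), so one such prompt is issued per head.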
Examples
Table 25 provides examples of salient mappings and the patterns described by GPT-4o for five attention heads in Pythia 6.9B and GPT-2 xl.
E.2 Interpretation Quality
To assess the accuracy and plausibility of the model-generated descriptions, we let human annotators (five graduate students who are fluent English speakers) evaluate its responses in terms of (a) whether GPT-4o correctly recognized the existence of a pattern in the mappings, (b) the quality of the generated descriptions, and (c) the category of the recognized patterns. We conduct this study for a random sample of 138 (13.5%) heads in Pythia 6.9B and 134 (11.2%) heads in GPT-2 xl.
Annotation instructions
We present the instructions given to the human annotators in Figures 7, 8.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Instructions: GPT4 Pattern Identification Task
### Overview
The image presents instructions for a task designed to evaluate GPT4's ability to infer relations or functions from a set of demonstrations. The task involves analyzing input strings and their mappings, identifying patterns, and then answering multiple-choice questions to assess the agreement between human assessment and GPT4's description of the patterns.
### Components/Axes
* **Title:** Instructions
* **Introduction:** Explains the goal of the task: to verify the correctness of GPT4 in inferring a relation or function from a list of demonstrations.
* **Given Information:**
* A list of 30 demonstrations of a function mapping an input string to a list of 5 strings, formatted as "s: t1, t2, t3, t4, t5".
* A description generated by GPT4 of patterns identified across the input strings and their mappings.
* **Task Instructions:**
* Analyze input strings and mappings to identify prominent patterns (semantic, language-related, general, or unnatural).
* Answer multiple-choice questions to indicate agreement with GPT4's description.
* **Questions:**
* **Q1:** Did GPT4 correctly identify the presence or lack of a pattern? (4 options)
* 1: There is no observable pattern, and GPT4 indicated there is no pattern.
* 2: There is no observable pattern, but GPT4 described a pattern.
* 3: There is an observable pattern, and GPT4 indicated there is no pattern.
* 4: There is an observable pattern, and GPT4 described a pattern.
* **Q2:** (Answer only if Q1 is 4) How precise is the description of GPT4? (4 options)
* Correct and accurate: The description accurately describes the pattern, without errors.
* Correct but inaccurate: The description is correct overall, but too general/abstract or too specific/explicit.
* Partially correct: The description describes the correct pattern to some degree, but includes incorrect parts.
* Poor: The description does not describe the pattern at all.
* **Q3:** (Answer only if Q1 is 3 or 4) How would you categorize the most prominent pattern? (4 options)
* Semantic
* Language
* General
* Unnatural
### Detailed Analysis or ### Content Details
The instructions outline a process for evaluating GPT4's pattern recognition capabilities. The task is broken down into distinct steps: receiving input data (demonstrations and GPT4's description), analyzing the data for patterns, and providing feedback via multiple-choice questions. The questions are designed to assess both the accuracy and precision of GPT4's pattern identification.
### Key Observations
* The task focuses on evaluating GPT4's ability to identify various types of patterns: semantic, language-related, general, and unnatural.
* The multiple-choice questions provide a structured way to assess the quality of GPT4's pattern descriptions.
* The instructions emphasize the importance of comparing human assessment with GPT4's output.
### Interpretation
The instructions describe a human-in-the-loop evaluation of GPT4's ability to identify and describe patterns in data. The task aims to quantify the accuracy and precision of GPT4's pattern recognition, and to categorize the types of patterns it can successfully identify. The results of this evaluation could be used to improve GPT4's pattern recognition capabilities or to better understand its strengths and weaknesses in this area. The task is designed to be subjective, relying on human judgment to assess the quality of GPT4's descriptions.
</details>
Figure 7: First part of human annotation instructions.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Document: Important Guidelines
### Overview
The image presents a set of guidelines, likely for evaluating or categorizing patterns in data mappings, particularly in the context of natural language processing or similar tasks. The guidelines cover aspects such as pattern recognition, description accuracy, and mapping categories.
### Components/Axes
The document is structured as a list of bullet points, each representing a specific guideline or instruction. There are no axes or scales in this image.
### Detailed Analysis or Content Details
Here's a transcription of the text, organized by bullet point:
* **Important guidelines:**
* In Q1, we consider that "GPT4 indicated there is no pattern" if it either responded with the word "Unclear", or explained that there is no pattern in a sentence.
* In cases where the description of the model includes suggestive commentary about the hidden motivation for the function represented in the mappings (in addition to an explicit explanation), the commentary should not be considered. An example for a description which includes commentary is "The mappings generally consist of repetitions or small variations of their corresponding input string's characters, *suggesting a pattern related to breaking down or rearranging the input string*".
* We consider a pattern *recognizable* when it is apparent across 20 or more mappings. We require that *at least one* of the following will hold:
* The functionality behind the mappings (of input to output strings) will be visible and clear - for example, mappings of words to their first letters.
* The destination strings will be highly related to each other - for example, cases where all the source strings are mapped to numbers.
* In cases where there is a mutual pattern encompassing *only* the source strings, we do not consider this as a recognizable pattern.
* In Q2 we use the terms *correct* and *accurate* to label the descriptions. *Correct* descriptions describe the mappings and do not include incorrect parts. *Correct* descriptions might be *accurate* or *inaccurate*. The *inaccuracy* metric refers to whether the descriptions are too general (or too specific).
* In Q3, the different mapping categories are:
* *Semantic* - the mapping encodes semantic associations of the input strings (which might require knowledge). For example, associating countries with their capitals or languages.
* *Language* - the mapping encodes a relationship which requires language knowledge (e.g. syntactic or lexical expertise) relationship. For example, mapping words to prefixes, or nouns to pronouns.
* *General* - the mapping encodes a general functionality, which naturally can be applied to a large subset of strings. For example, mapping a string to itself, or a number to its successor/predecessor.
* *Unnatural* - the mapping *does not* encode a recognizable/understandable function or relation, one that might be used for natural language processing (see examples of unnatural patterns in *the examples spreadsheet*).
* Please use the Notes column to add any information, insight or problem you find relevant.
### Key Observations
* The guidelines emphasize the importance of clear and explicit descriptions of patterns.
* Recognizability is linked to the frequency of a pattern's occurrence (at least 20 mappings).
* The document defines specific categories for classifying mappings: Semantic, Language, General, and Unnatural.
* The guidelines refer to external resources, such as "the examples spreadsheet".
### Interpretation
The document provides a framework for evaluating and categorizing patterns in data mappings. The guidelines aim to ensure consistency and accuracy in the description and classification of these patterns. The distinction between different mapping categories (Semantic, Language, General, Unnatural) suggests a hierarchical approach to understanding the relationships between input and output strings. The reference to "GPT4" and "natural language processing" indicates that these guidelines are likely used in the context of evaluating the performance or behavior of AI models.
</details>
Figure 8: Second part of human annotation instructions.
Human study results
The overall results per question and the distribution of responses across models and layers are presented in Figure 9 (Question 1), Figure 10 (Question 2), and Figure 11 (Question 3). In 80% of the cases, GPT-4o correctly identifies the presence or absence of a pattern. In most of the failure cases (87%), the model described a pattern that is not visible in the mappings. We also find that lower layers exhibit fewer patterns and that these are harder to parse: they show higher rates of unnatural patterns and inaccurate descriptions, which agrees with our findings in § 4. When a pattern is observable, GPT-4o tends to identify it: for 95% of heads with observable patterns, GPT-4o described a pattern, and $<$ 2% of the descriptions were labeled "poor". Overall, this analysis shows that the quality of our automatic annotation pipeline is reasonable and demonstrates promising trends in automatically interpreting attention heads with MAPS. We leave further improvements to the pipeline for future work. In particular, addressing model hallucinations could involve aggregating multiple model responses to check the model's confidence (Kuhn et al., 2023), using intrinsic classifiers for hallucinations (e.g., Azaria and Mitchell, 2023; Yu et al., 2024), employing a strong LLM to indicate whether the generated pattern matches the mappings (Gur-Arieh et al., 2025), using an NLI model (Bohnet et al., 2022), or similarity-based heuristics.
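As one concrete instance of the response-aggregation direction mentioned above, a minimal self-consistency check keeps a description only when repeated samples agree on it. This is an exact-match simplification for illustration; a practical version would compare descriptions by semantic similarity rather than string equality:

```python
from collections import Counter

def majority_description(responses, min_agreement=0.5):
    """Keep a sampled description only when at least `min_agreement` of the
    responses agree on it exactly; otherwise report no confident description."""
    desc, count = Counter(responses).most_common(1)[0]
    return desc if count / len(responses) >= min_agreement else None

print(majority_description(["word to first letter",
                            "word to first letter",
                            "Unclear"]))  # word to first letter
```

Heads whose sampled descriptions disagree would then be flagged for manual inspection rather than auto-annotated.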
<details>
<summary>x15.png Details</summary>

### Visual Description
## Pie Chart: GPT4 Pattern Identification Accuracy
### Overview
The image is a pie chart that visualizes the accuracy of GPT4 in identifying the presence or absence of patterns. The chart breaks down the results into four categories, each representing a different combination of actual pattern presence and GPT4's identification.
### Components/Axes
* **Title:** Q1 Did GPT4 correctly identify the presence or lack of a pattern?
* **Legend (Top-Left):**
* Dark Green: There is an observable pattern, and GPT4 described a pattern.
* Lime Green: There is no observable pattern, and GPT4 indicated there is no pattern.
* Red: There is no observable pattern, but GPT4 described a pattern.
* Dark Red: There is an observable pattern, and GPT4 indicated there is no pattern.
* **Pie Chart Slices:**
* Dark Green: 46.3%
* Lime Green: 33.5%
* Red: 17.6%
* Dark Red: 2.6%
### Detailed Analysis
* **Dark Green Slice:** Represents instances where there was an observable pattern, and GPT4 correctly identified and described it. This slice occupies 46.3% of the pie chart.
* **Lime Green Slice:** Represents instances where there was no observable pattern, and GPT4 correctly indicated the absence of a pattern. This slice occupies 33.5% of the pie chart.
* **Red Slice:** Represents instances where there was no observable pattern, but GPT4 incorrectly described a pattern. This slice occupies 17.6% of the pie chart.
* **Dark Red Slice:** Represents instances where there was an observable pattern, but GPT4 incorrectly indicated the absence of a pattern. This slice occupies 2.6% of the pie chart.
### Key Observations
* GPT4 correctly identified patterns (or lack thereof) in the majority of cases (46.3% + 33.5% = 79.8%).
* GPT4 was more likely to incorrectly identify a pattern when none existed (17.6%) than to miss a pattern that was present (2.6%).
### Interpretation
The pie chart suggests that GPT4 is generally accurate in identifying patterns. However, it is more prone to false positives (identifying patterns where none exist) than false negatives (missing existing patterns). This could indicate a bias in the model towards finding patterns, even when they are not truly present. The high percentage of correct identifications (79.8%) suggests that GPT4 is a useful tool for pattern recognition, but its tendency towards false positives should be considered when interpreting its results.
</details>
(a) Human annotation distribution for Question 1.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Stacked Bar Chart: GPT-2 xl Head Distribution by Layer Bin
### Overview
The image is a stacked bar chart showing the distribution of attention heads in a GPT-2 XL model across different layer bins. The x-axis represents the layer bins, grouped into ranges of 12 layers each, and the y-axis represents the number of heads. Each bar is segmented into colored sections (shades of green and red), representing different categories of attention heads. The percentage of each category within each layer bin is labeled on the bar segments.
### Components/Axes
* **Title:** GPT-2 xl
* **X-axis:** Layer_Bin, with categories \[0, 12), \[12, 24), \[24, 36), \[36, 48)
* **Y-axis:** # heads, ranging from 0 to 35. The scale has no explicit markings, but we can infer the approximate values.
* **Colors:**
* Dark Green: Represents the first category of attention heads.
* Light Green: Represents the second category of attention heads.
* Dark Red: Represents the third category of attention heads.
### Detailed Analysis
Here's a breakdown of the data for each layer bin:
* **Layer Bin \[0, 12):**
* Dark Green: 0.0%
* Light Green: 66.7%
* Dark Red: 12.1%
* Implied total height: 0 + 66.7 + 12.1 = 78.8%. The remaining 21.2% is not accounted for.
* **Layer Bin \[12, 24):**
* Dark Green: 21.2%
* Light Green: 48.5%
* Dark Red: 12.1%
* Small Red: 3.0%
* Implied total height: 21.2 + 48.5 + 12.1 + 3.0 = 84.8%. The remaining 15.2% is not accounted for.
* **Layer Bin \[24, 36):**
* Dark Green: 36.4%
* Light Green: 79.4%
* Dark Red: 8.8%
* Small Red: 2.9%
* Implied total height: 36.4 + 79.4 + 8.8 + 2.9 = 127.5%, which exceeds 100% and indicates an error or misreading in the reported percentages.
* **Layer Bin \[36, 48):**
* Dark Green: 55.9%
* Light Green: 26.5%
* Dark Red: 17.6%
* Implied total height: 55.9 + 26.5 + 17.6 = 100%.
### Key Observations
* The proportion of the first category (dark green) generally increases as the layer bin increases, except for the first bin.
* The proportion of the second category (light green) decreases from the first bin to the last bin.
* The proportion of the third category (dark red) varies across the layer bins, with the highest proportion in the last bin.
* The total percentages for the first three bins do not add up to 100%, indicating missing data or calculation errors.
### Interpretation
The chart illustrates how the distribution of attention heads changes across the layers of the GPT-2 XL model. The increasing proportion of the first category (dark green) in later layers might suggest that these heads become more important or specialized as the model processes information through deeper layers. Conversely, the decreasing proportion of the second category (light green) might indicate that these heads are more relevant in earlier layers. The varying proportion of the third category (dark red) suggests that these heads play a more dynamic role across different layers.
The data inconsistencies (percentages not adding up to 100%) raise questions about the accuracy or completeness of the data. Further investigation is needed to understand the missing information and ensure the reliability of the analysis.
</details>
(b) Human annotation distribution for Question 1 across layers (GPT-2 xl).
<details>
<summary>x17.png Details</summary>

### Visual Description
## Stacked Bar Chart: Pythia 6.9B
### Overview
The image is a stacked bar chart titled "Pythia 6.9B". It displays the distribution of attention heads across different layers of the Pythia 6.9B model. The x-axis represents the layer, divided into four categories: "[0, 8)", "[8, 16)", "[16, 24)", and "[24, 32)". The y-axis represents the number of heads. Each bar is segmented into four colored sections (dark green, light green, red, and dark red), each representing a proportion of the total number of heads for that layer. The percentages for each segment are labeled on the chart.
### Components/Axes
* **Title:** Pythia 6.9B
* **X-axis:** Layer, with categories: \[0, 8), \[8, 16), \[16, 24), \[24, 32)
* **Y-axis:** # heads, ranging from 0 to 40.
* **Bar Colors (stacking order, implied legend):**
* Dark Green: Bottom segment of each bar.
* Light Green: Second segment from the bottom.
* Red: Third segment from the bottom.
* Dark Red: Top segment of each bar.
### Detailed Analysis
* **Layer \[0, 8):**
* Dark Green: 31.4% (approximately 11 heads)
* Light Green: 48.6% (approximately 17 heads)
* Red: 5.7% (approximately 2 heads)
* Dark Red: 14.3% (approximately 5 heads)
* **Layer \[8, 16):**
* Dark Green: 40.5% (approximately 15 heads)
* Light Green: 21.6% (approximately 8 heads)
* Red: 2.7% (approximately 1 head)
* Dark Red: 35.1% (approximately 13 heads)
* **Layer \[16, 24):**
* Dark Green: 68.0% (approximately 17 heads)
* Light Green: 12.0% (approximately 3 heads)
* Red: 0.0% (approximately 0 heads)
* Dark Red: 20.0% (approximately 5 heads)
* **Layer \[24, 32):**
* Dark Green: 43.9% (approximately 17 heads)
* Light Green: 31.7% (approximately 12 heads)
* Red: 4.9% (approximately 2 heads)
* Dark Red: 19.5% (approximately 8 heads)
### Key Observations
* The dark green (bottom) segment is most prominent in the \[16, 24) interval, at 68.0%.
* The dark red (top) segment is most prominent in the \[8, 16) interval, at 35.1%.
* The red (third) segment is smallest in the \[8, 16) and \[16, 24) intervals.
* The light green (second) segment is most prominent in the \[0, 8) interval, at 48.6%.
### Interpretation
The stacked bar chart visualizes how the four segments are distributed across different layer intervals of the Pythia 6.9B model. The proportions vary with depth: the bottom (dark green) segment dominates the middle interval \[16, 24), while the top (dark red) segment is most prominent in \[8, 16). The red segment is consistently the smallest, suggesting that the category it represents is rare across layers. This distribution likely reflects how the annotated categories of attention heads are spread across the model's depth.
</details>
(c) Human annotation distribution for Question 1 across layers (Pythia 6.9B).
Figure 9: Quality of GPT-4o interpretation (§ E) - Human annotation distribution for Question 1.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Pie Chart: Accuracy of GPT4 Description
### Overview
The image is a pie chart showing the distribution of responses to the question "How accurate is the description of GPT4?". The question is labeled "Q2" and includes the condition "(answer only if your answer to Q1 is 4)". The pie chart is divided into four categories: "Correct and accurate", "Partially correct", "Correct but inaccurate", and "Poor". Each category is represented by a different color and percentage.
### Components/Axes
* **Title:** Q2 (answer only if your answer to Q1 is 4) How accurate is the description of GPT4?
* **Categories:**
* Correct and accurate (Green) - 66.4%
* Partially correct (Orange-Yellow) - 16.8%
* Correct but inaccurate (Orange) - 15.2%
* Poor (Red) - 1.6%
### Detailed Analysis
The pie chart is divided into four sections, each representing a different level of accuracy in the description of GPT4.
* **Correct and accurate (Green):** This category represents the largest portion of the pie chart, with 66.4% of the responses indicating that the description of GPT4 is both correct and accurate.
* **Partially correct (Orange-Yellow):** This category represents 16.8% of the responses, indicating that the description of GPT4 is partially correct.
* **Correct but inaccurate (Orange):** This category represents 15.2% of the responses, indicating that the description of GPT4 is correct but inaccurate.
* **Poor (Red):** This category represents the smallest portion of the pie chart, with only 1.6% of the responses indicating that the description of GPT4 is poor.
### Key Observations
* The vast majority of respondents (66.4%) believe the description of GPT4 is "Correct and accurate".
* A significant portion of respondents (16.8%) find the description "Partially correct".
* Only a small fraction of respondents (1.6%) consider the description "Poor".
### Interpretation
The pie chart suggests that the majority of people who answered "4" to question 1 find the description of GPT4 to be correct and accurate. There is a notable portion who find it partially correct, while very few consider it poor. This indicates a generally positive perception of the accuracy of GPT4's description among the surveyed group. The condition that the answer to Q1 must be 4 adds a layer of context, suggesting that this group may have a specific understanding or experience with GPT4 that influences their perception of its description.
</details>
(a) Human annotation distribution for Question 2.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Stacked Bar Chart: GPT-2 xl Head Distribution Across Layers
### Overview
The image is a stacked bar chart visualizing the distribution of attention heads in the GPT-2 xl model across different layer groups. The x-axis represents the layer groups, and the y-axis represents the number of heads. Each bar is segmented into colored sections, each representing a different head type, with percentages indicating the proportion of each head type within that layer group.
### Components/Axes
* **Title:** GPT-2 xl
* **X-axis:** Layer, with categories \[0, 12), \[12, 24), \[24, 36), \[36, 48)
* **Y-axis:** # heads, with a scale from 0 to 25 in increments of 5.
* **Bar Segments:** Each bar is divided into sections, each with a percentage label. The colors are green, orange, and red.
### Detailed Analysis
The chart displays the distribution of attention heads across four layer groups. The height of each segment represents the number of heads of that type within the layer group.
* **Layer \[0, 12):**
* Green: 28.6%
* Orange: 42.9%
* Red: 28.6%
* Total height of the bar is approximately 7 heads.
* **Layer \[12, 24):**
* Green: 58.3%
* Orange: 25.0%
* Red: 8.3%
* Total height of the bar is approximately 12 heads.
* **Layer \[24, 36):**
* Green: 74.1%
* Orange: 7.4%
* Red: 18.5%
* Total height of the bar is approximately 27 heads.
* **Layer \[36, 48):**
* Green: 78.9%
* Orange: 10.5%
* Red: 10.5%
* Total height of the bar is approximately 20 heads.
### Key Observations
* The number of heads increases from layer group \[0, 12) to \[24, 36), then decreases slightly in \[36, 48).
* The proportion of green heads (likely representing a specific type of attention) increases significantly from \[0, 12) to \[36, 48).
* The proportion of orange heads decreases from \[0, 12) to \[24, 36), then increases slightly in \[36, 48).
* The proportion of red heads decreases from \[0, 12) to \[12, 24), increases in \[24, 36), then decreases in \[36, 48).
### Interpretation
The chart illustrates how the distribution of different types of attention heads changes across the layers of the GPT-2 xl model. The increasing proportion of green heads in later layers suggests that this type of attention may become more important as the model processes information through deeper layers. The variations in the proportions of orange and red heads may reflect different roles or specializations of these attention types within the model. The peak in the total number of heads in layer group \[24, 36) could indicate a critical processing stage within the model's architecture.
</details>
(b) Human annotation distribution for Question 2 across layers (GPT-2 xl).
<details>
<summary>x20.png Details</summary>

### Visual Description
## Stacked Bar Chart: Pythia 6.9B
### Overview
The image is a stacked bar chart titled "Pythia 6.9B". The chart displays the distribution of "# heads" across different "Layer" intervals. Each bar represents a layer interval, and the bar is segmented into colored sections, each representing a percentage. The y-axis represents the number of heads, ranging from 0 to 15. The x-axis represents the layer intervals: [0, 8), [8, 16), [16, 24), and [24, 32). The colors within each bar represent different categories, but the legend is missing, so the meaning of the colors is unknown.
### Components/Axes
* **Title:** Pythia 6.9B
* **Y-axis Label:** # heads
* **Y-axis Scale:** 0, 5, 10, 15
* **X-axis Label:** Layer
* **X-axis Categories:** [0, 8), [8, 16), [16, 24), [24, 32)
* **Colors:** Green, Orange, Yellow, Red (The meaning of these colors is not specified in the image)
### Detailed Analysis
The chart presents data for four layer intervals: [0, 8), [8, 16), [16, 24), and [24, 32). Each interval's bar is divided into colored segments, with percentages indicating the proportion of each segment.
* **Layer [0, 8):**
* Green: 45.5%
* Orange: 27.3%
* Yellow: 27.3%
* Red: 0.0%
* **Layer [8, 16):**
* Green: 46.7%
* Orange: 33.3%
* Yellow: 20.0%
* Red: 0.0%
* **Layer [16, 24):**
* Green: 75.0%
* Orange: 12.5%
* Yellow: 6.2%
* Red: 6.2%
* **Layer [24, 32):**
* Green: 83.3%
* Orange: 5.6%
* Yellow: 11.1%
* Red: 0.0%
### Key Observations
* The green segment increases significantly from layer [0, 8) to [24, 32), indicating a higher proportion in the later layers.
* The orange segment decreases from layer [8, 16) to [24, 32).
* The yellow segment decreases from layer [0, 8) to [16, 24) and then increases slightly in [24, 32).
* The red segment is only present in layer [16, 24).
### Interpretation
The stacked bar chart illustrates the distribution of "# heads" across different layers of the Pythia 6.9B model. The increasing proportion of the green segment suggests that the component represented by green becomes more dominant in later layers. The presence of the red segment only in layer [16, 24) indicates that the component represented by red is specific to that layer interval. Without a legend, the exact meaning of each color is unknown, but the chart clearly shows how the composition of "# heads" changes across different layers.
</details>
(c) Human annotation distribution for Question 2 across layers (Pythia 6.9B).
Figure 10: Quality of GPT-4o interpretation (§ E) - Human annotation distribution for Question 2.
<details>
<summary>x21.png Details</summary>

### Visual Description
## Pie Chart: Categorization of Prominent Pattern
### Overview
The image is a pie chart presenting the results of a survey question (Q3) asking respondents to categorize the most prominent pattern. The question was only asked if the respondent's answer to question Q1 was 3 or 4. The pie chart is divided into four categories: Semantic, Language, General, and Unnatural, with each slice representing the percentage of respondents who selected that category.
### Components/Axes
* **Title:** Q3 (answer only if your answer to Q1 is 3 or 4) How would you categorise the most prominent pattern?
* **Categories:**
* 1: Semantic (teal)
* 2: Language (yellow)
* 3: General (light purple)
* 4: Unnatural (light gray)
* **Percentages:** Each slice is labeled with its corresponding percentage.
### Detailed Analysis
* **1: Semantic (teal):** 31.1% - Located in the top-right quadrant of the pie chart.
* **2: Language (yellow):** 21.2% - Located in the bottom-left quadrant of the pie chart.
* **3: General (light purple):** 28.8% - Located in the top-left quadrant of the pie chart.
* **4: Unnatural (light gray):** 18.9% - Located in the bottom-right quadrant of the pie chart.
### Key Observations
* The "Semantic" category has the highest percentage (31.1%), indicating it was the most frequently selected category.
* The "Unnatural" category has the lowest percentage (18.9%).
* The "General" category is the second most frequent with 28.8%.
* The "Language" category is the third most frequent with 21.2%.
### Interpretation
The pie chart illustrates the distribution of responses to the question of how respondents would categorize the most prominent pattern. The results suggest that "Semantic" patterns were the most commonly identified, followed by "General" patterns. "Unnatural" patterns were the least frequently identified. The data provides insight into how individuals perceive and categorize patterns, with a preference towards semantic interpretations. The fact that the question was conditional on the answer to Q1 being 3 or 4 suggests that the respondents may have already been primed to consider certain types of patterns.
</details>
(a) Human annotation distribution for Question 3.
<details>
<summary>x22.png Details</summary>

### Visual Description
## Stacked Bar Chart: GPT-2 xl Head Distribution by Layer
### Overview
The image is a stacked bar chart visualizing the distribution of attention heads across different layers in the GPT-2 xl model. The x-axis represents the layer groups, and the y-axis represents the number of heads. Each bar is segmented into colored sections, each representing a different type of attention head. The percentage of each head type within each layer group is labeled on the bar segments.
### Components/Axes
* **Title:** GPT-2 xl
* **X-axis:** Layer, with categories: \[0, 12), \[12, 24), \[24, 36), \[36, 48)
* **Y-axis:** # heads, with a scale from 0 to 25 in increments of 5.
* **Bar Segments (Colors and Approximate Values):**
* Teal: Represents the first segment of each bar.
* \[0, 12): 0.0%
* \[12, 24): 15.4%
* \[24, 36): 21.4%
* \[36, 48): 47.4%
* Light Blue: Represents the second segment of each bar.
* \[0, 12): 16.7%
* \[12, 24): 0.0%
* \[24, 36): 28.6%
* \[36, 48): 10.5%
* Yellow: Represents the third segment of each bar.
* \[0, 12): 33.3%
* \[12, 24): 53.8%
* \[24, 36): 46.4%
* \[36, 48): 31.6%
* Grey: Represents the fourth segment of each bar.
* \[0, 12): 50.0%
* \[12, 24): 30.8%
* \[24, 36): 3.6%
* \[36, 48): 10.5%
### Detailed Analysis
* **Layer \[0, 12):**
* Teal: 0.0%
* Light Blue: 16.7%
* Yellow: 33.3%
* Grey: 50.0%
* **Layer \[12, 24):**
* Teal: 15.4%
* Light Blue: 0.0%
* Yellow: 53.8%
* Grey: 30.8%
* **Layer \[24, 36):**
* Teal: 21.4%
* Light Blue: 28.6%
* Yellow: 46.4%
* Grey: 3.6%
* **Layer \[36, 48):**
* Teal: 47.4%
* Light Blue: 10.5%
* Yellow: 31.6%
* Grey: 10.5%
### Key Observations
* The distribution of head types varies significantly across the layers.
* The teal segment increases as the layer increases.
* The light blue segment is highest in the first layer group and then decreases.
* The yellow segment is highest in the second layer group.
* The grey segment is highest in the first layer group and then decreases.
### Interpretation
The stacked bar chart illustrates how the composition of attention heads changes across different layer groups in the GPT-2 xl model. The data suggests that different types of attention heads may be more prominent or specialized in certain layers. The increasing proportion of the teal segment in later layers could indicate a shift in the type of attention being utilized as the model processes information through its layers. The other segments decrease as the layer increases. The chart provides insights into the model's internal workings and how it distributes its attention mechanisms across its depth.
</details>
(b) Human annotation distribution for Question 3 across layers (GPT-2 xl).
<details>
<summary>x23.png Details</summary>

### Visual Description
## Stacked Bar Chart: Pythia 6.9B
### Overview
The image is a stacked bar chart titled "Pythia 6.9B". It visualizes the distribution of "# heads" across different "Layer" intervals. Each bar represents a layer interval, and the bar is segmented into colored sections, each representing a percentage of the total "# heads" for that layer.
### Components/Axes
* **Title:** Pythia 6.9B
* **Y-axis:** "# heads", with a scale from 0 to 20.
* **X-axis:** "Layer", with four categories: \[0, 8), \[8, 16), \[16, 24), and \[24, 32).
* **Bar Segments (from bottom to top):**
* Teal: Represents the lowest segment of each bar.
* Purple: Represents the second segment from the bottom.
* Yellow: Represents the third segment from the bottom.
* Gray: Represents the top segment of each bar.
### Detailed Analysis
Here's a breakdown of the percentage values for each layer interval:
* **Layer \[0, 8):**
* Teal: 30.8%
* Purple: 15.4%
* Yellow: 7.7%
* Gray: 46.2%
* **Layer \[8, 16):**
* Teal: 31.2%
* Purple: 25.0%
* Yellow: 31.2%
* Gray: 12.5%
* **Layer \[16, 24):**
* Teal: 41.2%
* Purple: 23.5%
* Yellow: 23.5%
* Gray: 11.8%
* **Layer \[24, 32):**
* Teal: 25.0%
* Purple: 35.0%
* Yellow: 15.0%
* Gray: 25.0%
### Key Observations
* The teal segment (bottom) is largest in the \[16, 24) layer, representing 41.2%.
* The gray segment (top) is largest in the \[0, 8) layer, representing 46.2%.
* The distribution of percentages across segments varies significantly between layers.
### Interpretation
The chart illustrates how the "# heads" are distributed across different layers in the Pythia 6.9B model. The varying percentages suggest that different layers may have different characteristics or roles within the model. For example, the \[0, 8) layer has a high percentage in the gray segment, while the \[16, 24) layer has a high percentage in the teal segment. This could indicate that certain types of heads are more prevalent in specific layers.
</details>
(c) Human annotation distribution for Question 3 across layers (Pythia 6.9B).
Figure 11: Quality of GPT-4o interpretation (§ E) - Human annotation distribution for Question 3.
Appendix F Analysis of Global Versus Specific Functionality
We observe that the mappings in $M$ provide a broad view of the head's functionality, particularly on how global the head's operation is. For example, a head that maps any token to an end-of-sequence token has global functionality, whereas heads that map countries to their capitals, colors to their complementary pairs, and so on, demonstrate specific operations. In this section, we use properties of $M$ to analyze how global the functionalities of attention heads in LLMs are.
Analysis
We estimate how global the functionality of a given head is using two metrics: input skewness, which captures the skewness of the head's operation towards specific inputs, and output space size, which estimates the number of tokens the head tends to output. For input skewness, we obtain the saliency scores $\sigma_{t}(W_{VO})\;\forall t\in\mathcal{V}$ according to the head (see § 3.2), and calculate the skewness of their distribution. For output space size, we compute for every token $s\in\mathcal{V}$ the highest-score token $t$ it is mapped into according to $M$: $t=\arg\max(\mathbf{m}_{s})$. Next, we define the output space size to be the portion of unique output tokens over the vocabulary. For instance, we expect the output space of a head that only maps strings to their first letters to be a small set of letter tokens. Similarly to the normalization of the saliency scores by the embedding norms, which we applied in § 3.2, here, when calculating $M$, we normalize the unembeddings ($U$'s columns).
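As a concrete illustration, the two metrics can be sketched as follows. This is a minimal pure-Python sketch, not the paper's implementation: `M` is assumed to be a dense score matrix whose rows are indexed by input tokens and columns by output tokens, and `saliency` is assumed to be the list of per-token scores $\sigma_{t}(W_{VO})$.

```python
def output_space_size(M):
    """Fraction of vocabulary tokens that appear as some input's top output.

    M[s][t] is the (hypothetical) score for mapping input token s to
    output token t; rows and columns both index the vocabulary.
    """
    vocab_size = len(M)
    # t = argmax(m_s) for every input token s, collected as a set.
    top_outputs = {max(range(vocab_size), key=lambda t: row[t]) for row in M}
    return len(top_outputs) / vocab_size

def input_skewness(saliency):
    """Fisher-Pearson skewness of the per-token saliency scores."""
    n = len(saliency)
    mean = sum(saliency) / n
    std = (sum((x - mean) ** 2 for x in saliency) / n) ** 0.5
    if std == 0:
        return 0.0  # a perfectly flat distribution has no skew
    return sum(((x - mean) / std) ** 3 for x in saliency) / n
```

For example, a head whose every row argmaxes to the same token has output space size $1/|\mathcal{V}|$, while an identity-like mapping has output space size 1.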
<details>
<summary>x24.png Details</summary>

### Visual Description
## Chart: Input Skewness and Output Space Size vs. Layer
### Overview
The image presents two line charts comparing "Input Skewness" and "Output Space Size" across layers for two different models: GPT2 xl and Pythia 6.9b. Each chart displays the variation of these two metrics as the layer number increases. The charts also include horizontal lines indicating "Global head" and "Specific head" levels.
### Components/Axes
**General Layout:**
* Two charts are positioned side-by-side. The left chart represents "GPT2 xl," and the right chart represents "Pythia 6.9b."
* Each chart has two y-axes: the left y-axis represents "Input Skewness," and the right y-axis represents "Output Space Size."
* The x-axis represents "layer" for both charts.
* A legend is located at the top of the image, indicating that the blue line represents "Input skewness" and the orange line represents "Output space size."
**Left Chart (GPT2 xl):**
* **Title:** GPT2 xl
* **X-axis:** "layer" ranging from 0 to 40 in increments of 20.
* **Left Y-axis:** "Input Skewness" ranging from 0.0 to 2.0 in increments of 0.5.
* **Right Y-axis:** "Output Space Size" ranging from 0.0 to 0.4 in increments of 0.1.
* **Horizontal Lines:**
* "Global head" at approximately 1.25 on the Input Skewness axis (approximately 0.25 on the Output Space Size axis).
* "Specific head" at 0.0 on the Input Skewness axis (0.0 on the Output Space Size axis).
**Right Chart (Pythia 6.9b):**
* **Title:** Pythia 6.9b
* **X-axis:** "layer" ranging from 0 to 30 in increments of 15.
* **Left Y-axis:** "Input Skewness" ranging from 0.0 to 2.0 in increments of 0.5.
* **Right Y-axis:** "Output Space Size" ranging from 0.0 to 0.4 in increments of 0.1.
* **Horizontal Lines:**
* "Global head" at approximately 0.8 on the Input Skewness axis (approximately 0.16 on the Output Space Size axis).
* "Specific head" at 0.0 on the Input Skewness axis (0.0 on the Output Space Size axis).
**Legend:**
* Located at the top of the image.
* Blue line: "Input skewness"
* Orange line: "Output space size"
### Detailed Analysis
**Left Chart (GPT2 xl):**
* **Input skewness (blue line):** Starts high (around 1.4), drops sharply to approximately 0.6 around layer 5, fluctuates between 0.5 and 1.0, and ends around 0.6 at layer 40.
* **Output space size (orange line):** Starts around 1.0, fluctuates between 0.4 and 1.1, and ends around 0.5 at layer 40.
**Right Chart (Pythia 6.9b):**
* **Input skewness (blue line):** Starts high (around 1.2), drops sharply to approximately 0.2 around layer 10, fluctuates slightly, and ends around 0.2 at layer 30.
* **Output space size (orange line):** Starts high (around 1.7), drops sharply to approximately 0.1 around layer 15, fluctuates slightly, and ends around 0.1 at layer 30.
### Key Observations
* Both models show a decrease in "Input skewness" and "Output space size" as the layer number increases.
* The "Input skewness" drops more sharply in Pythia 6.9b compared to GPT2 xl.
* The "Output space size" also decreases more sharply in Pythia 6.9b compared to GPT2 xl.
* The "Global head" level is higher for GPT2 xl compared to Pythia 6.9b.
### Interpretation
The charts illustrate how "Input Skewness" and "Output Space Size" change across different layers in the GPT2 xl and Pythia 6.9b models. The decreasing trends suggest that as the models process information through deeper layers, both the skewness of the input and the size of the output space tend to diminish. The difference in the "Global head" levels and the rate of decrease between the two models may indicate variations in their architectural design or training methodologies. The "Specific head" level at 0.0 likely represents a baseline or a point of reference for comparison. The shaded regions around the lines likely represent variance or uncertainty in the measurements.
</details>
Figure 12: Input skewness versus output space size for all attention heads per layer in Pythia 6.9B and GPT-2 xl, compared to baseline heads of global and specific functionalities. Lower input skewness indicates a larger input space.
Additionally, we present two baselines. The first baseline, dubbed "specific head", represents the output space size of a head that maps the entire vocabulary to one specific token (e.g., a head that always outputs the end-of-sequence token). The second baseline, called "global head", represents the output space size of a head that maps the entire vocabulary to capitalized tokens with leading spaces, a subset whose size is 25% of the vocabulary of GPT-2 xl and 16% of the vocabulary of Pythia 6.9B. An example of such a "global head" is a head that maps every word (or sub-word) in English to its capitalized version, and all other tokens to one specific token.
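Under these definitions, the two baseline output-space sizes reduce to simple fractions; the sketch below uses hypothetical vocabulary sizes (the counts are illustrative, not the models' actual ones), modeling the capitalized subset as an identity mapping on its tokens.

```python
# Hypothetical sizes for illustration; real vocabularies are much larger.
VOCAB = 1000
CAPITALIZED = 250  # ~25% of the vocabulary, as reported for GPT-2 xl

# "Specific head": every input token maps to one fixed output token,
# so exactly one unique output token appears.
specific_outputs = [0] * VOCAB
specific_size = len(set(specific_outputs)) / VOCAB  # 1 / 1000

# "Global head": capitalized tokens map within the capitalized subset
# (modeled here as identity on that subset); all other tokens map to
# one fixed token.
global_outputs = [t if t < CAPITALIZED else 0 for t in range(VOCAB)]
global_size = len(set(global_outputs)) / VOCAB  # 250 / 1000 = 0.25
```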
Results
Figure 12 shows the input skewness and output space sizes for all heads in Pythia 6.9B and GPT-2 xl. In both models, the input skewness rises and then sharply decreases in the early layers, after which it stabilizes. This implies that, compared to later layers, attention heads in shallower layers exert a salient effect on a more specific set of inputs. In contrast, the output space size generally decreases across layers, with a slight increase in the final layers, suggesting that head outputs converge to smaller token subsets as depth increases. Taken together, we hypothesize that early-layer heads demonstrate their functionality on fewer inputs than deeper heads, which in turn map a larger set of possible inputs to a small set of outputs.
Appendix G Resources and Packages
In our experiments, we used models and code from the transformers (Wolf et al., 2020) and TransformerLens (Nanda and Bloom, 2022) packages, and nanoGPT (https://github.com/karpathy/nanoGPT). All experiments were conducted using a single A100 80GB or H100 80GB GPU, aside from the experiments studying Llama-3.1 70B, which used nodes with 8 of these GPUs.