# Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
Abstract
Despite their impressive capabilities, large language models (LLMs) frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question-to-answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. We first validate and disentangle these pathways through attention knockout and token patching. We then uncover notable properties of the two mechanisms: (1) they are closely associated with LLM knowledge boundaries, and (2) internal representations are aware of their distinction. Finally, building on these findings, we propose two applications that enhance hallucination detection performance. Overall, our work provides new insight into how LLMs internally encode truthfulness, offering directions for more reliable and self-aware generative systems.
Wen Luo $\heartsuit$, Guangyue Peng $\heartsuit$, Wei Li $\heartsuit$, Shaohang Wei $\heartsuit$, Feifan Song $\heartsuit$, Liang Wang $\spadesuit$, Nan Yang $\spadesuit$, Xingxing Zhang $\spadesuit$, Jing Jin $\heartsuit$, Furu Wei $\spadesuit$, Houfeng Wang $\heartsuit$ (corresponding author)
$\heartsuit$ State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; $\spadesuit$ Microsoft Research Asia
1 Introduction
Despite their remarkable capabilities in natural language understanding and generation, large language models (LLMs) often produce hallucinations: outputs that appear plausible but are factually incorrect. This phenomenon poses a critical challenge for deploying LLMs in real-world applications where reliability and trustworthiness are paramount (Shi et al., 2024; Bai et al., 2024). One line of research tackles hallucination detection from an extrinsic perspective (Min et al., 2023; Hu et al., 2025; Huang et al., 2025), evaluating only the model's outputs while disregarding its internal dynamics. Although such approaches can identify surface-level textual inconsistencies, their extrinsic focus limits the insight they offer into the underlying causes of hallucinations. Complementing these efforts, another line of work investigates the intrinsic properties of LLMs, revealing that their internal representations encode rich truthfulness signals (Burns et al., 2023; Li et al., 2023; Chen et al., 2024; Orgad et al., 2025; Niu et al., 2025). These internal truthfulness signals can be exploited to detect an LLM's own generative hallucinations by training a linear classifier (i.e., a probe) on its hidden representations. However, while prior work establishes the presence of such cues, the mechanisms by which they arise and operate remain largely unexplored. Recent studies indicate well-established mechanisms in LLMs that underpin complex capabilities such as in-context learning (Wang et al., 2023), long-context retrieval (Wu et al., 2025), and reasoning (Qian et al., 2025). This observation naturally leads to a key question: how do truthfulness cues arise and function within LLMs?
In this paper, we uncover that truthfulness signals in LLMs arise from two distinct information pathways: (1) a Question-Anchored (Q-Anchored) pathway, which depends on the flow of information from the input question to the generated answer, and (2) an Answer-Anchored (A-Anchored) pathway, which derives self-contained evidence directly from the model's own outputs. We begin with a preliminary study using saliency analysis to quantify information flow potentially relevant to hallucination detection. Results reveal a bimodal distribution of dependency on question-answer interactions, suggesting heterogeneous truthfulness encoding mechanisms. To validate this hypothesis, we design two experiments across 4 diverse datasets using 12 models that vary in both architecture and scale, including base, instruction-tuned, and reasoning-oriented models. By (i) blocking critical question-to-answer information flow through attention knockout (Geva et al., 2023; Fierro et al., 2025) and (ii) injecting hallucinatory cues into questions via token patching (Ghandeharioun et al., 2024; Todd et al., 2024), we disentangle these truthfulness pathways. Our analyses confirm that Q-Anchored signals rely heavily on question-derived cues, whereas A-Anchored signals are robust to their removal and primarily originate from the generated answer itself.
Building on this foundation, we further investigate emergent properties of these truthfulness pathways through large-scale experiments. Our findings highlight two intriguing characteristics: (1) Association with knowledge boundaries: Q-Anchored encoding predominates for well-established facts that fall within the knowledge boundary, whereas A-Anchored encoding is favored in long-tail cases. (2) Self-awareness: LLM internal states can distinguish which mechanism is being employed, suggesting intrinsic awareness of pathway distinctions.
Finally, these analyses not only deepen our mechanistic understanding of hallucinations but also enable practical applications. Specifically, by leveraging the fundamentally different dependencies of the truthfulness pathways and the modelâs intrinsic awareness, we propose two pathway-aware strategies to enhance hallucination detection. (1) Mixture-of-Probes (MoP): Motivated by the specialization of internal pathways, MoP employs a set of expert probing classifiers, each tailored to capture distinct truthfulness encoding mechanisms. (2) Pathway Reweighting (PR): From the perspective of selectively emphasizing pathway-relevant internal cues, PR modulates information intensity to amplify signals that are most informative for hallucination detection, aligning internal activations with pathway-specific evidence. Experiments demonstrate that our proposed methods consistently outperform competing approaches, achieving up to a 10% AUC gain across various datasets and models.
Overall, our key contributions are summarized as follows:
- (Mechanism) We conduct a systematic investigation into how internal truthfulness signals emerge and operate within LLMs, revealing two distinct information pathways: a Question-Anchored pathway that relies on question-to-answer information flow, and an Answer-Anchored pathway that derives self-contained evidence from the generated output.
- (Discovery) Through large-scale experiments across multiple datasets and model families, we identify two key properties of these mechanisms: (i) association with knowledge boundaries, and (ii) intrinsic self-awareness of pathway distinctions.
- (Application) Building on these findings, we propose two pathway-aware detection methods that exploit the complementary nature of the two mechanisms to enhance hallucination detection, providing new insights for building more reliable generative systems.
2 Background
2.1 Hallucination Detection
Given an LLM $f$, we denote the dataset as $D=\{(q_{i},\hat{y}^{f}_{i},z^{f}_{i})\}_{i=1}^{N}$, where $q_{i}$ is the question, $\hat{y}^{f}_{i}$ the model's answer in open-ended generation, and $z^{f}_{i}\in\{0,1\}$ indicates whether the answer is hallucinatory. The task is to predict $z^{f}_{i}$ given the input $x^{f}_{i}=[q_{i},\hat{y}^{f}_{i}]$ for each instance. Cases in which the model refuses to answer are excluded, as they are not genuine hallucinations and can be trivially classified. Methods based on internal signals assume access to the model's hidden representations but no external resources (e.g., retrieval systems or fact-checking APIs) (Xue et al., 2025a). Within this paradigm, probing trains a lightweight linear classifier on hidden activations to discriminate between hallucinatory and factual outputs, and has been shown to be among the most effective internal-signal-based approaches (Orgad et al., 2025).
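As an illustration of this probing paradigm, the sketch below trains a linear classifier on synthetic stand-in activations. The dimensions, the planted "truthfulness direction", and the plain gradient-descent trainer are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden states of [q_i, y_hat_i]; in practice these
# are activations from a chosen Transformer layer. The "truthfulness
# direction" w_true is purely illustrative.
d, n = 32, 600
w_true = rng.normal(size=d)
z = rng.integers(0, 2, size=n)                  # 1 = hallucinatory, 0 = factual
H = rng.normal(size=(n, d)) + np.outer(2 * z - 1, w_true)

def train_probe(H, z, lr=0.1, steps=500):
    """Logistic-regression probe fit with plain gradient descent."""
    w, b = np.zeros(H.shape[1]), 0.0
    for _ in range(steps):
        logits = np.clip(H @ w + b, -30, 30)    # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        g = p - z
        w -= lr * H.T @ g / len(z)
        b -= lr * g.mean()
    return w, b

w, b = train_probe(H[:500], z[:500])
pred = (H[500:] @ w + b > 0).astype(int)
accuracy = (pred == z[500:]).mean()
```

Because the synthetic classes are linearly separable by construction, the probe reaches high held-out accuracy; on real activations, separability is what the truthfulness-signal literature empirically observes.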
2.2 Exact Question and Answer Tokens
To analyze the origins and mechanisms of truthfulness signals in LLMs, we primarily focus on exact tokens in question-answer pairs. Not all tokens contribute equally to detecting factual errors: some carry core information essential to the meaning of the question or answer, while others provide peripheral details. We draw on semantic frame theory (Baker et al., 1998; Pagnoni et al., 2021), which models a situation or event together with its participants and their roles. In this theory, frame elements are categorized as: (1) Core frame elements, which define the situation itself, and (2) Non-core elements, which provide additional, non-essential context.
As shown in Table 1, we define: (1) Exact question tokens: core frame elements in the question, typically including the exact subject and property tokens (i.e., South Carolina and capital). (2) Exact answer tokens: core frame elements in the answer that convey the critical information required to respond correctly (i.e., Columbia). Humans tend to rely more on core elements when detecting errors, as these tokens carry the most precise information. Consistent with this intuition, recent work (Orgad et al., 2025) shows that probing activations on the exact answer tokens offers the strongest signal for hallucination detection, outperforming all other token choices. Motivated by these findings, our analysis mainly centers on exact tokens to probe truthfulness signals in LLMs. Moreover, to validate the robustness of our conclusions, we also conduct comprehensive experiments using alternative, non-exact-token configurations (see Appendix B.2).
| Question: What is the capital of South Carolina? |
| --- |
| Answer: It is Columbia, a hub for government, culture, and education that houses the South Carolina State House and the University of South Carolina. |
Table 1: Example of exact question and answer tokens. Highlighted spans mark the token types: *capital* (exact property), *South Carolina* (exact subject), and *Columbia* (exact answer).
3 Two Internal Truthfulness Pathways
We begin with a preliminary analysis using metrics based on saliency scores (§ 3.1). The quantitative results reveal two distinct information pathways for truthfulness encoding: (1) a Question-Anchored (Q-Anchored) Pathway, which relies heavily on the exact question tokens, and (2) an Answer-Anchored (A-Anchored) Pathway, in which the truthfulness signal is largely independent of the question-to-answer information flow. Section 3.2 presents experiments validating this hypothesis. In particular, we show that the Q-Anchored Pathway depends critically on information flowing from the question to the answer, whereas the signals along the A-Anchored Pathway are primarily derived from the LLM-generated answer itself.
3.1 Saliency-Driven Preliminary Study
This section investigates the intrinsic characteristics of LLM attention interactions and their potential role in truthfulness encoding. We employ saliency analysis (Simonyan et al., 2014), a widely used interpretability method, to reveal how attention among tokens influences probe decisions. Following common practice (Michel et al., 2019; Wang et al., 2023), we compute the saliency score as:
$$
S^{l}(i,j)=\left|A^{l}(i,j)\frac{\partial\mathcal{L}(x)}{\partial A^{l}(i,j)}\right|, \tag{1}
$$
where $S^{l}$ denotes the saliency score matrix of the $l$-th layer, $A^{l}$ represents the attention weights of that layer, and $\mathcal{L}$ is the loss function for hallucination detection (i.e., the binary cross-entropy loss). Scores are averaged over all attention heads within each layer. In particular, $S^{l}(i,j)$ quantifies the saliency of attention from query $i$ to key $j$, capturing how strongly the information flow from $j$ to $i$ contributes to the detection. We study two types of information flow: (1) $S_{E_{Q}\rightarrow E_{A}}$, the saliency of direct information flow from the exact question tokens to the exact answer tokens, and (2) $S_{E_{Q}\rightarrow *}$, the saliency of the total information disseminated by the exact question tokens.
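Equation (1) can be checked numerically on a toy example. The sketch below invents a tiny attention matrix, value vectors, and probe weights, and estimates $\partial\mathcal{L}/\partial A$ by central finite differences rather than backpropagation; every shape and position index is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy single-head attention layer: A is a (T, T) row-stochastic matrix.
T, d = 6, 8
V = rng.normal(size=(T, d))    # value vectors
w = rng.normal(size=d)         # hypothetical probe weights
z = 1.0                        # label: 1 = hallucinatory

def detection_loss(A):
    """Binary cross-entropy of a linear probe read from the last position."""
    h = (A @ V)[-1]                        # attention output at the final token
    p = 1.0 / (1.0 + np.exp(-h @ w))
    return -(z * np.log(p) + (1 - z) * np.log(1 - p))

A = rng.dirichlet(np.ones(T), size=T)      # random attention weights

# Estimate dL/dA by central finite differences, then apply Eq. (1):
# S(i, j) = |A(i, j) * dL/dA(i, j)|.
eps = 1e-6
grad = np.zeros_like(A)
for i in range(T):
    for j in range(T):
        Ap, Am = A.copy(), A.copy()
        Ap[i, j] += eps
        Am[i, j] -= eps
        grad[i, j] = (detection_loss(Ap) - detection_loss(Am)) / (2 * eps)

S = np.abs(A * grad)           # per-layer saliency matrix

E_Q = [1, 2]                   # hypothetical exact question positions
s_q_to_a = S[-1, E_Q].mean()   # analogue of S_{E_Q -> E_A} for a final answer token
```

Since this toy loss reads only the last position, saliency is nonzero only in the last row; in a real Transformer, backpropagation through all layers produces dense per-head saliency maps that are then head-averaged per layer.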
Results
[Figure 1 image: kernel density plots of saliency-score distributions for Llama-3-8B and Llama-3-70B on TriviaQA and NQ; x-axis: saliency score, y-axis: density.]
Figure 1: Kernel density estimates of saliency-score distributions for critical question-to-answer information flows. The bimodal pattern suggests two distinct information mechanisms.
We present kernel density estimates of the saliency scores on the TriviaQA (Joshi et al., 2017) and Natural Questions (Kwiatkowski et al., 2019) datasets. As shown in Figure 1, the probability densities reveal a clear bimodal distribution: for all examined information types originating from the question, the probability mass concentrates around two peaks, one near zero saliency and another at a substantially higher value. The near-zero peak suggests that, for a substantial subset of samples, the question-to-answer information flow contributes minimally to hallucination detection, whereas the higher peak reflects strong dependence on such flow.
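To make the bimodality analysis concrete, the following sketch reproduces the kernel density estimation step on synthetic saliency scores drawn from an invented two-component mixture; the mixture parameters are assumptions, not measurements from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-sample saliency scores from a two-component mixture,
# mimicking the bimodal shape of Figure 1 (parameters invented).
near_zero = rng.normal(0.02, 0.05, size=700)   # question flow contributes little
high = rng.normal(0.70, 0.10, size=300)        # strong question-to-answer reliance
scores = np.concatenate([near_zero, high])

def kde(xs, data, bandwidth=0.05):
    """Plain Gaussian kernel density estimate."""
    u = (xs[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * u**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

xs = np.linspace(-0.3, 1.2, 400)
density = kde(xs, scores)

# Count interior local maxima as a crude bimodality check.
mid = density[1:-1]
peaks = int(np.sum((mid > density[:-2]) & (mid > density[2:])))
```

On real saliency scores, the same estimate separates a near-zero mode (candidate A-Anchored samples) from a high-saliency mode (candidate Q-Anchored samples).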
[Figure 2 image: ΔP versus probe layer for Llama-3-8B, Llama-3-70B, and Mistral-7B-v0.3, with Q-Anchored and A-Anchored curves on PopQA, TriviaQA, HotpotQA, and NQ.]
Figure 2: $\Delta\mathrm{P}$ under attention knockout. The layer axis indicates the Transformer layer on which the probe is trained. Shaded regions indicate 95% confidence intervals. Full results in Appendix C.
Hypothesis
These observations lead to the hypothesis that there are two distinct mechanisms of internal truthfulness encoding for hallucination detection: (1) one characterized by strong reliance on the key question-to-answer information from the exact question tokens, and (2) one in which truthfulness encoding is largely independent of the question. We validate the proposed hypothesis through further experiments in the next section.
3.2 Disentangling Information Mechanisms
We hypothesize that the internal truthfulness encoding operates through two distinct information flow mechanisms, driven by the attention modules within Transformer blocks. To validate the hypothesis, we first block information flows associated with the exact question tokens and analyze the resulting changes in the probeâs predictions. Subsequently, we apply a complementary technique, called token patching, to further substantiate the existence of these two mechanisms. Finally, we demonstrate that the self-contained information from the LLM-generated answer itself drives the truthfulness encoding for the A-Anchored type.
3.2.1 Experimental Setup
Our analysis covers a diverse collection of 12 LLMs that vary in both scale and architectural design. Specifically, we consider three categories: (1) base models, including Llama-3.2-1B (Grattafiori et al., 2024), Llama-3.2-3B, Llama-3-8B, Llama-3-70B, Mistral-7B-v0.1 (Jiang et al., 2023), and Mistral-7B-v0.3; (2) instruction-tuned models, including Llama-3.2-3B-Instruct, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Mistral-7B-Instruct-v0.3; and (3) reasoning-oriented models, namely Qwen3-8B (Yang et al., 2025) and Qwen3-32B. We conduct experiments on 4 widely used question-answering datasets: PopQA (Mallen et al., 2023), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). Additional implementation details are provided in Appendix B.
3.2.2 Identifying Anchored Modes via Attention Knockout
Experiment
To investigate whether internal truthfulness encoding operates via distinct information mechanisms, we perform an attention knockout experiment targeting the exact question tokens. Specifically, for a probe trained on representations from the $k$-th layer, we set $A^{l}(i,E_{Q})=0$ for layers $l\in\{1,\dots,k\}$ and positions $i>E_{Q}$. This procedure blocks the information flow from the question tokens to all subsequent positions in the representation. We then examine how the probe's predictions respond to this intervention. To provide a clearer picture, instances are categorized according to whether their prediction $\hat{z}$ changes after the attention knockout:
$$
\text{Mode}(x)=\begin{cases}\text{Q-Anchored},&\text{if }\hat{z}\neq\tilde{\hat{z}}\\
\text{A-Anchored},&\text{otherwise}\end{cases} \tag{2}
$$
where $\hat{z}$ and $\tilde{\hat{z}}$ denote predictions before and after the attention knockout, respectively.
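A minimal sketch of the knockout itself, assuming a toy single-head causal attention and hypothetical exact-question positions $E_{Q}$:

```python
import numpy as np

rng = np.random.default_rng(2)

def causal_attention(scores, knockout_cols=None):
    """Row-wise softmax over causally masked scores. Knockout additionally
    blocks attention from every position i > j to each key column j,
    mirroring A^l(i, E_Q) = 0 for i > E_Q."""
    T = scores.shape[0]
    s = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, scores)
    if knockout_cols is not None:
        for j in knockout_cols:
            s[j + 1:, j] = -np.inf         # sever flow out of token j
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 8
E_Q = [1, 2]                       # hypothetical exact question positions
scores = rng.normal(size=(T, T))   # toy pre-softmax scores for one head

A = causal_attention(scores)                        # original attention
A_ko = causal_attention(scores, knockout_cols=E_Q)  # after knockout
```

Running the probe on representations computed with `A` and with `A_ko`, an instance is labeled Q-Anchored if the prediction flips and A-Anchored otherwise, per Eq. (2).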
[Figure 3 image: prediction flip rates for Llama-3-8B, Llama-3-70B, and Mistral-7B-v0.3 on PopQA, TriviaQA, HotpotQA, and NQ, comparing Q-Anchored vs. A-Anchored samples under exact-question and random patching.]
Figure 3: Prediction flip rate under token patching. Q-Anchored samples demonstrate significantly higher sensitivity than their A-Anchored counterparts when hallucinatory cues are injected into exact questions. Full results in Appendix D.
Results
The results in Figure 2 and Appendix C reveal a clear bifurcation of behaviors: for one subset of instances, probabilities shift substantially, while for another subset, probabilities remain nearly unchanged across all layers. Shaded regions indicate 95% confidence intervals, confirming that this qualitative separation is statistically robust. This sharp divergence supports the hypothesis that internal truthfulness encoding operates via two distinct mechanisms with respect to question–answer information. In Appendix C, we conduct a comprehensive analysis of alternative configurations for token selection, activation extraction, and various instruction- or reasoning-oriented models, and observe consistent patterns across all settings. Moreover, Figure 16 in Appendix C shows that blocking information from randomly selected question tokens yields negligible changes, in contrast to blocking exact question tokens, underscoring the nontrivial nature of the identified mechanisms.
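The bifurcation above can be operationalized as a simple rule over probe probabilities before and after the knockout intervention. A minimal sketch, where the function names and the threshold `tau` are illustrative assumptions rather than the paper's exact criterion:

```python
def label_pathways(full_probs, knockout_probs, tau=0.2):
    """Assign each instance to an encoding pathway by how much the probe's
    truthfulness probability shifts when question-to-answer attention is
    blocked: large shifts indicate dependence on question-answer flow
    (Q-Anchored), while near-invariant instances are A-Anchored.
    `tau` is an illustrative threshold, not the paper's exact value."""
    return [
        "Q-Anchored" if abs(p_full - p_ko) > tau else "A-Anchored"
        for p_full, p_ko in zip(full_probs, knockout_probs)
    ]
```

For example, `label_pathways([0.9, 0.6], [0.3, 0.58])` would label the first instance Q-Anchored (shift of 0.6) and the second A-Anchored (shift of 0.02).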
3.2.3 Further Validation via Token Patching
Experiment
To further validate our findings, we employ a critical token patching technique to investigate how the internal representations of the LLM respond to hallucinatory signals originating from exact question tokens under the two proposed mechanisms. Given a context sample $d_{c}$ , we randomly select a patch sample $d_{p}$ and replace the original question tokens $E_{Q}^{c}$ in $d_{c}$ with the exact question tokens $E_{Q}^{p}$ from $d_{p}$ . This operation introduces hallucinatory cues into the context sample, allowing us to assess whether the LLM's internal states appropriately reflect the injected changes. We restrict our analysis to context instances where the original LLM answers are factual, ensuring that any observed changes can be attributed solely to the injected hallucinatory cues.
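The patching step can be sketched as follows; the flat sample layout (token-id lists under `"question"` and `"answer"`) and the field names are assumptions for illustration, not the authors' exact implementation:

```python
import random

def patch_question_tokens(context, patch):
    """Splice the exact question tokens of `patch` into `context`, keeping
    the context sample's original (factual) answer tokens."""
    return patch["question"] + context["answer"]

def build_patched_inputs(samples, seed=0):
    """For each factual context sample, pick a random patch sample and
    inject its question, introducing hallucinatory cues."""
    rng = random.Random(seed)
    patched = []
    for ctx in samples:
        if not ctx["is_factual"]:  # restrict to factual originals
            continue
        others = [s for s in samples if s is not ctx]
        patched.append(patch_question_tokens(ctx, rng.choice(others)))
    return patched
```

The patched sequences would then be re-encoded by the LLM and scored by the probe to test whether its internal states register the injected mismatch.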
Results
We measure the sensitivity of the truthfulness signals using the prediction flip rate, defined as the frequency with which the probe's prediction changes after hallucinatory cues are introduced. Figure 3 and Appendix D present the results of the best-performing layer of each model on four datasets when patching the exact question tokens. Across models and datasets, Q-Anchored mode exhibits significantly higher sensitivity compared to A-Anchored mode when exposed to hallucination cues from the questions. Furthermore, within each pathway, the flip rates where exact question tokens are patched are substantially higher than those observed when random tokens are patched, ruling out the possibility that the observed effects are mainly due to general semantic disruption from token replacement. These consistent results provide further support for our hypothesis regarding distinct mechanisms of information pathways.
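The flip-rate metric itself is straightforward to compute from probe probabilities before and after patching; a minimal sketch, assuming the common 0.5 decision threshold:

```python
def prediction_flip_rate(probs_before, probs_after, threshold=0.5):
    """Frequency with which the probe's binary prediction changes after
    hallucinatory cues are injected. The 0.5 decision threshold is a
    common default, assumed here for illustration."""
    flips = sum(
        (b >= threshold) != (a >= threshold)
        for b, a in zip(probs_before, probs_after)
    )
    return flips / len(probs_before)
```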
3.2.4 What Drives A-Anchored Encoding?
Experiment
Since the A-Anchored mode operates largely independently of the question-to-answer information flow, it is important to investigate the source of information it uses to identify hallucinations. To this end, we remove the questions entirely from each sample and perform a separate forward pass using only the LLM-generated answers. This procedure yields answer-only hidden states, which are subsequently provided as input to the probe. We then evaluate how the probe's predictions change under this "answer-only" condition. This setup enables us to assess whether A-Anchored predictions rely primarily on the generated answer itself rather than on the original question.
Results
As shown in Figure 4 and Appendix E, Q-Anchored instances exhibit substantial changes in prediction probability when the question is removed, reflecting their dependence on question-to-answer information. In contrast, A-Anchored instances remain largely invariant, indicating that the probe continues to detect hallucinations using information encoded within the LLM-generated answer itself. These findings suggest that the A-Anchored mechanism primarily leverages self-contained answer information to build signals about truthfulness.
Figure 4: $-\Delta\mathrm{P}$ with only the LLM-generated answer. Q-Anchored instances exhibit substantial shifts, whereas A-Anchored instances remain stable, confirming that A-Anchored truthfulness encoding relies on information in the LLM-generated answer itself. Full results in Appendix E.
4 Properties of Truthfulness Pathways
This section examines notable properties and distinct behaviors of intrinsic truthfulness encoding: (1) Associations with knowledge boundaries: samples within the LLM's knowledge boundary tend to encode truthfulness via the Q-Anchored pathway, whereas samples beyond the boundary often rely on the A-Anchored signal; (2) Self-awareness: internal representations can be used to predict which mechanism is being employed, suggesting that LLMs possess intrinsic awareness of pathway distinctions.
4.1 Associations with Knowledge Boundaries
We find that distinct patterns of truthfulness encoding are closely associated with the knowledge boundaries of LLMs. To characterize these boundaries, three complementary metrics are employed: (1) Answer accuracy, the most direct indicator of an LLM's factual competence; (2) I-don't-know rate (shown in Appendix G), which reflects the model's ability to recognize and express its own knowledge limitations; (3) Entity popularity, which is widely used to distinguish between common and long-tail factual knowledge (Mallen et al., 2023).
As shown in Figure 5 and Appendix F, Q-Anchored samples achieve significantly higher accuracy than those driven by the A-Anchored pathway. The results for the I-don't-know rate, reported in Appendix G, exhibit trends consistent with answer accuracy, further indicating stronger knowledge handling in Q-Anchored samples. Moreover, entity popularity, shown in Figure 6, provides a more fine-grained perspective on knowledge boundaries. Specifically, Q-Anchored samples tend to involve more popular entities, whereas A-Anchored samples are more frequently associated with less popular, long-tail factual knowledge. These findings suggest that truthfulness encoding is strongly aligned with the availability of stored knowledge: when LLMs possess the requisite knowledge, they predominantly rely on question–answer information flow (Q-Anchored); when knowledge is unavailable, they instead draw upon internal patterns within their own generated outputs (A-Anchored).
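The accuracy comparison underlying this analysis reduces to grouping instances by their assigned pathway; a minimal sketch, where the sample dict fields (`"pathway"`, `"correct"`) are illustrative assumptions:

```python
from collections import defaultdict

def accuracy_by_pathway(samples):
    """Mean answer accuracy per pathway label, e.g. to compare Q-Anchored
    vs. A-Anchored samples against the model's knowledge boundary.
    Each sample carries a pathway label and a 0/1 correctness flag."""
    totals = defaultdict(lambda: [0, 0])  # label -> [correct count, total]
    for s in samples:
        acc = totals[s["pathway"]]
        acc[0] += s["correct"]
        acc[1] += 1
    return {label: c / n for label, (c, n) in totals.items()}
```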
Figure 5: Comparisons of answer accuracy between pathways. Q-Anchored samples show higher accuracy than A-Anchored ones, highlighting the association between truthfulness encoding and LLM knowledge boundaries. Full results in Appendix F and G.
Figure 6: Entity frequency distributions for both pathways on PopQA. Q-Anchored samples concentrate on more popular entities, whereas A-Anchored samples skew toward long-tail entities.
4.2 Self-Awareness of Pathway Distinctions
Given that LLMs encode truthfulness via two distinct mechanisms, this section investigates whether their internal representations contain discriminative information that can be used to distinguish between these mechanisms. To this end, we train probing classifiers on the models' original internal states (i.e., without knockout interventions) to predict which mechanism is being utilized.
Table 2 reports the pathway classification results of the best-performing layers in hallucination detection across different models. Our findings demonstrate that different mechanisms can be reliably inferred from internal representations, suggesting that, in addition to encoding truthfulness, LLMs exhibit intrinsic awareness of pathway distinctions. These findings highlight a potential avenue for fine-grained improvements targeting specific truthfulness encoding mechanisms.
| Datasets | Llama-3-8B | Llama-3-70B | Mistral-7B-v0.3 |
| --- | --- | --- | --- |
| PopQA | 87.80 | 92.66 | 87.64 |
| TriviaQA | 75.10 | 83.91 | 85.87 |
| HotpotQA | 86.31 | 87.34 | 92.13 |
| NQ | 78.31 | 84.14 | 84.83 |
Table 2: AUCs for encoding pathway classification. The predictability from internal representations indicates that LLMs possess intrinsic awareness of pathway distinctions.
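The probing setup can be sketched with a standard linear classifier on hidden states; hyperparameters, the train/test split, and function names are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def pathway_probe_auc(hidden_states, pathway_labels, train_frac=0.8, seed=0):
    """Fit a linear probe on (un-intervened) hidden states to predict each
    sample's pathway label, and report held-out AUC."""
    X = np.asarray(hidden_states)
    y = np.asarray(pathway_labels)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_tr = int(train_frac * len(X))
    tr, te = idx[:n_tr], idx[n_tr:]
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    # column 1 of predict_proba is P(label == 1)
    return roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1])
```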
5 Pathway-Aware Detection
Building on the intriguing findings, we explore how the discovered pathway distinctions can be leveraged to improve hallucination detection. Specifically, two simple yet effective pathway-aware strategies are proposed: (1) Mixture-of-Probes (MoP) (§ 5.1), which allows expert probes to specialize in Q-Anchored and A-Anchored pathways respectively, and (2) Pathway Reweighting (PR) (§ 5.2), a plug-and-play approach that amplifies pathway-relevant cues salient for detection.
5.1 Mixture-of-Probes
Motivated by the fundamentally different dependencies of the two encoding pathways and the LLMs' intrinsic awareness of them, we propose a Mixture-of-Probes (MoP) framework that explicitly captures this heterogeneity. Rather than training a single probe to handle all inputs, MoP employs two pathway-specialized experts and leverages the self-awareness probe (§ 4.2) as a gating network to combine their predictions. Let $\mathbf{h}^{l^{*}}(x)\!\in\!\mathbb{R}^{d}$ be the token hidden state from the best detection layer $l^{*}$ . Two expert probes $p_{Q}(\cdot)$ and $p_{A}(\cdot)$ are trained separately on samples from the two pathways, and the self-awareness probe provides a gating coefficient $\pi_{Q}=\pi(\mathbf{h}^{l^{*}}(x))\!\in\![0,1]$ . The final prediction is a convex combination, requiring no extra training:
$$
p_{\text{MoP}}(z\!=\!1\mid\mathbf{h}^{l^{*}}(x))=\pi_{Q}\,p_{Q}(z\!=\!1\mid\mathbf{h}^{l^{*}}(x))+(1-\pi_{Q})\,p_{A}(z\!=\!1\mid\mathbf{h}^{l^{*}}(x)). \tag{3}
$$
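The combination in Eq. (3) is a one-line computation once the three probes are available; a minimal sketch, where each probe is modeled as a callable mapping a hidden state to a probability:

```python
def mop_predict(h, gate, expert_q, expert_a):
    """Mixture-of-Probes prediction (Eq. 3): a convex combination of two
    pathway-specialized expert probes, weighted by the self-awareness
    gate pi_Q = gate(h). No additional training is needed to combine
    the probes' outputs."""
    pi_q = gate(h)
    return pi_q * expert_q(h) + (1.0 - pi_q) * expert_a(h)
```

For instance, with a gate outputting 0.25 and experts outputting 0.8 and 0.4, the mixture score is 0.25 · 0.8 + 0.75 · 0.4 = 0.5.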
5.2 Pathway Reweighting
From the perspective of emphasizing pathway-relevant internal cues, we introduce a plug-and-play Pathway Reweighting (PR) method that directly modulates the question–answer information flow. The key idea is to adjust the attention from exact answer tokens to question tokens according to the predicted pathway, amplifying the signals most salient for hallucination detection. For each layer $l\leq l^{*}$ , two learnable scalars $\alpha_{Q}^{l},\alpha_{A}^{l}>0$ are introduced. Given the self-awareness probability $\pi(\mathbf{h}^{l^{*}}(x))$ , we rescale attention edges $i\!\in\!E_{A}$ , $j\!\in\!E_{Q}$ to construct representations tailored for detection:
$$
\tilde{A}^{l}(i,j)=\begin{cases}\bigl[1+s(\mathbf{h}^{l^{*}}(x))\bigr]A^{l}(i,j),&i\!\in\!E_{A},j\!\in\!E_{Q},\\
A^{l}(i,j),&\text{otherwise},\end{cases} \tag{4}
$$
where
$$
s(\mathbf{h}^{l^{*}}(x))=\pi_{Q}\,\alpha_{Q}^{l}-(1-\pi_{Q})\,\alpha_{A}^{l}. \tag{5}
$$
The extra parameters serve as a lightweight adapter, used only during detection to guide salient truthfulness cues and omitted during generation, leaving the generation capacity unaffected.
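The edge-selection arithmetic of Eqs. (4)-(5) can be sketched as follows. This is a simplified single-matrix illustration (function name, per-layer handling, and any re-normalization of attention weights are assumptions not specified here):

```python
import numpy as np

def reweight_attention(A, answer_idx, question_idx, pi_q, alpha_q, alpha_a):
    """Pathway Reweighting (Eqs. 4-5): scale answer-to-question attention
    edges by 1 + s, with s = pi_q * alpha_q - (1 - pi_q) * alpha_a,
    leaving all other edges untouched. `A` is one layer's (seq, seq)
    attention matrix."""
    s = pi_q * alpha_q - (1.0 - pi_q) * alpha_a
    A_new = A.copy()
    rows = np.asarray(answer_idx)[:, None]    # i in E_A
    cols = np.asarray(question_idx)[None, :]  # j in E_Q
    A_new[rows, cols] *= 1.0 + s
    return A_new
```

Note that `pi_q` close to 1 amplifies the answer-to-question edges (Q-Anchored), while `pi_q` close to 0 dampens them (A-Anchored), matching the sign of $s$ in Eq. (5).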
| Method | Llama-3-8B | | | | Mistral-7B-v0.3 | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | PopQA | TriviaQA | HotpotQA | NQ | PopQA | TriviaQA | HotpotQA | NQ |
| P(True) | 55.85 | 49.92 | 52.14 | 53.27 | 45.49 | 47.61 | 57.87 | 52.79 |
| Logits-mean | 74.52 | 60.39 | 51.94 | 52.63 | 69.52 | 66.76 | 55.45 | 57.88 |
| Logits-min | 85.36 | 70.89 | 61.28 | 56.50 | 87.05 | 77.33 | 68.08 | 54.40 |
| Probing Baseline | 88.71 | 77.58 | 82.23 | 70.20 | 87.39 | 81.74 | 83.19 | 73.60 |
| MoP-RandomGate | 75.52 | 69.17 | 79.88 | 66.56 | 79.81 | 70.88 | 72.23 | 61.19 |
| MoP-VanillaExperts | 89.11 | 78.73 | 84.57 | 71.21 | 88.53 | 80.93 | 82.93 | 73.77 |
| MoP | 92.11 | 81.18 | 85.45 | 74.64 | 91.66 | 83.57 | 85.82 | 76.87 |
| PR | 94.01 | 83.13 | 87.81 | 79.10 | 93.09 | 84.36 | 89.03 | 79.09 |
Table 3: Comparison of hallucination detection performance (AUC). Full results in Appendix H.
5.3 Experiments
Setup
The experimental setup follows Section 3.2.1. We compare our method against several internal-based baselines, including (1) P(True) (Kadavath et al., 2022), (2) uncertainty-based metrics (Aichberger et al., 2024; Xue et al., 2025a), and (3) probing classifiers (Chen et al., 2024; Orgad et al., 2025). Results are averaged over three random seeds. Additional implementation details are provided in Appendix B.5 and B.6.
Results
As shown in Table 3 and Appendix H, both MoP and PR consistently outperform competing approaches across different datasets and model scales. Specifically, for MoP, we further examine two ablated variants: (1) MoP-RandomGate, which randomly routes the two pathway experts without leveraging the self-awareness probe; and (2) MoP-VanillaExperts, which replaces the expert probes with two vanilla probes to serve as a simple ensemble strategy. Both ablated variants exhibit substantially degraded performance compared to MoP, underscoring the roles of pathway specialization and self-awareness gating. For PR, the method proves particularly effective in improving performance by dynamically adjusting the focus on salient truthfulness cues. These results demonstrate that explicitly modeling truthfulness encoding heterogeneity can effectively translate the insights of our analysis into practical gains for hallucination detection.
6 Related Work
Hallucination detection in LLMs has received increasing attention because of its critical role in building reliable and trustworthy generative systems (Tian et al., 2024; Shi et al., 2024; Bai et al., 2024). Existing approaches can be broadly grouped by whether they rely on external resources (e.g., retrieval systems or fact-checking APIs). Externally assisted methods cross-verify output texts against external knowledge bases (Min et al., 2023; Hu et al., 2025; Huang et al., 2025) or specialized LLM judges (Luo et al., 2024; Bouchard and Chauhan, 2025; Zhang et al., 2025). Resource-free methods avoid external data and instead exploit the model's own intermediate computations. Some leverage the model's self-awareness of knowledge boundaries (Kadavath et al., 2022; Luo et al., 2025), while others use uncertainty-based measures (Aichberger et al., 2024; Xue et al., 2025a), treating confidence as a proxy for truthfulness. These techniques analyze output distributions (e.g., logits) (Aichberger et al., 2024), variance across multiple samples (e.g., consistency) (Min et al., 2023; Aichberger et al., 2025), or other statistical indicators of prediction uncertainty (Xue et al., 2025b). Another line of work trains linear probing classifiers on hidden representations to capture intrinsic truthfulness signals. Prior work (Burns et al., 2023; Li et al., 2023; Chen et al., 2024; Orgad et al., 2025) shows that LLMs encode rich latent features correlated with factual accuracy, enabling efficient detection with minimal overhead. Yet the mechanisms behind this internal truthfulness encoding remain poorly understood. Compared to previous approaches, our work addresses this gap by dissecting how such intrinsic signals emerge and operate, revealing distinct information pathways that not only yield explanatory insights but also enhance detection performance.
7 Conclusion
We investigate how LLMs encode truthfulness, revealing two complementary pathways: a Question-Anchored pathway relying on question–answer flow, and an Answer-Anchored pathway extracting self-contained evidence from generated outputs. Analyses across datasets and models highlight their ties to knowledge boundaries and intrinsic self-awareness. Building on these insights, we further propose two applications to improve hallucination detection. Overall, our findings not only advance mechanistic understanding of intrinsic truthfulness encoding but also offer practical applications for building more reliable generative systems.
Limitations
While this work provides a systematic analysis of intrinsic truthfulness encoding mechanisms in LLMs and demonstrates their utility for hallucination detection, one limitation is that, similar to prior work on mechanistic interpretability, our analyses and pathway-aware applications assume access to internal model representations. Such access may not always be available in strictly black-box settings. In these scenarios, additional engineering or alternative approximations may be required for practical deployment, which we leave for future work.
Ethics Statement
Our work presents minimal potential for negative societal impact, primarily due to the use of publicly available datasets and models. This accessibility inherently reduces the risk of adverse effects on individuals or society.
References
- Aichberger et al. (2024) Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. 2024. Semantically diverse language generation for uncertainty estimation in language models. arXiv preprint arXiv:2406.04306.
- Aichberger et al. (2025) Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. 2025. Improving uncertainty estimation through semantically diverse language generation. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.
- Bai et al. (2024) Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 7421–7454. Association for Computational Linguistics.
- Baker et al. (1998) Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The berkeley framenet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90.
- Bouchard and Chauhan (2025) Dylan Bouchard and Mohit Singh Chauhan. 2025. Uncertainty quantification for language models: A suite of black-box, white-box, llm judge, and ensemble scorers. arXiv preprint arXiv:2504.19254.
- Burns et al. (2023) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2023. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Chen et al. (2024) Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: LLMs' internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Fierro et al. (2025) Constanza Fierro, Negar Foroutan, Desmond Elliott, and Anders Søgaard. 2025. How do multilingual language models remember facts? In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 16052–16106. Association for Computational Linguistics.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12216–12235. Association for Computational Linguistics.
- Ghandeharioun et al. (2024) Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. 2024. Patchscopes: A unifying framework for inspecting hidden representations of language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.
- Hu et al. (2025) Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, and Qing Li. 2025. Removal of hallucination on hallucination: Debate-augmented RAG. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 15839–15853. Association for Computational Linguistics.
- Huang et al. (2025) Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yuxuan Gu, Yangfan Ye, Liang Zhao, Weihong Zhong, Baoxin Wang, Dayong Wu, Guoping Hu, Lingpeng Kong, Tong Xiao, Ting Liu, and Bing Qin. 2025. Alleviating hallucinations from knowledge misalignment in large language models via selective abstention learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24564–24579. Association for Computational Linguistics.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, T. J. Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zachary Dodds, Nova Dassarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, and 17 others. 2022. Language models (mostly) know what they know. Preprint, arXiv:2207.05221.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
- Li et al. (2023) Kenneth Li, Oam Patel, Fernanda B. Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Luo et al. (2024) Wen Luo, Tianshu Shen, Wei Li, Guangyue Peng, Richeng Xuan, Houfeng Wang, and Xi Yang. 2024. Halludial: A large-scale benchmark for automatic dialogue-level hallucination evaluation. Preprint, arXiv:2406.07070.
- Luo et al. (2025) Wen Luo, Feifan Song, Wei Li, Guangyue Peng, Shaohang Wei, and Houfeng Wang. 2025. Odysseus navigates the sirens' song: Dynamic focus decoding for factual and diverse open-ended text generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27200–27218, Vienna, Austria. Association for Computational Linguistics.
- Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822, Toronto, Canada. Association for Computational Linguistics.
- Michel et al. (2019) Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? Advances in neural information processing systems, 32.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12076–12100. Association for Computational Linguistics.
- Niu et al. (2025) Mengjia Niu, Hamed Haddadi, and Guansong Pang. 2025. Robust hallucination detection in llms via adaptive token selection. Preprint, arXiv:2504.07863.
- Orgad et al. (2025) Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. 2025. Llms know more than they show: On the intrinsic representation of LLM hallucinations. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.
- Pagnoni et al. (2021) Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with frank: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829.
- Qian et al. (2025) Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. 2025. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning. Preprint, arXiv:2506.02867.
- Shi et al. (2024) Zhengliang Shi, Shuo Zhang, Weiwei Sun, Shen Gao, Pengjie Ren, Zhumin Chen, and Zhaochun Ren. 2024. Generate-then-ground in retrieval-augmented generation for multi-hop question answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 7339–7353. Association for Computational Linguistics.
- Simonyan et al. (2014) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings.
- Tian et al. (2024) Yuanhe Tian, Ruyi Gan, Yan Song, Jiaxing Zhang, and Yongdong Zhang. 2024. Chimed-gpt: A chinese medical large language model with full training regime and better alignment to human preferences. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 7156–7173. Association for Computational Linguistics.
- Todd et al. (2024) Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. 2024. Function vectors in large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Wang et al. (2023) Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855.
- Wu et al. (2025) Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2025. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.
- Xue et al. (2025a) Boyang Xue, Fei Mi, Qi Zhu, Hongru Wang, Rui Wang, Sheng Wang, Erxin Yu, Xuming Hu, and Kam-Fai Wong. 2025a. UAlign: Leveraging uncertainty estimations for factuality alignment on large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6002–6024, Vienna, Austria. Association for Computational Linguistics.
- Xue et al. (2025b) Yihao Xue, Kristjan Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman. 2025b. Verify when uncertain: Beyond self-consistency in black box hallucination detection. Preprint, arXiv:2502.15845.
- Yang et al. (2025) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
- Zhang et al. (2025) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, and 1 other. 2025. Siren's song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics, pages 1–46.
Appendix A LLM Usage
In this work, we employ LLMs solely for language refinement to enhance clarity and explanatory quality. All content has been carefully verified for factual accuracy, and the authors take full responsibility for the entire manuscript. The core ideas, experimental design, and methodological framework were conceived and developed independently by the authors, without the use of LLMs.
Appendix B Implementation Details
B.1 Identifying Exact Question and Answer Tokens
To locate the exact question and answer tokens within a QA pair, we prompt GPT-4o (version gpt-4o_2024-11-20) to identify the precise positions of the core frame elements. The instruction templates are presented in Tables 5 and 6. A token is considered an exact question or exact answer if and only if it constitutes a valid substring of the corresponding question or answer. To mitigate potential biases, each example is prompted at most five times, and only successfully extracted instances are retained for downstream analysis. Prior work (Orgad et al., 2025) has shown that LLMs can accurately identify exact answer tokens, typically achieving over 95% accuracy. In addition, we manually verified GPT-4o's identification quality in our setting. Specifically, it achieves 99.92%, 95.83%, and 96.62% accuracy on exact subject tokens, exact property tokens, and exact answer tokens, respectively. Furthermore, we also explore alternative configurations without the use of exact tokens to ensure the robustness of our findings (see Section B.2).
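The substring criterion above can be sketched as follows; stripping the leading-space marker that BPE tokenizers attach is our own simplifying assumption, not part of the paper's procedure:

```python
def is_exact_token(token: str, span: str) -> bool:
    """Return True iff the token's surface form is a valid substring of
    the identified question/answer span."""
    surface = token.replace("\u0120", " ").strip()  # drop the BPE "Ġ" marker
    return bool(surface) and surface in span

# Tokens checked against the answer span "My Fair Lady":
checks = [is_exact_token(t, "My Fair Lady") for t in ["ĠFair", "Lady", "Paris"]]
```

Here `checks` flags the first two tokens as exact answer tokens and rejects the third.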
B.2 Probing Implementation Details
We investigate multiple probing configurations. For token selection, we consider three types of tokens: (1) the final token of the answer, which is the most commonly adopted choice in prior work due to its global receptive field under attention (Chen et al., 2024); (2) the token immediately preceding the exact answer span; and (3) the final token within the exact answer span. For activation extraction, we obtain representations from either (1) the output of each attention sublayer or (2) the output of the final multi-layer perceptron (MLP) in each transformer layer. Across all configurations, our experimental results exhibit consistent trends, indicating that the observed findings are robust to these design choices. For the probing classifier, we follow standard practice (Chen et al., 2024; Orgad et al., 2025) and employ a logistic regression model implemented in scikit-learn.
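As a concrete illustration, a minimal probing pipeline in this style can be sketched as follows, with random arrays standing in for real activations and truthfulness labels (all shapes and names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# X: hidden state of the selected token (e.g., the final exact-answer
# token) at one layer for each QA pair; y: 1 = truthful, 0 = hallucinated.
X = rng.normal(size=(2000, 256))
y = rng.integers(0, 2, size=2000)

split = 1600
probe = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
auc = roc_auc_score(y[split:], probe.predict_proba(X[split:])[:, 1])
```

In practice this is repeated per layer and per token-selection scheme, and the best-performing configuration is chosen on a held-out split.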
B.3 Models
Our analysis covers a diverse collection of 12 LLMs that vary in both scale and architectural design. Specifically, we consider three categories: (1) base models, including Llama-3.2-1B (Grattafiori et al., 2024), Llama-3.2-3B, Llama-3-8B, Llama-3-70B, Mistral-7B-v0.1 (Jiang et al., 2023), and Mistral-7B-v0.3; (2) instruction-tuned models, including Llama-3.2-3B-Instruct, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Mistral-7B-Instruct-v0.3; and (3) reasoning-oriented models, namely Qwen3-8B (Yang et al., 2025) and Qwen3-32B.
B.4 Datasets
We consider four widely used question-answering datasets: PopQA (Mallen et al., 2023), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019).
PopQA is an open-domain question-answering dataset that emphasizes entity-centric factual knowledge with a long-tail distribution. It is designed to probe LLMs' ability to memorize less frequent facts, highlighting limitations in parametric knowledge.
TriviaQA is a reading comprehension dataset constructed by pairing trivia questions authored independently of evidence documents. The questions are often complex, requiring multi-sentence reasoning, and exhibit substantial lexical and syntactic variability.
HotpotQA is a challenging multi-hop question-answering dataset that requires reasoning over multiple supporting documents. It includes diverse question types (span extraction, yes/no, and novel comparison questions) along with sentence-level supporting fact annotations, promoting the development of explainable QA systems.
Natural Questions is an open-domain dataset consisting of real, anonymized questions from Google search queries. Each question is annotated with both a long answer (paragraph or section) and a short answer (span or yes/no), or marked as null when no answer is available.
Due to computational constraints, we randomly sample 2,000 training examples and 2,000 test examples from each dataset.
B.5 Implementation Details of Baselines
In our experiments regarding applications, we compare our proposed methods against several internal-based baselines for hallucination detection. These baselines leverage the LLM's internal signals, such as output probabilities, logits, and hidden representations, without relying on external resources. Below, we detail the implementation of each baseline.
P(True)
P(True) (Kadavath et al., 2022) exploits the LLM's self-awareness of its knowledge boundaries by prompting the model to assess the correctness of its own generated answer. Specifically, for each question-answer pair $(q_{i},\hat{y}^{f}_{i})$ , we prompt the LLM with a template that asks it to evaluate whether its answer is factually correct. Following Kadavath et al. (2022), the prompt template is shown in Table 4.
| Question: {Here is the question} |
| --- |
| Possible answer: {Here is the answer} |
| Is the possible answer: |
| (A) True |
| (B) False |
| The possible answer is: |
Table 4: Prompt template used for the P(True) baseline.
Logits-based Baselines
The logits-based baselines utilize the raw logits produced by the LLM during the generation of the exact answer tokens. Let $\hat{y}^{f}_{i,E_{A}}=[t_{1},t_{2},...,t_{m}]$ represent the sequence of exact answer tokens for a given question-answer pair, where $m$ is the number of exact answer tokens. For each token $t_{j}$ (where $j\in\{1,...,m\}$ ), the LLM produces a logit vector $L_{j}\in\mathbb{R}^{V}$ , where $V$ is the vocabulary size, and the logit for the generated token $t_{j}$ is denoted $L_{j}[t_{j}]$ . The logits-based metrics are defined as follows:
- Logits-mean: The average of the logits across all exact answer tokens:
$$
\text{Logits-mean}=\frac{1}{m}\sum_{j=1}^{m}L_{j}[t_{j}] \tag{6}
$$
- Logits-max: The maximum logit value among the exact answer tokens:
$$
\text{Logits-max}=\max_{j\in\{1,\dots,m\}}L_{j}[t_{j}] \tag{7}
$$
- Logits-min: The minimum logit value among the exact answer tokens:
$$
\text{Logits-min}=\min_{j\in\{1,\dots,m\}}L_{j}[t_{j}] \tag{8}
$$
These metrics serve as proxies for the model's confidence in the generated answer, with lower logit values potentially indicating uncertainty or hallucination.
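A minimal sketch of the three metrics in Eqs. 6-8, assuming the per-token logits $L_{j}[t_{j}]$ have already been collected during generation:

```python
import numpy as np

def logits_metrics(answer_token_logits):
    """answer_token_logits[j] = L_j[t_j], the logit the model assigned to
    the j-th exact-answer token it generated (Eqs. 6-8)."""
    a = np.asarray(answer_token_logits, dtype=float)
    return {
        "logits_mean": float(a.mean()),  # Eq. 6
        "logits_max": float(a.max()),    # Eq. 7
        "logits_min": float(a.min()),    # Eq. 8
    }

m = logits_metrics([12.3, 9.8, 15.1])
```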
Scores-based Baselines
The scores-based baselines are derived from the softmax probabilities of the exact answer tokens. Using the same notation as above, for each exact answer token $t_{j}$ , the softmax probability is computed as:
$$
p_{j}[t_{j}]=\frac{\exp(L_{j}[t_{j}])}{\sum_{k=1}^{V}\exp(L_{j}[k])} \tag{9}
$$
where $L_{j}[k]$ is the logit for the $k$ -th token in the vocabulary. The scores-based metrics are defined as follows:
- Scores-mean: The average of the softmax probabilities across all exact answer tokens:
$$
\text{Scores-mean}=\frac{1}{m}\sum_{j=1}^{m}p_{j}[t_{j}] \tag{10}
$$
- Scores-max: The maximum softmax probability among the exact answer tokens:
$$
\text{Scores-max}=\max_{j\in\{1,\dots,m\}}p_{j}[t_{j}] \tag{11}
$$
- Scores-min: The minimum softmax probability among the exact answer tokens:
$$
\text{Scores-min}=\min_{j\in\{1,\dots,m\}}p_{j}[t_{j}] \tag{12}
$$
These probabilities provide a normalized measure of the model's confidence, bounded between 0 and 1, with lower values potentially indicating a higher likelihood of hallucination.
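The scores-based metrics in Eqs. 9-12 can be sketched analogously, starting from the full logit vectors and applying a numerically stable softmax:

```python
import numpy as np

def scores_metrics(logit_rows, token_ids):
    """logit_rows[j] is the full vocabulary logit vector L_j and
    token_ids[j] the generated token t_j (Eqs. 9-12)."""
    L = np.asarray(logit_rows, dtype=float)
    shifted = L - L.max(axis=1, keepdims=True)  # stable softmax (Eq. 9)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    p = probs[np.arange(len(token_ids)), token_ids]
    return {
        "scores_mean": float(p.mean()),  # Eq. 10
        "scores_max": float(p.max()),    # Eq. 11
        "scores_min": float(p.min()),    # Eq. 12
    }

# Toy vocabulary of size 3, two exact-answer tokens.
m = scores_metrics([[2.0, 1.0, 0.0], [0.0, 3.0, 0.0]], [0, 1])
```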
Probing Baseline
The probing baseline follows the standard approach described in Chen et al. (2024); Orgad et al. (2025). A linear classifier is trained on the hidden representations of the last exact answer token from the best-performing layer. The training and evaluation data for the probing classifier are constructed following the procedure described in Appendix B.4. The classifier is implemented using scikit-learn with default hyperparameters, consistent with the probing setup described in Appendix B.2. The probing baseline serves as a direct comparison to our proposed applications, as it relies on the same type of internal signals but does not account for the heterogeneity of truthfulness encoding pathways.
B.6 Implementation Details of MoP and PR
Model Backbone and Hidden Representations
All experiments use the same base LLM as in the main paper. Hidden representations $\mathbf{h}^{l^{*}}(x)$ are extracted from the best-performing layer $l^{*}$ determined on a held-out validation split.
Mixture-of-Probes (MoP)
Similar to Appendix B.5, the two expert probes $p_{Q}$ and $p_{A}$ are implemented using scikit-learn with default hyperparameters, consistent with the probing setup described in Appendix B.2. The gating network is taken directly from the self-awareness probe described in Section 4.2. The training and evaluation data for the probing classifiers are the same as in Appendix B.5. The proposed MoP framework requires no additional retraining: we directly combine the two expert probes with the pathway-discrimination classifier described in Section 4.2 and perform inference without further parameter updates.
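A sketch of this training-free combination. The convex weighting by the gate's pathway probability is our reading of "directly combine," and the probes and gate below are fitted on random data purely to make the snippet runnable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mop_predict(h, probe_q, probe_a, gate):
    """Weight the two expert probes by the gate's probability that the
    example follows the Answer-Anchored pathway; no retraining needed."""
    w_a = gate.predict_proba(h)[:, 1]     # P(answer-anchored | h)
    s_q = probe_q.predict_proba(h)[:, 1]  # Question-Anchored expert score
    s_a = probe_a.predict_proba(h)[:, 1]  # Answer-Anchored expert score
    return (1.0 - w_a) * s_q + w_a * s_a

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 32))            # stand-in hidden states
y = rng.integers(0, 2, size=200)          # stand-in truthfulness labels
probe_q = LogisticRegression(max_iter=1000).fit(H, y)
probe_a = LogisticRegression(max_iter=1000).fit(H, y)
gate = LogisticRegression(max_iter=1000).fit(H, rng.integers(0, 2, size=200))
scores = mop_predict(H, probe_q, probe_a, gate)
```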
Pathway Reweighting (PR)
The training and evaluation data used for the probing classifier are identical to those described in Appendix B.5. For each Transformer layer $l \leq l^{*}$ , we introduce two learnable scalars $\alpha_{Q}^{l}$ and $\alpha_{A}^{l}$ for every attention head. These parameters, together with the probe parameters, are optimized using the Adam optimizer with a learning rate of $1\times 10^{-2}$ , $\beta_{1}=0.9$ , and $\beta_{2}=0.999$ . Training is conducted with a batch size of 512 for 10 epochs, while all original LLM parameters remain frozen.
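A PyTorch sketch of this optimization under stated assumptions: the per-head pathway features and the way they are pooled into probe inputs are placeholders, while the optimizer settings (Adam, learning rate $1\times 10^{-2}$, batch size 512, 10 epochs) follow the text:

```python
import torch

# Per-head reweighting scalars for layers l <= l*: alpha_Q and alpha_A
# rescale the Question- and Answer-Anchored contributions of each head.
# The LLM itself stays frozen; only the scalars and the probe are trained.
n_layers, n_heads, d_head = 4, 8, 64

alpha_q = torch.nn.Parameter(torch.ones(n_layers, n_heads))
alpha_a = torch.nn.Parameter(torch.ones(n_layers, n_heads))
probe = torch.nn.Linear(d_head, 1)

opt = torch.optim.Adam(
    [alpha_q, alpha_a, *probe.parameters()],
    lr=1e-2, betas=(0.9, 0.999),
)

# Random stand-ins for the frozen per-head features split by pathway.
torch.manual_seed(0)
feat_q = torch.randn(512, n_layers, n_heads, d_head)
feat_a = torch.randn(512, n_layers, n_heads, d_head)
labels = torch.randint(0, 2, (512, 1)).float()

for _ in range(10):  # 10 epochs at batch size 512 (one full batch here)
    opt.zero_grad()
    mixed = alpha_q[..., None] * feat_q + alpha_a[..., None] * feat_a
    logits = probe(mixed.mean(dim=(1, 2)))  # pool over layers and heads
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    opt.step()
```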
| You are given a factual open-domain question-answer pair. |
| --- |
| Your task is to identify: |
| 1. Core Entity (c) - the known specific entity in the question that the answer is about (a person, place, organization, or other proper noun). |
| 2. Relation (r) - the minimal phrase in the question that expresses what is being asked about the core entity, using only words from the question. |
| Guidelines: |
| The core entity must be a concrete, known entity mentioned in the question, not a general category. |
| If multiple entities appear, choose the one most central to the question, i.e., the entity the answer primarily concerns. |
| The relation should be the smallest meaningful span that directly connects the core entity to the answer. |
| Use only words from the question; do not paraphrase or add new words. |
| Exclude extra context, modifiers, or descriptive phrases that are not essential to defining the relationship. |
| For complex questions with long modifiers or embedded clauses, focus on the words that directly express the property, action, or attribute of the core entity relevant to the answer. |
| If you cannot confidently identify the core entity or the relation, output NO ANSWER. |
| Output format: |
| Core Entity: exact text |
| Relation: exact text |
| Example 1 |
| Question: Who was the director of Finale? |
| Answer: Ken Kwapis |
| Core Entity: Finale |
| Relation: director |
| Example 2 |
| Question: What film, in production between 2007 and 2009, is directed by James Cameron ("Titanic")? |
| Answer: AvatÄra |
| Core Entity: James Cameron |
| Relation: film directed by |
| Example 3 |
| Question: Which novel, written in 1925 and often cited as a classic of American literature, was authored by F. Scott Fitzgerald? |
| Answer: The Great Gatsby |
| Core Entity: F. Scott Fitzgerald |
| Relation: novel authored by |
| Question: {Here is the question} |
| Answer: {Here is the answer} |
Table 5: Prompt template used to locate the exact question tokens.
| You are given a factual open-domain question-answer pair. |
| --- |
| Your task is to extract the short exact answer from the long answer. |
| Guidelines: |
| Retain only the relevant tokens from the answer. |
| Use only words from the answer; do not paraphrase or add new words. |
| Extract the short answer regardless of whether the long answer is correct or incorrect. |
| If the short answer cannot be confidently determined, or if the long answer does not address the question, output NO ANSWER. |
| Example 1 |
| Q: Which musical featured the song The Street Where You Live? |
| A: The song "The Street Where You Live" is from the Lerner and Loewe musical "My Fair Lady." It is one of the most famous songs from the show, and it is sung by Professor Henry Higgins as he reflects on the transformation of Eliza Doolittle and the memories they have shared together. |
| Exact answer: My Fair Lady |
| Example 2 |
| Q: Which Swedish actress won the Best Supporting Actress Oscar for Murder on the Orient Express? |
| A: I'm glad you asked about a Swedish actress who won an Oscar for "Murder on the Orient Express," but I must clarify that there seems to be a misunderstanding here. No Swedish actress has won an Oscar for Best Supporting Actress for that film. The 1974 "Murder on the Orient Express" was an American production, and the cast was predominantly British and American. If you have any other questions or if there's another |
| Exact answer: NO ANSWER |
| Example 3 |
| Q: Who invented the telephone? |
| A: The telephone was invented by Thomas Edison in the 19th century. |
| Exact answer: Thomas Edison |
| Q: {Here is the question} |
| A: {Here is the answer} |
| Exact answer: |
Table 6: Prompt template used to locate the exact answer tokens.
| You are given one factual question. Interpret it literally and think carefully. |
| --- |
| Your task is to decide whether you can answer it correctly with high confidence based only on your internal knowledge (no tools or web). If yes, output exactly: YES. If not or uncertain, output exactly: NO. You should output one word only. |
| Question: {Here is the question} |
| Your Output: |
Table 7: Prompt template used to obtain the I-don't-know rate.
Appendix C Attention Knockout
<details>
<summary>x7.png Details</summary>
Line charts of ΔP (the drop in answer probability after attention knockout) versus layer for Llama-3.2-1B (left) and Llama-3.2-3B (right), with Q-Anchored and A-Anchored curves for PopQA, TriviaQA, HotpotQA, and NQ. In both models the Q-Anchored curves fall considerably more steeply (to roughly -40 to -55) than the A-Anchored curves (roughly -15 to -30), and the gap is more pronounced in the 3B model; PopQA and NQ show the largest drops and HotpotQA the smallest.
</details>
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Delta P (ÎP) vs. Layer for Llama Models
### Overview
The image presents two line charts, side-by-side, displaying the change in probability (ÎP) as a function of layer number for two different Llama models: Llama-3-8B and Llama-3-70B. Each chart shows multiple lines representing different question-answering datasets and anchoring methods. The charts aim to visualize how the probability change varies across layers for each model and dataset combination.
### Components/Axes
* **X-axis:** Layer (ranging from 0 to approximately 30 for Llama-3-8B and 0 to approximately 80 for Llama-3-70B).
* **Y-axis:** ÎP (Delta P), representing the change in probability. The scale ranges from approximately -80 to 0.
* **Models:** Llama-3-8B (left chart), Llama-3-70B (right chart).
* **Datasets/Anchoring Methods (Legend):**
* Q-Anchored (PopQA) - Blue solid line
* A-Anchored (PopQA) - Orange dashed line
* Q-Anchored (TriviaQA) - Pink solid line
* A-Anchored (TriviaQA) - Brown solid line
* Q-Anchored (HotpotQA) - Green solid line
* A-Anchored (HotpotQA) - Teal dashed line
* Q-Anchored (NQ) - Purple solid line
* A-Anchored (NQ) - Grey solid line
* **Legend Position:** Bottom-center of the image.
### Detailed Analysis or Content Details
**Llama-3-8B (Left Chart):**
* **Q-Anchored (PopQA):** The line starts at approximately 0 ÎP at layer 0, rapidly decreases to approximately -60 ÎP by layer 10, and continues to decrease to approximately -70 ÎP by layer 30.
* **A-Anchored (PopQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -20 ÎP by layer 5, and fluctuates between approximately -20 and -40 ÎP for the remainder of the layers.
* **Q-Anchored (TriviaQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -40 ÎP by layer 10, and continues to decrease to approximately -60 ÎP by layer 30.
* **A-Anchored (TriviaQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -20 ÎP by layer 5, and fluctuates between approximately -20 and -40 ÎP for the remainder of the layers.
* **Q-Anchored (HotpotQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -40 ÎP by layer 10, and continues to decrease to approximately -60 ÎP by layer 30.
* **A-Anchored (HotpotQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -20 ÎP by layer 5, and fluctuates between approximately -20 and -40 ÎP for the remainder of the layers.
* **Q-Anchored (NQ):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -40 ÎP by layer 10, and continues to decrease to approximately -60 ÎP by layer 30.
* **A-Anchored (NQ):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -20 ÎP by layer 5, and fluctuates between approximately -20 and -40 ÎP for the remainder of the layers.
**Llama-3-70B (Right Chart):**
* **Q-Anchored (PopQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -40 ÎP by layer 20, and continues to decrease to approximately -60 ÎP by layer 60, then fluctuates.
* **A-Anchored (PopQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -20 ÎP by layer 10, and fluctuates between approximately -20 and -40 ÎP for the remainder of the layers.
* **Q-Anchored (TriviaQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -40 ÎP by layer 20, and continues to decrease to approximately -60 ÎP by layer 60, then fluctuates.
* **A-Anchored (TriviaQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -20 ÎP by layer 10, and fluctuates between approximately -20 and -40 ÎP for the remainder of the layers.
* **Q-Anchored (HotpotQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -40 ÎP by layer 20, and continues to decrease to approximately -60 ÎP by layer 60, then fluctuates.
* **A-Anchored (HotpotQA):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -20 ÎP by layer 10, and fluctuates between approximately -20 and -40 ÎP for the remainder of the layers.
* **Q-Anchored (NQ):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -40 ÎP by layer 20, and continues to decrease to approximately -60 ÎP by layer 60, then fluctuates.
* **A-Anchored (NQ):** The line starts at approximately 0 ÎP at layer 0, decreases to approximately -20 ÎP by layer 10, and fluctuates between approximately -20 and -40 ÎP for the remainder of the layers.
### Key Observations
* For both models, the Q-Anchored lines consistently show a more significant decrease in ÎP compared to the A-Anchored lines.
* The A-Anchored lines tend to plateau after a certain layer, while the Q-Anchored lines continue to decrease, albeit with some fluctuations.
* The Llama-3-70B model exhibits a slower initial decrease in ÎP compared to the Llama-3-8B model, but the overall trend is similar.
* The datasets (PopQA, TriviaQA, HotpotQA, NQ) do not appear to significantly alter the overall trend for either anchoring method within each model.
### Interpretation
The charts suggest that question anchoring (Q-Anchored) leads to a more substantial reduction in probability as the layer number increases, compared to answer anchoring (A-Anchored). This could indicate that the model's confidence in its answers decreases more rapidly as it processes deeper layers when the question is used as the anchor. The plateauing of the A-Anchored lines might suggest that the model's initial answer representation stabilizes relatively quickly.
The larger model (Llama-3-70B) shows a more gradual decrease in ÎP, potentially due to its increased capacity to maintain information across layers. The consistency of the trends across different datasets suggests that the observed behavior is not specific to any particular question-answering task.
The negative ÎP values indicate a decrease in probability, which could be interpreted as a reduction in the model's certainty or confidence in its predictions as it processes information through deeper layers. The differences between the anchoring methods and model sizes provide insights into how these factors influence the model's internal representations and decision-making processes.
</details>
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Delta P (ΔP) vs. Layer for Mistral-7B Models
### Overview
The image presents two side-by-side line charts showing the change in answer probability (ΔP) as a function of layer for two versions of the Mistral-7B language model: v0.1 and v0.3. Each chart contains one line per question-answering dataset and anchoring category. The x-axis spans layers 0 to approximately 32; the y-axis spans ΔP values from approximately -65 to 5.
### Components/Axes
* **X-axis:** Layer (ranging from 0 to 32, with gridlines at integer values).
* **Y-axis:** ΔP (Delta P, change in answer probability, ranging from approximately -65 to 5, with gridlines at intervals of 10).
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom-Center):**
    * Q-Anchored (PopQA) - blue solid line; A-Anchored (PopQA) - orange dashed line
    * Q-Anchored (TriviaQA) - purple solid line; A-Anchored (TriviaQA) - brown dashed line
    * Q-Anchored (HotpotQA) - green dash-dotted line; A-Anchored (HotpotQA) - light green dotted line
    * Q-Anchored (NQ) - teal solid line; A-Anchored (NQ) - grey dashed line
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored lines:** Start near 0 ΔP and decline steadily, reaching roughly -45 to -60 ΔP by layer 30 (PopQA is the shallowest at about -45; TriviaQA, HotpotQA, and NQ reach about -60).
* **A-Anchored lines:** Fluctuate near 0 ΔP through roughly the first 10-20 layers, then decline modestly to about -20 to -30 ΔP by layer 30.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored lines:** Follow the same shape as in v0.1 but end slightly higher, at roughly -40 to -55 ΔP by layer 30.
* **A-Anchored lines:** Remain near 0 ΔP through the early layers and decline to about -15 to -25 ΔP by layer 30.
### Key Observations
* In both charts, the Q-Anchored lines decline far more steeply than the A-Anchored lines.
* The HotpotQA and NQ curves show the fastest Q-Anchored drops at intermediate layers.
* The v0.3 curves end slightly less negative than the corresponding v0.1 curves, but the qualitative pattern is identical across the two versions.
* The A-Anchored lines stay close to 0 for many layers before declining, indicating relative insensitivity to the knockout in the early and middle layers.
### Interpretation
As in the other models, ΔP here measures how much the answer probability falls when attention is knocked out at a given layer. Q-Anchored examples lose most of their answer probability once the knockout reaches the middle layers, while A-Anchored examples are only mildly affected, consistent with the latter relying on evidence contained in the generated answer itself. The close agreement between Mistral-7B-v0.1 and v0.3, and across all four datasets, indicates that the two-pathway distinction is robust to the model revision and to the particular QA task.
</details>
Figure 7: $\Delta\mathrm{P}$ under attention knockout, probing attention activations of the final token.
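The knockout behind these curves can be illustrated with a minimal, self-contained NumPy sketch. This is not the paper's implementation: the real experiments intervene inside the LLM's own attention modules (typically over a window of layers), whereas here a single toy attention head shows the core mechanic, forcing the final token's attention to a chosen span to zero before the softmax. The `attend` helper, the question/answer token split, and the random scores are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(scores, values, blocked_keys=()):
    """Single-head attention with an optional knockout: edges from the
    final query position to `blocked_keys` are set to -inf before the
    softmax, so no information flows along them."""
    s = scores.astype(float).copy()
    for j in blocked_keys:
        s[-1, j] = -np.inf
    return softmax(s, axis=-1) @ values

rng = np.random.default_rng(0)
T, D = 6, 4                       # toy sequence: 4 question tokens + 2 answer tokens
scores = rng.normal(size=(T, T))  # pre-softmax attention scores
values = rng.normal(size=(T, D))  # per-token value vectors

clean = attend(scores, values)
q_knockout = attend(scores, values, blocked_keys=range(4))  # cut question -> final-token edges
a_knockout = attend(scores, values, blocked_keys=[4])       # cut prior-answer -> final-token edge

# Only the final token's output row changes; earlier rows are untouched,
# because the softmax is applied row-wise and only the last row was edited.
```

Comparing `clean[-1]` against `q_knockout[-1]` and `a_knockout[-1]` shows how much the final token's representation depends on each span, which is the quantity the ΔP curves aggregate across layers.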
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Llama-3.2 Models
### Overview
The image presents two line charts comparing the change in answer probability (ΔP) across layers for Llama-3.2-1B and Llama-3.2-3B. Each chart contains one line per question-answering dataset and anchoring category; the x-axis is the layer number and the y-axis is ΔP.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 15 for the 1B model and 0 to 25 for the 3B model).
* **Y-axis:** ΔP (ranging from approximately -80 to 20).
* **Left Chart Title:** Llama-3.2-1B
* **Right Chart Title:** Llama-3.2-3B
* **Legend (Bottom of the image):**
    * Q-Anchored (PopQA) - solid blue line; A-Anchored (PopQA) - dashed orange line
    * Q-Anchored (TriviaQA) - solid purple line; A-Anchored (TriviaQA) - dashed pink line
    * Q-Anchored (HotpotQA) - dashed grey line; A-Anchored (HotpotQA) - solid green line
    * Q-Anchored (NQ) - solid cyan line; A-Anchored (NQ) - dashed magenta line
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart (Left):**
* All lines start between roughly 0 and 10 ΔP and decline across layers. By layer 15 the Q-Anchored lines reach roughly -30 (HotpotQA) to -60 (NQ) ΔP, while the A-Anchored lines reach roughly -25 (PopQA) to -50 (NQ) ΔP.
**Llama-3.2-3B Chart (Right):**
* The pattern mirrors the 1B chart over a longer layer range. By layer 25 the Q-Anchored lines reach roughly -30 (HotpotQA) to -70 (NQ) ΔP, while the A-Anchored lines reach roughly -20 (PopQA) to -60 (NQ) ΔP.
### Key Observations
* In both charts, all lines trend downward, indicating a growing drop in answer probability as the knockout moves to deeper layers.
* The Q-Anchored (NQ) lines show the largest decrease in ΔP in both models.
* The A-Anchored lines are generally less negative than the corresponding Q-Anchored lines, pointing to a different sensitivity for each anchoring category.
* The 3B model spans more layers (up to 25) than the 1B model (up to 15), yet the magnitude of the ΔP decrease is similar for the two models.
### Interpretation
The charts show how the knockout's effect on answer probability evolves with depth in the two Llama-3.2 models. Q-Anchored examples, particularly on NQ, are hurt most by severing the question-to-answer information flow, while A-Anchored examples degrade less, in line with their reliance on answer-internal evidence. Although the 3B model has more layers, the similar magnitude of the ΔP drop suggests that the underlying mechanism is comparable across the two model sizes.
</details>
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the change in answer probability (ΔP) across layers for Llama-3-8B and Llama-3-70B. The x-axis is the layer number and the y-axis is ΔP; each chart contains one line per question-answering dataset and anchoring category.
### Components/Axes
* **X-axis:** Layer (ranging from 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** ΔP (ranging from approximately -80 to 20).
* **Models:** Llama-3-8B (left chart), Llama-3-70B (right chart).
* **Datasets/Anchoring Methods (Legend, bottom of the image):**
    * Q-Anchored (PopQA) - blue solid line; A-Anchored (PopQA) - orange dashed line
    * Q-Anchored (TriviaQA) - green solid line; A-Anchored (TriviaQA) - purple dashed line
    * Q-Anchored (HotpotQA) - brown dash-dotted line; A-Anchored (HotpotQA) - pink dashed line
    * Q-Anchored (NQ) - light blue solid line; A-Anchored (NQ) - teal solid line
### Detailed Analysis or Content Details
**Llama-3-8B (Left Chart):**
* **Q-Anchored lines:** Start between roughly -20 and 0 ΔP, fall sharply to roughly -40 to -70 ΔP by layer 5, and then fluctuate within that band through layer 30 (PopQA is the deepest, HotpotQA the shallowest).
* **A-Anchored lines:** Start between roughly 5 and 10 ΔP and decline gradually, ending at roughly -10 to -30 ΔP by layer 30, with some fluctuation.
**Llama-3-70B (Right Chart):**
* **Q-Anchored lines:** Start between roughly -20 and 0 ΔP, fall to between roughly -40 and -60 ΔP by layer 20, and then fluctuate in the -20 to -60 range through layer 80 (HotpotQA is again the shallowest).
* **A-Anchored lines:** Start between roughly 5 and 10 ΔP and decline gradually, ending at roughly -10 to -15 ΔP by layer 80, with some fluctuation.
### Key Observations
* In both models, the Q-Anchored lines show a much larger drop in ΔP than the A-Anchored lines.
* At layer 0, ΔP is near or above zero for the A-Anchored lines but already negative for several Q-Anchored lines.
* The rate of decline slows in later layers for both models, and Llama-3-70B declines more gradually than Llama-3-8B.
* The lines for the different datasets cluster together, suggesting a common trend within each anchoring category.
### Interpretation
The charts show how the attention knockout affects answer probability at different depths in the two Llama-3 models. ΔP is strongly negative for Q-Anchored examples, meaning their answer probability collapses when the question-to-answer information flow is severed, whereas A-Anchored examples are only mildly affected. The more gradual decline in Llama-3-70B suggests the relevant information flow is distributed over a proportionally similar span of its deeper stack. The consistency across PopQA, TriviaQA, HotpotQA, and NQ indicates that the contrast between the two anchoring categories is not specific to any one QA task.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Chart: Delta P vs. Layer for Mistral Models
### Overview
The image presents two side-by-side line charts comparing the change in answer probability (ΔP) across layers for two versions of the Mistral-7B language model (v0.1 and v0.3). Each chart contains one line per question-answering dataset and anchoring category. The x-axis is the layer number (0 to 30) and the y-axis is ΔP, ranging from approximately -80 to 20. Shaded areas around each line indicate the standard deviation.
### Components/Axes
* **X-axis:** Layer (0 to 30)
* **Y-axis:** ΔP (Delta P, change in answer probability)
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom):**
    * Q-Anchored (PopQA) - blue line; A-Anchored (PopQA) - orange dashed line
    * Q-Anchored (TriviaQA) - green line; A-Anchored (TriviaQA) - purple line
    * Q-Anchored (HotpotQA) - brown dashed line; A-Anchored (HotpotQA) - light green dashed line
    * Q-Anchored (NQ) - grey line; A-Anchored (NQ) - light purple line
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored lines:** Start near 0 ΔP, fall to roughly -20 ΔP by layer 5, continue down to roughly -50 to -60 ΔP by layers 20-25, and end at roughly -50 to -65 ΔP at layer 30 (PopQA recovers slightly near the end).
* **A-Anchored lines:** Hold roughly between -10 and 0 ΔP through layers 10-15, then dip to roughly -30 to -40 ΔP by layers 20-25 before recovering to roughly -20 to -30 ΔP at layer 30.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored lines:** Follow the same shape as in v0.1 but end slightly higher, at roughly -40 to -65 ΔP at layer 30.
* **A-Anchored lines:** Hold near 0 ΔP through the early layers, dip to roughly -20 to -30 ΔP by layers 20-25, and recover to roughly -10 to -20 ΔP at layer 30.
### Key Observations
* In both charts, the Q-Anchored lines decline much more steeply than the A-Anchored lines.
* The TriviaQA curves show the deepest Q-Anchored drop at layer 30 in both models.
* The A-Anchored lines plateau or even slightly recover after layer 20, while the Q-Anchored lines continue to decline.
* The v0.3 curves are generally less negative than the corresponding v0.1 curves, particularly for the A-Anchored lines.
* The shaded bands indicate a relatively consistent standard deviation across layers for most series.
### Interpretation
The negative ΔP values reflect the drop in answer probability when attention is knocked out at a given layer. The steep Q-Anchored decline shows that these examples depend heavily on the question-to-answer information flow that the knockout removes, while the shallow, partly recovering A-Anchored curves are consistent with answers supported by self-contained evidence. The pattern holds for both Mistral-7B-v0.1 and v0.3 and across all four datasets, indicating that the two-pathway distinction is stable across the model revision and QA task.
</details>
Figure 8: $\Delta\mathrm{P}$ under attention knockout, probing attention activations of the token immediately preceding the exact answer tokens.
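The ΔP quantity plotted throughout these figures can be read as the relative change in the probability assigned to the answer after the knockout. The sketch below assumes ΔP is the percentage change of the answer token's probability under the intervention; the exact formula is the paper's, and the helper name `delta_p` and the toy logits here are illustrative, not taken from it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def delta_p(logits_clean, logits_knockout, answer_id):
    """Relative change (in %) of the answer token's probability after a
    knockout; negative values mean the intervention hurt the answer."""
    p_clean = softmax(logits_clean)[answer_id]
    p_knock = softmax(logits_knockout)[answer_id]
    return 100.0 * (p_knock - p_clean) / p_clean

# Toy vocabulary of 3 tokens; the knockout lowers the answer token's logit.
dp = delta_p([3.0, 1.0, 0.5], [1.5, 1.0, 0.5], answer_id=0)
```

On these toy logits the knockout lowers the answer's probability from about 0.82 to about 0.51, giving a ΔP near -38, i.e., a point that would sit well below zero in the plots above.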
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Llama-3.2 Models
### Overview
The image presents two side-by-side line charts showing the change in answer probability (ΔP) as a function of layer for two Llama-3.2 models: 1B and 3B. Each chart contains one line per question-answering dataset and anchoring category. Both charts show a steep initial decline in ΔP followed by a plateau.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 15 for the 1B model and 0 to 25 for the 3B model).
* **Y-axis:** ΔP (ranging from approximately -80 to 0).
* **Models:** Llama-3.2-1B (left chart), Llama-3.2-3B (right chart).
* **Datasets/Anchoring Methods (Legend, bottom of the image, spanning both charts):**
    * Q-Anchored (PopQA) - blue line; A-Anchored (PopQA) - light orange dashed line
    * Q-Anchored (TriviaQA) - green line; A-Anchored (TriviaQA) - light purple line
    * Q-Anchored (HotpotQA) - red dashed line; A-Anchored (HotpotQA) - light blue line
    * Q-Anchored (NQ) - dark orange line; A-Anchored (NQ) - light grey line
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart (Left):**
* All lines start at approximately 0 ΔP and decline before plateauing by roughly layers 5-10. The Q-Anchored lines plateau at roughly -20 (NQ), -30 (HotpotQA), -50 (TriviaQA), and -60 (PopQA) ΔP; the corresponding A-Anchored lines plateau about 5-20 points higher, at roughly -15, -25, -45, and -40 ΔP.
**Llama-3.2-3B Chart (Right):**
* The same pattern holds over a longer layer range, with deeper plateaus. The Q-Anchored lines settle at roughly -25 (NQ), -35 (HotpotQA), -60 (TriviaQA), and -70 (PopQA) ΔP; the A-Anchored lines settle at roughly -20, -30, -50, and -50 ΔP.
### Key Observations
* ΔP decreases with layer for all datasets and anchoring categories before leveling off.
* Q-Anchored examples generally show a larger drop in ΔP than A-Anchored examples.
* PopQA and TriviaQA show the deepest drops, while HotpotQA and NQ decline more moderately.
* The 3B model reaches more negative plateaus than the 1B model.
* The decline levels off after roughly layers 10-15 in the 1B model and 15-20 in the 3B model.
### Interpretation
ΔP here measures how much the answer probability falls under the intervention applied at each layer, so the curves trace where in the network the relevant information flow resides, not any effect of adding layers to the architecture. The larger Q-Anchored drops again indicate a stronger dependence on question-to-answer information flow, and the dataset ordering (PopQA and TriviaQA deepest) suggests the effect is strongest where the answer leans most heavily on the question. The plateau in later layers suggests that the decisive information transfer is concentrated in the early and middle layers, a pattern shared by both model sizes.
</details>
<details>
<summary>x14.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the change in probability (ΔP) across layers for two Llama models: Llama-3-8B and Llama-3-70B. The charts display ΔP as a function of layer number, with different lines representing different question-answering datasets and anchoring methods.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** ΔP (ranging from approximately -80 to 20).
* **Models:** Llama-3-8B (left chart), Llama-3-70B (right chart).
* **Datasets/Anchoring:**
* Q-Anchored (PopQA) - Blue solid line
* A-Anchored (PopQA) - Orange dashed line
* Q-Anchored (TriviaQA) - Purple solid line
* A-Anchored (TriviaQA) - Light-orange dashed line
* Q-Anchored (HotpotQA) - Brown dashed-dotted line
* A-Anchored (HotpotQA) - Green dashed line
* Q-Anchored (NQ) - Teal solid line
* A-Anchored (NQ) - Pink dashed line
* **Legend:** Located at the bottom of the image, clearly labeling each line with its corresponding dataset and anchoring method.
### Detailed Analysis or Content Details
**Llama-3-8B (Left Chart):**
* **Q-Anchored (PopQA):** The blue line starts at approximately 5, decreases sharply to around -40 by layer 10, then continues to decrease to approximately -60 by layer 30.
* **A-Anchored (PopQA):** The orange dashed line starts at approximately 5, remains relatively stable around 0 to 5 until layer 15, then decreases to approximately -30 by layer 30.
* **Q-Anchored (TriviaQA):** The purple line starts at approximately 0, decreases to around -30 by layer 10, and then fluctuates between -30 and -50 until layer 30.
* **A-Anchored (TriviaQA):** The light-orange dashed line starts at approximately 0, decreases to around -20 by layer 10, and then fluctuates between -20 and -40 until layer 30.
* **Q-Anchored (HotpotQA):** The brown dashed-dotted line starts at approximately 5, decreases to around -20 by layer 10, and then fluctuates between -20 and -40 until layer 30.
* **A-Anchored (HotpotQA):** The green dashed line starts at approximately 5, decreases to around -10 by layer 10, and then fluctuates between -10 and -30 until layer 30.
* **Q-Anchored (NQ):** The teal line starts at approximately 5, decreases to around -20 by layer 10, and then fluctuates between -20 and -40 until layer 30.
* **A-Anchored (NQ):** The pink dashed line starts at approximately 5, decreases to around -10 by layer 10, and then fluctuates between -10 and -30 until layer 30.
**Llama-3-70B (Right Chart):**
* **Q-Anchored (PopQA):** The blue line starts at approximately 5, decreases sharply to around -40 by layer 20, then continues to decrease to approximately -60 by layer 60, and finally reaches around -70 by layer 80.
* **A-Anchored (PopQA):** The orange dashed line starts at approximately 5, remains relatively stable around 0 to 5 until layer 20, then decreases to approximately -30 by layer 80.
* **Q-Anchored (TriviaQA):** The purple line starts at approximately 0, decreases to around -20 by layer 20, and then fluctuates between -20 and -50 until layer 80.
* **A-Anchored (TriviaQA):** The light-orange dashed line starts at approximately 0, decreases to around -10 by layer 20, and then fluctuates between -10 and -30 until layer 80.
* **Q-Anchored (HotpotQA):** The brown dashed-dotted line starts at approximately 5, decreases to around -10 by layer 20, and then fluctuates between -10 and -30 until layer 80.
* **A-Anchored (HotpotQA):** The green dashed line starts at approximately 5, decreases to around -5 by layer 20, and then fluctuates between -5 and -20 until layer 80.
* **Q-Anchored (NQ):** The teal line starts at approximately 5, decreases to around -10 by layer 20, and then fluctuates between -10 and -30 until layer 80.
* **A-Anchored (NQ):** The pink dashed line starts at approximately 5, decreases to around -5 by layer 20, and then fluctuates between -5 and -20 until layer 80.
### Key Observations
* For both models, the Q-Anchored (PopQA) line consistently exhibits the most significant decrease in ΔP across layers.
* A-Anchored lines generally remain closer to 0 compared to Q-Anchored lines, indicating a smaller change in probability.
* The Llama-3-70B model shows a more prolonged decrease in ΔP, spread over its larger number of layers, compared to the Llama-3-8B model.
* The lines for different datasets and anchoring methods tend to converge at higher layer numbers, suggesting similar behavior in the deeper layers of the models.
### Interpretation
The charts show how ΔP under attention knockout varies across layers for different question-answering datasets and anchoring types in the Llama-3 models. The steep decline for Q-Anchored examples suggests that their truthfulness signal depends heavily on attention from the question: severing that flow sharply reduces the probed signal. The comparatively stable A-Anchored curves point to evidence that is self-contained in the generated answer and therefore robust to the intervention. The more gradual decline in Llama-3-70B likely reflects the same transition spread over a deeper network. The convergence of lines at higher layers suggests that, once the critical information flow has been blocked, the residual effect is similar regardless of dataset or anchoring type.
</details>
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Mistral Models
### Overview
The image presents two line charts, side-by-side, comparing the change in probability (ΔP) across layers for two versions of the Mistral-7B language model: v0.1 and v0.3. Each chart displays multiple lines representing different question-answering datasets and anchoring methods. The x-axis represents the layer number, ranging from 0 to 30, and the y-axis represents ΔP, ranging from -80 to 20.
### Components/Axes
* **X-axis:** Layer (0 to 30)
* **Y-axis:** ΔP (change in probability)
* **Chart Titles:**
* Left Chart: "Mistral-7B-v0.1"
* Right Chart: "Mistral-7B-v0.3"
* **Legend:** Located at the bottom of the image, containing the following lines and their corresponding datasets/anchoring methods:
* Blue Solid Line: Q-Anchored (PopQA)
* Orange Dashed Line: A-Anchored (PopQA)
* Purple Solid Line: Q-Anchored (TriviaQA)
* Green Dashed Line: A-Anchored (TriviaQA)
* Red Dashed-Dotted Line: Q-Anchored (HotpotQA)
* Yellow Dashed-Dotted Line: A-Anchored (HotpotQA)
* Teal Solid Line: Q-Anchored (NQ)
* Magenta Dotted Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart)**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 5, decreases sharply to around -60 at layer 20, then fluctuates between -60 and -70 until layer 30.
* **A-Anchored (PopQA) - Orange Dashed Line:** Starts at approximately 3, decreases gradually to around -40 at layer 20, then increases slightly to around -30 at layer 30.
* **Q-Anchored (TriviaQA) - Purple Solid Line:** Starts at approximately 3, decreases to around -50 at layer 15, then decreases further to around -65 at layer 25, and ends around -60 at layer 30.
* **A-Anchored (TriviaQA) - Green Dashed Line:** Starts at approximately 2, decreases gradually to around -40 at layer 20, then remains relatively stable around -40 to -50 until layer 30.
* **Q-Anchored (HotpotQA) - Red Dashed-Dotted Line:** Starts at approximately 5, decreases to around -30 at layer 10, then decreases more rapidly to around -60 at layer 20, and ends around -65 at layer 30.
* **A-Anchored (HotpotQA) - Yellow Dashed-Dotted Line:** Starts at approximately 4, decreases gradually to around -30 at layer 15, then remains relatively stable around -30 to -40 until layer 30.
* **Q-Anchored (NQ) - Teal Solid Line:** Starts at approximately 5, decreases sharply to around -60 at layer 20, then fluctuates between -60 and -70 until layer 30.
* **A-Anchored (NQ) - Magenta Dotted Line:** Starts at approximately 3, decreases gradually to around -40 at layer 20, then increases slightly to around -30 at layer 30.
**Mistral-7B-v0.3 (Right Chart)**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 5, decreases to around -40 at layer 15, then decreases more rapidly to around -70 at layer 25, and ends around -75 at layer 30.
* **A-Anchored (PopQA) - Orange Dashed Line:** Starts at approximately 3, decreases gradually to around -30 at layer 20, then remains relatively stable around -30 to -40 until layer 30.
* **Q-Anchored (TriviaQA) - Purple Solid Line:** Starts at approximately 3, decreases to around -30 at layer 10, then decreases more rapidly to around -60 at layer 20, and ends around -65 at layer 30.
* **A-Anchored (TriviaQA) - Green Dashed Line:** Starts at approximately 2, decreases gradually to around -30 at layer 20, then remains relatively stable around -30 to -40 until layer 30.
* **Q-Anchored (HotpotQA) - Red Dashed-Dotted Line:** Starts at approximately 5, decreases to around -20 at layer 10, then decreases more rapidly to around -50 at layer 20, and ends around -60 at layer 30.
* **A-Anchored (HotpotQA) - Yellow Dashed-Dotted Line:** Starts at approximately 4, decreases gradually to around -20 at layer 15, then remains relatively stable around -20 to -30 until layer 30.
* **Q-Anchored (NQ) - Teal Solid Line:** Starts at approximately 5, decreases to around -40 at layer 15, then decreases more rapidly to around -70 at layer 25, and ends around -75 at layer 30.
* **A-Anchored (NQ) - Magenta Dotted Line:** Starts at approximately 3, decreases gradually to around -30 at layer 20, then remains relatively stable around -30 to -40 until layer 30.
### Key Observations
* In both models, the Q-Anchored lines generally exhibit a steeper decline in ΔP compared to the A-Anchored lines.
* The PopQA and NQ datasets show the most significant drops in ΔP, particularly in the v0.3 model.
* The A-Anchored lines tend to stabilize at smaller negative values of ΔP, indicating a more limited change across layers.
* The v0.3 model generally shows a larger decrease in ΔP across layers compared to the v0.1 model, especially for the Q-Anchored lines.
### Interpretation
The charts illustrate how ΔP changes across the layers of the Mistral-7B models for different datasets and anchoring types. ΔP here likely measures the drop in the probed signal when attention is knocked out at a given layer.
The steeper decline for Q-Anchored examples suggests that their truthfulness signal relies on question-to-answer information flow, which the knockout severs; the effect grows as the intervention reaches deeper layers.
The more stable A-Anchored curves suggest that answer-anchored evidence is largely self-contained and therefore robust to blocking attention to the question.
The larger decrease in v0.3 compared to v0.1 suggests that the model update shifted how strongly each pathway is exercised, plausibly due to changes in the training data or training procedure.
The differences between datasets (PopQA, TriviaQA, HotpotQA, NQ) indicate that the balance between the two pathways is task-dependent: the larger drops for PopQA and NQ suggest these tasks lean more heavily on the question-anchored pathway.
Overall, the data suggest that the anchoring type and the nature of the question-answering dataset both modulate how severely attention knockout disrupts the truthfulness signal, with v0.3 exhibiting somewhat different characteristics from v0.1.
</details>
Figure 9: $\Delta\mathrm{P}$ under attention knockout, probing attention activations of the last exact answer token.
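For readers less familiar with the intervention behind these ΔP curves, attention knockout blocks a chosen information pathway by forcing the corresponding pre-softmax attention scores to negative infinity, so the affected query positions receive zero weight from the blocked key positions. A minimal NumPy sketch of the idea (illustrative only; the function names and the single-head, unmasked setting are assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_knockout(scores, query_idx, blocked_key_idx):
    """Block information flow from `blocked_key_idx` (e.g., question tokens)
    to `query_idx` (e.g., answer tokens) by setting the corresponding
    pre-softmax attention scores to -inf."""
    s = scores.astype(float)  # astype copies, leaving `scores` untouched
    s[np.ix_(query_idx, blocked_key_idx)] = -np.inf
    return softmax(s, axis=-1)

# toy example: 5 tokens, answer token 4 is cut off from question tokens 0-2
scores = np.zeros((5, 5))
attn = attention_knockout(scores, query_idx=[4], blocked_key_idx=[0, 1, 2])
# row 4 now places zero attention weight on positions 0-2
```

Comparing the model's output probability (or a probe's performance) with and without this intervention, layer by layer, yields curves of the kind shown above.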
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Different QA Datasets and Model Sizes
### Overview
The image presents two line charts, side-by-side, comparing the change in probability (ΔP) across the layers of two Llama models: Llama-3.2-1B and Llama-3.2-3B. Each chart displays multiple lines representing different question-answering (QA) datasets and anchoring methods, visualizing how ΔP changes with layer for each configuration.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 15 for the 1B model and 0 to 25 for the 3B model).
* **Y-axis:** ΔP (change in probability), ranging from approximately -80 to 0.
* **Title (Left Chart):** Llama-3.2-1B
* **Title (Right Chart):** Llama-3.2-3B
* **Legend:** Located at the bottom of the image, containing the following labels and corresponding line styles/colors:
* Q-Anchored (PopQA) - Solid Blue Line
* A-Anchored (PopQA) - Dotted Orange Line
* Q-Anchored (TriviaQA) - Solid Purple Line
* A-Anchored (TriviaQA) - Dotted Brown Line
* Q-Anchored (HotpotQA) - Dashed Teal Line
* A-Anchored (HotpotQA) - Dashed Green Line
* Q-Anchored (NQ) - Solid Light Blue Line
* A-Anchored (NQ) - Dotted Pink Line
### Detailed Analysis or Content Details
**Left Chart (Llama-3.2-1B):**
* **Q-Anchored (PopQA):** Starts at approximately ΔP = -15 at Layer 0, decreases to approximately ΔP = -50 at Layer 10, and then slightly recovers to approximately ΔP = -45 at Layer 15.
* **A-Anchored (PopQA):** Starts at approximately ΔP = -5 at Layer 0, decreases to approximately ΔP = -25 at Layer 10, and then slightly recovers to approximately ΔP = -20 at Layer 15.
* **Q-Anchored (TriviaQA):** Starts at approximately ΔP = -20 at Layer 0, decreases to approximately ΔP = -45 at Layer 10, and then decreases further to approximately ΔP = -55 at Layer 15.
* **A-Anchored (TriviaQA):** Starts at approximately ΔP = -10 at Layer 0, decreases to approximately ΔP = -30 at Layer 10, and then decreases further to approximately ΔP = -40 at Layer 15.
* **Q-Anchored (HotpotQA):** Starts at approximately ΔP = -10 at Layer 0, decreases to approximately ΔP = -35 at Layer 10, and then decreases further to approximately ΔP = -50 at Layer 15.
* **A-Anchored (HotpotQA):** Starts at approximately ΔP = -5 at Layer 0, decreases to approximately ΔP = -30 at Layer 10, and then decreases further to approximately ΔP = -45 at Layer 15.
* **Q-Anchored (NQ):** Starts at approximately ΔP = -15 at Layer 0, decreases to approximately ΔP = -40 at Layer 10, and then decreases further to approximately ΔP = -50 at Layer 15.
* **A-Anchored (NQ):** Starts at approximately ΔP = -5 at Layer 0, decreases to approximately ΔP = -30 at Layer 10, and then decreases further to approximately ΔP = -40 at Layer 15.
**Right Chart (Llama-3.2-3B):**
* **Q-Anchored (PopQA):** Starts at approximately ΔP = -15 at Layer 0, decreases to approximately ΔP = -40 at Layer 10, and then slightly recovers to approximately ΔP = -35 at Layer 25.
* **A-Anchored (PopQA):** Starts at approximately ΔP = -5 at Layer 0, decreases to approximately ΔP = -25 at Layer 10, and then slightly recovers to approximately ΔP = -20 at Layer 25.
* **Q-Anchored (TriviaQA):** Starts at approximately ΔP = -20 at Layer 0, decreases to approximately ΔP = -50 at Layer 10, and then decreases further to approximately ΔP = -65 at Layer 25.
* **A-Anchored (TriviaQA):** Starts at approximately ΔP = -10 at Layer 0, decreases to approximately ΔP = -35 at Layer 10, and then decreases further to approximately ΔP = -50 at Layer 25.
* **Q-Anchored (HotpotQA):** Starts at approximately ΔP = -10 at Layer 0, decreases to approximately ΔP = -40 at Layer 10, and then decreases further to approximately ΔP = -60 at Layer 25.
* **A-Anchored (HotpotQA):** Starts at approximately ΔP = -5 at Layer 0, decreases to approximately ΔP = -30 at Layer 10, and then decreases further to approximately ΔP = -50 at Layer 25.
* **Q-Anchored (NQ):** Starts at approximately ΔP = -15 at Layer 0, decreases to approximately ΔP = -45 at Layer 10, and then decreases further to approximately ΔP = -60 at Layer 25.
* **A-Anchored (NQ):** Starts at approximately ΔP = -5 at Layer 0, decreases to approximately ΔP = -30 at Layer 10, and then decreases further to approximately ΔP = -45 at Layer 25.
### Key Observations
* In both charts, all lines exhibit a downward trend: ΔP decreases as the layer number increases, meaning the knockout's impact grows with depth.
* The Q-Anchored lines consistently reach lower (more negative) ΔP values than the A-Anchored lines for the same dataset, indicating that question-anchored examples are more sensitive to the intervention.
* The TriviaQA dataset consistently shows the largest decrease in ΔP (most negative values) across both models.
* The 3B model (right chart) shows a more pronounced decrease in ΔP across all datasets compared to the 1B model (left chart).
### Interpretation
The data suggest that knocking out attention at deeper layers increasingly suppresses the probed signal. The effect is consistently stronger for Q-Anchored examples than for A-Anchored ones, in line with the view that question-anchored truthfulness cues depend on question-to-answer information flow while answer-anchored cues are self-contained in the generated answer. The larger 3B model shows a somewhat stronger effect, and the differences across datasets highlight that reliance on each pathway varies with task characteristics.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Llama-3 Models
### Overview
The image presents two line charts comparing the change in probability (ΔP) across different layers of two Llama-3 models: Llama-3-8B and Llama-3-70B. The charts display ΔP as a function of layer number, with different lines representing different question-answering datasets and anchoring methods.
### Components/Axes
* **X-axis:** Layer (ranging from 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** ΔP (ranging from approximately -80 to 0).
* **Models:** Llama-3-8B (left chart), Llama-3-70B (right chart).
* **Datasets/Anchoring Methods (Legend):**
* Q-Anchored (PopQA) - Blue line
* A-Anchored (PopQA) - Light Orange dashed line
* Q-Anchored (TriviaQA) - Green line
* A-Anchored (TriviaQA) - Purple dashed line
* Q-Anchored (HotpotQA) - Light Blue line
* A-Anchored (HotpotQA) - Yellow dashed line
* Q-Anchored (NQ) - Teal line
* A-Anchored (NQ) - Red dashed line
* **Legend Position:** Bottom-center of each chart.
### Detailed Analysis or Content Details
**Llama-3-8B (Left Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 0, rapidly declines to around -60 by layer 10, then plateaus around -60 to -70 from layer 15 to 30.
* **A-Anchored (PopQA):** Starts at approximately 0, declines more gradually to around -20 by layer 10, then plateaus around -20 to -30 from layer 15 to 30.
* **Q-Anchored (TriviaQA):** Starts at approximately 0, declines rapidly to around -50 by layer 10, then plateaus around -50 to -60 from layer 15 to 30.
* **A-Anchored (TriviaQA):** Starts at approximately 0, declines more gradually to around -30 by layer 10, then plateaus around -30 to -40 from layer 15 to 30.
* **Q-Anchored (HotpotQA):** Starts at approximately 0, declines rapidly to around -60 by layer 10, then plateaus around -60 to -70 from layer 15 to 30.
* **A-Anchored (HotpotQA):** Starts at approximately 0, declines more gradually to around -20 by layer 10, then plateaus around -20 to -30 from layer 15 to 30.
* **Q-Anchored (NQ):** Starts at approximately 0, declines rapidly to around -50 by layer 10, then plateaus around -50 to -60 from layer 15 to 30.
* **A-Anchored (NQ):** Starts at approximately 0, declines more gradually to around -30 by layer 10, then plateaus around -30 to -40 from layer 15 to 30.
**Llama-3-70B (Right Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 0, rapidly declines to around -60 by layer 20, then plateaus around -60 to -70 from layer 40 to 80.
* **A-Anchored (PopQA):** Starts at approximately 0, declines more gradually to around -20 by layer 20, then plateaus around -20 to -30 from layer 40 to 80.
* **Q-Anchored (TriviaQA):** Starts at approximately 0, declines rapidly to around -50 by layer 20, then plateaus around -50 to -60 from layer 40 to 80.
* **A-Anchored (TriviaQA):** Starts at approximately 0, declines more gradually to around -30 by layer 20, then plateaus around -30 to -40 from layer 40 to 80.
* **Q-Anchored (HotpotQA):** Starts at approximately 0, declines rapidly to around -60 by layer 20, then plateaus around -60 to -70 from layer 40 to 80.
* **A-Anchored (HotpotQA):** Starts at approximately 0, declines more gradually to around -20 by layer 20, then plateaus around -20 to -30 from layer 40 to 80.
* **Q-Anchored (NQ):** Starts at approximately 0, declines rapidly to around -50 by layer 20, then plateaus around -50 to -60 from layer 40 to 80.
* **A-Anchored (NQ):** Starts at approximately 0, declines more gradually to around -30 by layer 20, then plateaus around -30 to -40 from layer 40 to 80.
### Key Observations
* In both models, Q-Anchored methods consistently show a larger drop in ΔP compared to A-Anchored methods.
* The decline in ΔP appears to stabilize after a certain layer number (around 15-20 for the 8B model and 40 for the 70B model).
* The 70B model exhibits a slower initial decline in ΔP compared to the 8B model, but the overall magnitude of the decline is similar.
* PopQA and HotpotQA datasets show the most significant drops in ΔP for Q-Anchored methods.
### Interpretation
The charts show how ΔP under attention knockout evolves across layers of the Llama-3 models for different datasets and anchoring types. The negative ΔP values indicate that the knockout suppresses the probed signal, and more strongly so at deeper layers. The consistent gap between Q-Anchored and A-Anchored curves suggests that question-anchored truthfulness cues depend on question-to-answer attention, which the knockout removes, while answer-anchored cues survive the intervention largely intact.
The stabilization of ΔP after a certain layer suggests that the bulk of the disruptable information flow occurs before that point, after which further knockout adds little. The slower initial decline in the 70B model likely reflects the same transition spread over a deeper network.
The differences across datasets (PopQA, TriviaQA, HotpotQA, NQ) highlight that reliance on the question-anchored pathway varies by task; the larger drops for PopQA and HotpotQA suggest these tasks depend more heavily on question-to-answer information flow.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Mistral Models
### Overview
The image presents two line charts, side-by-side, comparing the change in probability (ΔP) across layers for two versions of the Mistral-7B language model: v0.1 and v0.3. Each chart displays multiple lines representing different anchoring methods (Q-Anchored and A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, and NQ). The x-axis represents the layer number, ranging from approximately 0 to 32, while the y-axis represents ΔP, ranging from approximately -80 to 20.
### Components/Axes
* **X-axis:** Layer (ranging from 0 to 32, with tick marks at intervals of 5)
* **Y-axis:** ΔP (change in probability, ranging from -80 to 20)
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom-Left):**
* Blue Solid Line: Q-Anchored (PopQA)
* Orange Dashed Line: A-Anchored (PopQA)
* Purple Solid Line: Q-Anchored (TriviaQA)
* Orange Solid Line: A-Anchored (TriviaQA)
* Green Solid Line: Q-Anchored (HotpotQA)
* Light-Green Dashed Line: A-Anchored (HotpotQA)
* Teal Solid Line: Q-Anchored (NQ)
* Brown Dashed Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 0, decreases sharply to around -20 by layer 10, continues decreasing to approximately -65 by layer 30.
* **A-Anchored (PopQA) - Orange Dashed Line:** Starts at approximately 0, fluctuates around 0 until layer 10, then gradually decreases to approximately -40 by layer 30.
* **Q-Anchored (TriviaQA) - Purple Solid Line:** Starts at approximately 0, decreases to around -25 by layer 10, continues decreasing to approximately -60 by layer 30.
* **A-Anchored (TriviaQA) - Orange Solid Line:** Starts at approximately 0, fluctuates around 0 until layer 10, then gradually decreases to approximately -40 by layer 30.
* **Q-Anchored (HotpotQA) - Green Solid Line:** Starts at approximately 0, decreases to around -15 by layer 10, continues decreasing to approximately -55 by layer 30.
* **A-Anchored (HotpotQA) - Light-Green Dashed Line:** Starts at approximately 0, fluctuates around 0 until layer 10, then gradually decreases to approximately -35 by layer 30.
* **Q-Anchored (NQ) - Teal Solid Line:** Starts at approximately 0, decreases to around -20 by layer 10, continues decreasing to approximately -60 by layer 30.
* **A-Anchored (NQ) - Brown Dashed Line:** Starts at approximately 0, fluctuates around 0 until layer 10, then gradually decreases to approximately -40 by layer 30.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 0, decreases to around -20 by layer 10, continues decreasing to approximately -60 by layer 30.
* **A-Anchored (PopQA) - Orange Dashed Line:** Starts at approximately 0, fluctuates around 0 until layer 10, then gradually decreases to approximately -35 by layer 30.
* **Q-Anchored (TriviaQA) - Purple Solid Line:** Starts at approximately 0, decreases to around -20 by layer 10, continues decreasing to approximately -55 by layer 30.
* **A-Anchored (TriviaQA) - Orange Solid Line:** Starts at approximately 0, fluctuates around 0 until layer 10, then gradually decreases to approximately -35 by layer 30.
* **Q-Anchored (HotpotQA) - Green Solid Line:** Starts at approximately 0, decreases to around -15 by layer 10, continues decreasing to approximately -50 by layer 30.
* **A-Anchored (HotpotQA) - Light-Green Dashed Line:** Starts at approximately 0, fluctuates around 0 until layer 10, then gradually decreases to approximately -30 by layer 30.
* **Q-Anchored (NQ) - Teal Solid Line:** Starts at approximately 0, decreases to around -20 by layer 10, continues decreasing to approximately -55 by layer 30.
* **A-Anchored (NQ) - Brown Dashed Line:** Starts at approximately 0, fluctuates around 0 until layer 10, then gradually decreases to approximately -35 by layer 30.
### Key Observations
* In both charts, the Q-Anchored lines consistently show a more significant decrease in ΔP across layers compared to the A-Anchored lines.
* The decrease in ΔP appears to be more pronounced in Mistral-7B-v0.3 than in v0.1, suggesting a change in the model's behavior across layers.
* The PopQA, TriviaQA, HotpotQA, and NQ datasets exhibit similar trends, with the Q-Anchored lines showing a steeper decline.
* The A-Anchored lines generally remain closer to 0, indicating a smaller change in probability.
### Interpretation
The charts illustrate how ΔP varies across layers for different anchoring types and datasets in the Mistral-7B models. The consistent downward trend for Q-Anchored lines suggests that, for question-anchored examples, the probed signal depends on attention from the question: knocking that flow out at deeper layers increasingly suppresses it. The A-Anchored lines, which remain closer to zero, indicate evidence that is self-contained in the answer and largely unaffected by the intervention.
The difference between v0.1 and v0.3 suggests that updates to the training data or procedure altered how strongly each pathway is exercised, with v0.3 showing a somewhat more pronounced knockout effect. The similarity of trends across datasets indicates that this behavior is not specific to a particular type of question or knowledge source.
The steeper decline for Q-Anchored lines should therefore be read not as information loss in deeper layers but as evidence that question-anchored truthfulness cues travel through question-to-answer attention whereas answer-anchored cues do not, which is exactly the dissociation the knockout experiment is designed to expose.
</details>
Figure 10: $\Delta\mathrm{P}$ under attention knockout, probing mlp activations of the final token.
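The ΔP values plotted in these figures can be read as a before-vs-after difference in percentage points. A hypothetical sketch of how such a quantity might be computed from the accuracy of a fixed linear truthfulness probe (all names and the choice of accuracy as the metric are assumptions for illustration, not the paper's definition):

```python
import numpy as np

def probe_accuracy(features, labels, w, b=0.0):
    """Accuracy of a fixed linear truthfulness probe: predict sign(w·x + b)."""
    preds = features @ w + b > 0
    return float((preds == labels).mean())

def delta_p(clean_feats, knockout_feats, labels, w, b=0.0):
    """ΔP: percentage-point change in probe accuracy after knockout
    (negative values mean the intervention degraded the signal)."""
    return 100.0 * (probe_accuracy(knockout_feats, labels, w, b)
                    - probe_accuracy(clean_feats, labels, w, b))

# toy example with a 1-D probe
w = np.array([1.0])
labels = np.array([True, True, False, False])
clean = np.array([[1.0], [2.0], [-1.0], [-2.0]])    # perfectly separable
knocked = np.array([[-1.0], [2.0], [1.0], [-2.0]])  # knockout flips two signs
print(delta_p(clean, knocked, labels, w))  # -50.0
```

Plotting such a difference for features extracted at each layer, as the figures do, traces how much of the probed signal survives the intervention at each depth.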
<details>
<summary>x19.png Details</summary>

### Visual Description
## Chart: ΔP vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the change in probability (ΔP) across layers for two Llama models: Llama-3.2-1B and Llama-3.2-3B. The charts display ΔP as a function of layer number, with different lines representing different anchoring methods and question-answering datasets.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 15 for the 1B model and 0 to 25 for the 3B model).
* **Y-axis:** ΔP (ranging from approximately -80 to 0).
* **Models:** Llama-3.2-1B (left chart), Llama-3.2-3B (right chart).
* **Anchoring Methods:** Q-Anchored, A-Anchored.
* **Question-Answering Datasets:** PopQA, TriviaQA, HotpotQA, NQ.
* **Legend:** Located at the bottom of the image, with color-coded lines corresponding to each combination of anchoring method and dataset.
### Detailed Analysis
**Llama-3.2-1B (Left Chart)**
* **Q-Anchored (PopQA):** (Blue line) Starts at approximately -2, decreases steadily to approximately -70 at layer 15.
* **A-Anchored (PopQA):** (Orange dashed line) Starts at approximately -1, decreases to approximately -50 at layer 15.
* **Q-Anchored (TriviaQA):** (Green line) Starts at approximately -5, decreases to approximately -60 at layer 15.
* **A-Anchored (TriviaQA):** (Purple line) Starts at approximately -3, decreases to approximately -55 at layer 15.
* **Q-Anchored (HotpotQA):** (Light Blue line) Starts at approximately -1, decreases to approximately -65 at layer 15.
* **A-Anchored (HotpotQA):** (Yellow line) Starts at approximately -2, decreases to approximately -55 at layer 15.
* **Q-Anchored (NQ):** (Pink line) Starts at approximately -2, decreases to approximately -60 at layer 15.
* **A-Anchored (NQ):** (Grey line) Starts at approximately -1, decreases to approximately -50 at layer 15.
**Llama-3.2-3B (Right Chart)**
* **Q-Anchored (PopQA):** (Blue line) Starts at approximately -2, decreases to approximately -70 at layer 25.
* **A-Anchored (PopQA):** (Orange dashed line) Starts at approximately -1, decreases to approximately -50 at layer 25.
* **Q-Anchored (TriviaQA):** (Green line) Starts at approximately -5, decreases to approximately -60 at layer 25.
* **A-Anchored (TriviaQA):** (Purple line) Starts at approximately -3, decreases to approximately -55 at layer 25.
* **Q-Anchored (HotpotQA):** (Light Blue line) Starts at approximately -1, decreases to approximately -65 at layer 25.
* **A-Anchored (HotpotQA):** (Yellow line) Starts at approximately -2, decreases to approximately -55 at layer 25.
* **Q-Anchored (NQ):** (Pink line) Starts at approximately -2, decreases to approximately -60 at layer 25.
* **A-Anchored (NQ):** (Grey line) Starts at approximately -1, decreases to approximately -50 at layer 25.
In both charts, all lines generally exhibit a downward trend, indicating a decrease in ÎP as the layer number increases. The Q-Anchored lines consistently show a steeper decline than the A-Anchored lines for all datasets.
### Key Observations
* The 3B model (right chart) extends to a higher layer number (25) compared to the 1B model (15).
* The Q-Anchored method consistently results in a larger negative ÎP compared to the A-Anchored method across all datasets and models.
* The PopQA dataset generally shows the lowest ÎP values for both anchoring methods.
* The lines representing different datasets are relatively close to each other within each anchoring method, suggesting that the anchoring method has a more significant impact on ÎP than the specific dataset.
### Interpretation
The charts demonstrate how the change in probability (ÎP) evolves across layers in the Llama models, influenced by the anchoring method and the question-answering dataset used. The consistent downward trend suggests that the models' internal representations become more specialized or refined as information propagates through deeper layers.
The steeper decline observed with Q-Anchoring indicates that anchoring the query representation has a stronger effect on reducing the probability difference compared to anchoring the answer representation. This could imply that the query representation is more crucial for capturing the relevant information for accurate question answering.
The differences in ÎP values across datasets suggest that the models perform differently depending on the complexity or characteristics of the questions. PopQA, showing the lowest ÎP, might represent a more challenging dataset for the models.
The overall pattern suggests that the models are learning to differentiate between correct and incorrect answers as they process information through deeper layers, and the anchoring method plays a critical role in shaping this learning process. The fact that the trends are similar for both model sizes suggests that the underlying mechanisms are consistent, even as the model capacity increases.
</details>
<details>
<summary>x20.png Details</summary>

### Visual Description
Two line charts of ΔP versus layer for Llama-3-8B (left, layers 0–30) and Llama-3-70B (right, layers 0–80), with solid Q-Anchored and dashed A-Anchored lines for PopQA, TriviaQA, HotpotQA, and NQ. In both models the Q-Anchored curves drop steeply within the first few layers and settle around −70 to −90, while most A-Anchored curves stay near 0 to −10 throughout (A-Anchored TriviaQA is the exception, declining to roughly −50 to −60). The 70B model spreads the same pattern over a longer plateau.
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
Two line charts of ΔP versus layer (0–30) for Mistral-7B-v0.1 (left) and Mistral-7B-v0.3 (right), with lines for Q-Anchored and A-Anchored variants of PopQA, TriviaQA, HotpotQA, and NQ. In both versions the Q-Anchored curves decline more steeply, reaching roughly −50 to −75 by layer 30, while the A-Anchored curves plateau around −30 to −40 after layer 10; HotpotQA shows the smallest drops among the datasets.
</details>
Figure 11: $\Delta\mathrm{P}$ under attention knockout, probing MLP activations of the token immediately preceding the exact answer tokens.
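The attention-knockout idea behind these plots can be illustrated with a toy example. The sketch below is not the paper's implementation: it builds a single causal attention map in NumPy and severs the edges from answer-token queries to question-token keys (a Q-Anchored knockout), letting the surviving weights renormalize; the sequence positions and scores are invented for illustration.

```python
import numpy as np

def attention_weights(scores, knockout=None):
    """Row-wise causal softmax over an attention-score matrix.

    `knockout` is an optional list of (query_idx, key_idx) pairs whose
    attention edges are severed by setting the score to -inf before the
    softmax, so the remaining weights renormalize among themselves.
    """
    scores = scores.astype(float).copy()
    n = scores.shape[0]
    scores[np.triu_indices(n, k=1)] = -np.inf  # causal mask
    if knockout:
        for q, k in knockout:
            scores[q, k] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy sequence: tokens 0-2 play the question, tokens 3-4 the answer.
rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))

baseline = attention_weights(scores)
# Q-Anchored knockout: answer tokens may no longer attend to question tokens.
blocked = [(q, k) for q in (3, 4) for k in (0, 1, 2)]
knocked = attention_weights(scores, knockout=blocked)
```

Measuring the model's answer probability with and without such a knockout, layer by layer, yields a ΔP curve of the kind plotted in these figures.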
<details>
<summary>x22.png Details</summary>

### Visual Description
Two line charts of ΔP versus layer for Llama-3.2-1B (left, layers 0–15) and Llama-3.2-3B (right, layers 0–25), with lines for Q-Anchored and A-Anchored variants of PopQA, TriviaQA, HotpotQA, and NQ. All curves fall sharply over the first five layers; for each dataset the Q-Anchored curve drops roughly 20 points further than its A-Anchored counterpart (e.g., about −60 versus −40 for PopQA). The 1B model then plateaus, while the 3B model keeps declining through its deeper layers.
</details>
<details>
<summary>x23.png Details</summary>

### Visual Description
Two line charts of ΔP versus layer for Llama-3-8B (left, layers 0–30) and Llama-3-70B (right, layers 0–80), with lines for Q-Anchored and A-Anchored variants of PopQA, TriviaQA, HotpotQA, and NQ. All curves trend downward, with Q-Anchored curves declining faster and further (to roughly −60 to −80) than the corresponding A-Anchored curves (roughly −50 to −70); the 70B model distributes the decline across more layers before plateauing.
</details>
<details>
<summary>x24.png Details</summary>

### Visual Description
Two line charts of ΔP versus layer (0–30) for Mistral-7B-v0.1 (left) and Mistral-7B-v0.3 (right), with lines for Q-Anchored and A-Anchored variants of PopQA, TriviaQA, HotpotQA, and NQ. All curves decline steadily with depth; for each dataset the Q-Anchored curve ends roughly 10 points below the A-Anchored one at layer 30, NQ shows the largest drops, PopQA the smallest, and v0.3 shows slightly smaller drops than v0.1.
</details>
Figure 12: $\Delta\mathrm{P}$ under attention knockout, probing MLP activations of the last exact answer token.
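The probing setup named in the caption can be sketched in miniature. The code below is a hypothetical stand-in, not the paper's probe: it draws synthetic "activations" in which truthful examples are shifted along one hidden direction, then fits a linear (logistic-regression) probe with plain gradient descent, mirroring the idea of reading a truthfulness signal off an answer token's MLP activations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for MLP activations at the last exact-answer token:
# truthful examples (label 1) are shifted along a "truthfulness" direction.
d, n = 64, 400
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)               # 1 = truthful, 0 = hallucinated
acts = rng.normal(size=(n, d)) + 3.0 * np.outer(labels, direction)

# Linear probe (logistic regression) fit by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))     # predicted P(truthful)
    w -= 0.5 * acts.T @ (p - labels) / n
    b -= 0.5 * (p - labels).mean()

accuracy = (((acts @ w + b) > 0).astype(int) == labels).mean()
```

With real hidden states in place of the synthetic ones, the same recipe (fit on one split, score a held-out split) gives the hallucination-detection probes discussed in the paper.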
<details>
<summary>x25.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Qwen Models
### Overview
The image presents two line charts comparing the change in probability (ΔP) across different layers of two Qwen language models: Qwen3-8B and Qwen3-32B. The charts display ΔP for various question-answering datasets, distinguished by anchoring method (Q-Anchored and A-Anchored).
### Components/Axes
* **X-axis:** Layer (ranging from 0 to approximately 30 for Qwen3-8B and 0 to 60 for Qwen3-32B).
* **Y-axis:** ÎP (ranging from approximately -100 to 20).
* **Models:** Qwen3-8B (left chart), Qwen3-32B (right chart).
* **Datasets/Anchoring:**
* PopQA (Q-Anchored and A-Anchored)
* TriviaQA (Q-Anchored and A-Anchored)
* HotpotQA (Q-Anchored and A-Anchored)
* NQ (Q-Anchored and A-Anchored)
* **Legend:** Located at the bottom of the image, associating colors with specific datasets and anchoring methods.
### Detailed Analysis or Content Details
**Qwen3-8B (Left Chart)**
* **Q-Anchored (PopQA):** (Blue line) Starts at approximately 0 ÎP at Layer 0, declines steadily to approximately -60 ÎP at Layer 20, then fluctuates between -60 and -80 ÎP until Layer 30.
* **A-Anchored (PopQA):** (Brown dashed line) Remains relatively stable around 0-10 ÎP until Layer 15, then declines to approximately -20 ÎP at Layer 30.
* **Q-Anchored (TriviaQA):** (Green line) Starts at approximately 0 ÎP, declines to approximately -40 ÎP at Layer 10, then fluctuates between -40 and -70 ÎP until Layer 30.
* **A-Anchored (TriviaQA):** (Purple dashed line) Starts at approximately 0 ÎP, declines to approximately -30 ÎP at Layer 10, then fluctuates between -30 and -60 ÎP until Layer 30.
* **Q-Anchored (HotpotQA):** (Orange line) Starts at approximately 0 ÎP, declines to approximately -20 ÎP at Layer 10, then fluctuates between -20 and -50 ÎP until Layer 30.
* **A-Anchored (HotpotQA):** (Red dashed line) Starts at approximately 0 ÎP, declines to approximately -10 ÎP at Layer 10, then fluctuates between -10 and -40 ÎP until Layer 30.
* **Q-Anchored (NQ):** (Light Blue line) Starts at approximately 0 ÎP, declines to approximately -20 ÎP at Layer 10, then fluctuates between -20 and -50 ÎP until Layer 30.
* **A-Anchored (NQ):** (Gray dashed line) Starts at approximately 0 ÎP, declines to approximately -10 ÎP at Layer 10, then fluctuates between -10 and -40 ÎP until Layer 30.
**Qwen3-32B (Right Chart)**
* **Q-Anchored (PopQA):** (Blue line) Starts at approximately 0 ÎP, declines to approximately -20 ÎP at Layer 10, then fluctuates between -20 and -60 ÎP until Layer 60.
* **A-Anchored (PopQA):** (Brown dashed line) Remains relatively stable around 0-10 ÎP until Layer 20, then declines to approximately -20 ÎP at Layer 60.
* **Q-Anchored (TriviaQA):** (Green line) Starts at approximately 0 ÎP, declines to approximately -20 ÎP at Layer 10, then fluctuates between -20 and -80 ÎP until Layer 60.
* **A-Anchored (TriviaQA):** (Purple dashed line) Starts at approximately 0 ÎP, declines to approximately -10 ÎP at Layer 10, then fluctuates between -10 and -60 ÎP until Layer 60.
* **Q-Anchored (HotpotQA):** (Orange line) Starts at approximately 0 ÎP, declines to approximately -20 ÎP at Layer 10, then fluctuates between -20 and -80 ÎP until Layer 60.
* **A-Anchored (HotpotQA):** (Red dashed line) Starts at approximately 0 ÎP, declines to approximately -10 ÎP at Layer 10, then fluctuates between -10 and -50 ÎP until Layer 60.
* **Q-Anchored (NQ):** (Light Blue line) Starts at approximately 0 ÎP, declines to approximately -20 ÎP at Layer 10, then fluctuates between -20 and -80 ÎP until Layer 60.
* **A-Anchored (NQ):** (Gray dashed line) Starts at approximately 0 ÎP, declines to approximately -10 ÎP at Layer 10, then fluctuates between -10 and -50 ÎP until Layer 60.
### Key Observations
* All datasets exhibit a negative trend in ÎP as the layer number increases, indicating a performance decrease with depth in both models.
* The Q-Anchored lines generally show a more significant decline in ÎP compared to the A-Anchored lines.
* The Qwen3-32B model shows a more pronounced decline in ÎP across all datasets compared to the Qwen3-8B model.
* The PopQA and NQ datasets consistently show the most significant declines in ÎP.
### Interpretation
The charts demonstrate that as the model depth (layer number) increases, performance on question-answering tasks tends to decrease. This suggests that deeper layers may not always contribute positively to performance and could potentially introduce noise or hinder the model's ability to generalize. The difference between Q-Anchored and A-Anchored lines suggests that the anchoring method impacts performance, with Q-Anchored generally performing worse. The larger decline in ÎP for Qwen3-32B compared to Qwen3-8B could indicate that the larger model is more susceptible to performance degradation with depth, or that the optimal depth for the larger model is different. The consistent performance decline on PopQA and NQ datasets suggests these datasets may be more sensitive to the effects of model depth. These findings could inform strategies for model pruning, layer selection, or architectural modifications to improve performance and efficiency.
</details>
<details>
<summary>x26.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Qwen Models
### Overview
Two line charts compare ΔP across layers under attention knockout for Qwen3-8B (left) and Qwen3-32B (right), with Q-Anchored and A-Anchored curves for PopQA, TriviaQA, HotpotQA, and NQ.
### Components/Axes
* **X-axis:** Layer (0 to about 35 for Qwen3-8B; 0 to about 60 for Qwen3-32B).
* **Y-axis:** ΔP (roughly -90 to 0).
* **Legend:** At the bottom of the image, clearly associating colors with datasets and anchoring methods.
### Detailed Analysis
For both models, every curve drops steeply within the first ten layers, from near 0 to roughly -50 to -80, and then fluctuates within that band for the remaining layers. Q-Anchored curves end slightly lower than their A-Anchored counterparts on each dataset (for example, PopQA ends near -82 versus -75 on Qwen3-8B), and the corresponding Qwen3-32B curves follow nearly the same trajectories over a deeper layer range.
### Key Observations
* All curves trend downward, with the steepest decline in the initial layers (0-10) for both models.
* Q-Anchored curves show a slightly larger decrease than A-Anchored curves on each dataset.
* HotpotQA consistently shows the smallest drop across both models and both anchoring conditions.
* Qwen3-32B mirrors the Qwen3-8B trend over a greater number of layers.
### Interpretation
The steep early decline indicates that the information flow disrupted by the knockout is established in the early layers at this token position, after which the effect plateaus. The persistent, if modest, gap between Q-Anchored and A-Anchored curves is consistent with Q-Anchored examples relying more on question-to-answer attention. The near-identical trends for Qwen3-8B and Qwen3-32B suggest the effect is not simply a function of model size.
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Qwen Models
### Overview
Two line charts compare ΔP across layers under attention knockout for Qwen3-8B (left) and Qwen3-32B (right), with Q-Anchored and A-Anchored curves for PopQA, TriviaQA, HotpotQA, and NQ.
### Components/Axes
* **X-axis:** Layer (about 0 to 35 for Qwen3-8B; 0 to 60 for Qwen3-32B).
* **Y-axis:** ΔP (roughly -90 to 0).
* **Legend:** At the bottom of the image, associating colors with anchoring/dataset combinations.
### Detailed Analysis
In both charts, Q-Anchored curves start near ΔP = 0 and fall steeply, reaching roughly -60 to -85 by the final layers (PopQA drops fastest, to about -80 by layer 10). A-Anchored curves, by contrast, decline only to about -20 within the first few layers and then plateau in the -20 to -35 range for the rest of the network. The two models show the same qualitative pattern over their respective depths.
### Key Observations
* For both models, Q-Anchored curves consistently decline far more steeply than A-Anchored curves.
* A-Anchored curves plateau after the early layers, indicating the knockout's effect stabilizes.
* All four datasets show similar trends under both anchoring conditions, differing mainly in magnitude.
* Qwen3-32B mirrors the Qwen3-8B trend over a larger number of layers.
### Interpretation
At the last exact answer token, the contrast between the two anchoring conditions is sharpest: severing question-to-answer attention removes most of the probed signal for Q-Anchored examples, while A-Anchored examples lose comparatively little, consistent with their signal deriving from the answer tokens themselves. The consistency across datasets and model sizes suggests this division of pathways is a general property rather than a task- or scale-specific artifact.
</details>
Figure 13: $\Delta\mathrm{P}$ under attention knockout for reasoning models. Probing attention activations for the final token (top), the token immediately preceding the exact answer tokens (middle), and the last exact answer token (bottom).
<details>
<summary>x28.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Qwen Models
### Overview
Two line charts compare ΔP across layers under attention knockout for Qwen3-8B (left) and Qwen3-32B (right), with Q-Anchored and A-Anchored curves for PopQA, TriviaQA, HotpotQA, and NQ.
### Components/Axes
* **X-axis:** Layer (about 0 to 35 for Qwen3-8B; 0 to 60 for Qwen3-32B).
* **Y-axis:** ΔP (roughly -100 to 0).
* **Legend:** At the bottom of the image, clearly associating colors and line styles with datasets and anchoring methods.
### Detailed Analysis
Q-Anchored curves decline steadily in both charts, ending at roughly -60 to -85 for Qwen3-8B and -65 to -90 for Qwen3-32B, with PopQA and NQ falling furthest. The A-Anchored PopQA and TriviaQA curves stay nearly flat (around -2 to -10) across all layers, while the A-Anchored HotpotQA and NQ curves decline more moderately, to roughly -50 to -65 by the final layers.
### Key Observations
* **General trend:** In both models, Q-Anchored curves decline consistently as the knockout layer increases.
* **A-Anchored stability:** Several A-Anchored curves remain nearly stable across all layers, showing little sensitivity to the knockout.
* **Dataset variation:** PopQA and NQ exhibit larger Q-Anchored decreases than TriviaQA and HotpotQA.
* **Model size:** The Q-Anchored decline is somewhat more pronounced for Qwen3-32B than for Qwen3-8B.
### Interpretation
The contrast between the steadily declining Q-Anchored curves and the comparatively stable A-Anchored curves indicates that, at this position, the probed signal for Q-Anchored examples depends on attention that the knockout removes, while the A-Anchored signal is largely self-contained. Dataset-level differences suggest varying degrees of reliance on the question-anchored pathway, and the pattern holds, or strengthens slightly, at the larger model scale.
</details>
<details>
<summary>x29.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Qwen Models
### Overview
Two side-by-side line charts show ΔP as a function of the knockout layer for Qwen3-8B (left) and Qwen3-32B (right), with Q-Anchored (solid) and A-Anchored (dashed) curves for PopQA, TriviaQA, HotpotQA, and NQ.
### Components/Axes
* **X-axis:** Layer (0 to about 35 for Qwen3-8B; 0 to about 60 for Qwen3-32B).
* **Y-axis:** ΔP (roughly -90 to 0).
* **Legend:** At the bottom of the image, providing color-coded labels for each dataset/anchoring combination.
### Detailed Analysis
Every curve drops steeply within the first ten layers and then plateaus. For PopQA, TriviaQA, and HotpotQA, Q-Anchored curves settle around -60 to -80 and A-Anchored curves around -50 to -60; NQ is the outlier, with its Q-Anchored curve plateauing near -30 to -40 and its A-Anchored curve near -20 to -30. The Qwen3-32B chart reproduces these trajectories over a deeper layer range.
### Key Observations
* All curves decline steeply within the first 10 layers, regardless of model, dataset, or anchoring condition, and then plateau.
* Q-Anchored curves generally sit below the corresponding A-Anchored curves.
* NQ consistently shows the smallest decrease of the four datasets.
* Qwen3-32B mirrors the Qwen3-8B pattern over a greater depth (about 60 layers versus 35).
### Interpretation
The early, steep decline suggests that the information flow targeted by the knockout is concentrated in the early layers at this token position, with deeper layers contributing little additional effect. The gap between Q-Anchored and A-Anchored curves again points to a stronger reliance on question-to-answer attention in the Q-Anchored condition. The comparatively small drop on NQ may indicate weaker dependence on this pathway for that dataset, and the matching trends across the two model sizes suggest a scale-independent mechanism.
</details>
<details>
<summary>x30.png Details</summary>

### Visual Description
## Line Chart: ΔP vs. Layer for Qwen Models
### Overview
Two side-by-side line charts compare ΔP across layers under attention knockout for Qwen3-8B (left) and Qwen3-32B (right), with Q-Anchored and A-Anchored curves for PopQA, TriviaQA, HotpotQA, and NQ. Shaded bands around each curve indicate variance.
### Components/Axes
* **X-axis:** Layer (0 to about 35 for Qwen3-8B; 0 to about 65 for Qwen3-32B).
* **Y-axis:** ΔP (roughly -90 to 0).
* **Legend:** At the bottom of the image, detailing the color and line style for each dataset/anchoring combination.
### Detailed Analysis
In both charts, Q-Anchored curves start between about -5 and -15 and A-Anchored curves between about -20 and -25; all decline toward roughly -60 to -80 by the final layers, with HotpotQA finishing least negative and PopQA most negative. The Qwen3-32B curves trace nearly the same trajectories as the Qwen3-8B curves over a deeper layer range.
### Key Observations
* The decline in ΔP is steepest in the earlier layers (roughly 0-20) for both models.
* Q-Anchored curves generally start at less negative ΔP values than the corresponding A-Anchored curves.
* HotpotQA consistently shows the least negative ΔP values across layers and anchoring conditions.
* The trends are remarkably similar between Qwen3-8B and Qwen3-32B, so increasing model size does not fundamentally alter the pattern.
* The uncertainty bands are relatively wide, especially in the early layers.
### Interpretation
The charts show a consistent, layer-dependent loss of the probed signal under attention knockout across datasets and anchoring conditions, with the early layers contributing most of the effect. Dataset differences (HotpotQA being least affected) suggest that how strongly the signal depends on the severed attention varies with task characteristics. The similarity between the 8B and 32B models indicates the mechanism is not a byproduct of scale, though the wide uncertainty bands mean some per-dataset differences should be read cautiously.
</details>
Figure 14: $\Delta\mathrm{P}$ under attention knockout for reasoning models. Probing MLP activations for the final token (top), the token immediately preceding the exact answer tokens (middle), and the last exact answer token (bottom).
<details>
<summary>x31.png Details</summary>

### Visual Description
Two line charts plot ΔP against layer under attention knockout for Llama-3.2-3B-Instruct (layers 0-25, left) and Llama-3-8B-Instruct (layers 0-30, right); the y-axis spans roughly -100 to 0. Eight series cover the four datasets (PopQA, TriviaQA, HotpotQA, NQ) under the Q-Anchored and A-Anchored conditions, with the legend at the bottom of the image.
### Key Observations
* Q-Anchored lines decline steeply throughout: Q-Anchored (PopQA) reaches about -80 on the 3B model and about -90 on the 8B model.
* A-Anchored lines drop to around -20 by layer 5 and then plateau near -30 to -40 for the remaining layers.
* The gap holds across all four datasets and both model sizes, indicating that Q-Anchored samples lose far more probability than A-Anchored samples when attention to question tokens is blocked, while A-Anchored samples remain comparatively robust.
</details>
<details>
<summary>x32.png Details</summary>

### Visual Description
Two line charts plot ΔP against layer (0 to about 32) for Mistral-7B-Instruct-v0.1 (left) and Mistral-7B-Instruct-v0.3 (right); the y-axis spans roughly -80 to 0. Eight series cover the four datasets (PopQA, TriviaQA, HotpotQA, NQ) under the Q-Anchored and A-Anchored conditions.
### Key Observations
* All lines decline steadily with depth, and Q-Anchored lines consistently fall further than their A-Anchored counterparts on every dataset.
* HotpotQA shows the smallest drops in both charts, ending around -60/-55 (v0.1) and -50/-45 (v0.3) at layer 32.
* v0.3 declines less steeply than v0.1 across all series.
</details>
Figure 15: $\Delta\mathrm{P}$ under attention knockout for instruct models.
<details>
<summary>x33.png Details</summary>

### Visual Description
Two line charts plot ΔP against layer for Llama-3.2-1B (layers 0-16, left) and Llama-3.2-3B (layers 0-25, right); the y-axis spans only about -15 to 0. Eight series cover the four datasets (PopQA, TriviaQA, HotpotQA, NQ) under the Q-Anchored and A-Anchored conditions.
### Key Observations
* All lines stay within a narrow band, dipping slightly in early layers and ending between roughly -1 and -10.
* Q-Anchored and A-Anchored curves track each other closely for every dataset.
* NQ shows the largest final drops, about -8 on the 1B model and -10 on the 3B model.
</details>
<details>
<summary>x34.png Details</summary>

### Visual Description
Two line charts plot ΔP against layer for Llama-3-8B (layers 0-30, left) and Llama-3-70B (layers 0-80, right); the y-axis spans roughly -30 to 0. Eight series cover the four datasets (PopQA, TriviaQA, HotpotQA, NQ) under the Q-Anchored and A-Anchored conditions.
### Key Observations
* All series stay near -1 to -2 through most layers and decline only modestly toward the final layers, reaching about -7 to -11 for the 8B model and -10 to -15 for the 70B model.
* Q-Anchored and A-Anchored curves remain close to each other for every dataset, and NQ shows the largest final drops in both models.
</details>
<details>
<summary>x35.png Details</summary>

### Visual Description
Two line charts plot ΔP against layer (0 to about 32) for Mistral-7B-v0.1 (left) and Mistral-7B-v0.3 (right); the y-axis spans roughly -15 to 0. Eight series cover the four datasets (PopQA, TriviaQA, HotpotQA, NQ) under the Q-Anchored and A-Anchored conditions.
### Key Observations
* All lines stay close to 0 until about layer 18 and then decline modestly, ending between roughly -5 and -14.
* Q-Anchored and A-Anchored curves behave similarly on every dataset, with v0.3 declining slightly less than v0.1.
</details>
Figure 16: $\Delta\mathrm{P}$ under attention knockout with randomly masked question tokens. Unlike selectively blocking the exact question tokens, both Q-Anchored and A-Anchored samples exhibit similar patterns, with substantially smaller probability changes when question tokens are masked at random. This suggests that exact question tokens play a critical role in conveying the semantic information of core frame elements.
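The knockout intervention described above can be sketched with an additive attention mask. The snippet below is an illustrative reconstruction, not the paper's implementation; `knockout_mask`, the toy sequence length, and the chosen positions are assumptions. It blocks the final token's attention to either the exact question positions or a random set of earlier positions of the same size.

```python
import numpy as np

def knockout_mask(seq_len, blocked_positions, query_pos=-1):
    """Additive causal attention mask (0 = attend, -inf = blocked) in which
    the query token at `query_pos` is prevented from attending to
    `blocked_positions` (e.g. the exact question tokens)."""
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)  # causal part
    mask[query_pos, list(blocked_positions)] = -np.inf         # knockout
    return mask

# Knock out attention from the final token to question tokens 2..5,
# versus a random-position baseline of the same size.
rng = np.random.default_rng(0)
exact = knockout_mask(10, [2, 3, 4, 5])
random_pos = rng.choice(9, size=4, replace=False)  # any earlier positions
randomized = knockout_mask(10, random_pos)
```

Comparing ΔP under `exact` versus `randomized` masks is what distinguishes blocking the actual question tokens from the random baseline shown in Figure 16.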
Appendix D Token Patching
<details>
<summary>x36.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama-3.2-1B
### Overview
The image presents two identical bar charts comparing the "Prediction Flip Rate" for the Llama-3.2-1B model across four datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured for two anchoring methods: "Q-Anchored (exact_question)" and "A-Anchored (exact_question)". Each chart displays the flip rate as a function of the dataset, with separate bars for each anchoring method within each dataset.
### Components/Axes
* **Title:** "Llama-3.2-1B" (appears above each chart)
* **X-axis:** "Dataset" with categories: PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** "Prediction Flip Rate" with a scale ranging from 0 to 80.
* **Legend:** Located at the bottom-center of the image.
* "Q-Anchored (exact\_question)" - represented by a reddish-brown color.
* "A-Anchored (exact\_question)" - represented by a gray color.
### Detailed Analysis
**Chart 1:**
* **PopQA:**
* Q-Anchored: Approximately 80.
* A-Anchored: Approximately 10.
* **TriviaQA:**
* Q-Anchored: Approximately 70.
* A-Anchored: Approximately 30.
* **HotpotQA:**
* Q-Anchored: Approximately 45.
* A-Anchored: Approximately 15.
* **NQ:**
* Q-Anchored: Approximately 50.
* A-Anchored: Approximately 35.
**Chart 2:** Identical to Chart 1.
The Q-Anchored bars consistently show higher flip rates than the A-Anchored bars across all datasets. The Q-Anchored flip rate is highest for PopQA and TriviaQA, and lower for HotpotQA and NQ. The A-Anchored flip rate is relatively consistent across all datasets, ranging from approximately 10 to 35.
### Key Observations
* The Q-Anchored method consistently results in a significantly higher prediction flip rate compared to the A-Anchored method.
* The PopQA and TriviaQA datasets exhibit the highest flip rates for the Q-Anchored method.
* The A-Anchored method shows a relatively stable flip rate across all datasets.
* The two charts are identical renderings of the same results.
### Interpretation
The data suggests that anchoring predictions using the exact question ("Q-Anchored") leads to a higher rate of prediction flips than anchoring with the exact answer ("A-Anchored") for the Llama-3.2-1B model. This could indicate that the model is more sensitive to variations in the question phrasing than to variations in the answer. The higher flip rates observed for PopQA and TriviaQA might suggest that these datasets are more challenging for the model, or that its initial predictions are less confident on them. The relatively stable flip rate for the A-Anchored method suggests that the model is more consistent in its predictions when anchored to the answer. The "Prediction Flip Rate" refers to the percentage of cases in which the probe's prediction changes after token patching.
</details>
<details>
<summary>x37.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama-3 Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two Llama-3 models (8B and 70B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured on the Y-axis, while the datasets are displayed on the X-axis. Two bars are shown for each dataset, representing "Q-Anchored" and "A-Anchored" predictions.
### Components/Axes
* **Title (Left):** Llama-3-8B
* **Title (Right):** Llama-3-70B
* **X-axis Label:** Dataset
* **Y-axis Label:** Prediction Flip Rate
* **Legend:**
* Red Bar: Q-Anchored (exact_question)
* Gray Bar: A-Anchored (exact_question)
* **Datasets (X-axis):** PopQA, TriviaQA, HotpotQA, NQ
### Detailed Analysis
**Llama-3-8B (Left Chart)**
* **PopQA:**
* Q-Anchored: Approximately 68% (± 2%)
* A-Anchored: Approximately 24% (± 2%)
* **TriviaQA:**
* Q-Anchored: Approximately 92% (± 2%)
* A-Anchored: Approximately 52% (± 2%)
* **HotpotQA:**
* Q-Anchored: Approximately 46% (± 2%)
* A-Anchored: Approximately 8% (± 2%)
* **NQ:**
* Q-Anchored: Approximately 70% (± 2%)
* A-Anchored: Approximately 26% (± 2%)
**Llama-3-70B (Right Chart)**
* **PopQA:**
* Q-Anchored: Approximately 82% (± 2%)
* A-Anchored: Approximately 28% (± 2%)
* **TriviaQA:**
* Q-Anchored: Approximately 64% (± 2%)
* A-Anchored: Approximately 40% (± 2%)
* **HotpotQA:**
* Q-Anchored: Approximately 46% (± 2%)
* A-Anchored: Approximately 16% (± 2%)
* **NQ:**
* Q-Anchored: Approximately 88% (± 2%)
* A-Anchored: Approximately 44% (± 2%)
**Trends:**
* In both models, the Q-Anchored flip rate is consistently higher than the A-Anchored flip rate across all datasets.
* For the 8B model, TriviaQA shows the highest Q-Anchored flip rate.
* For the 70B model, NQ shows the highest Q-Anchored flip rate.
* HotpotQA consistently shows the lowest Q-Anchored flip rate for both models.
### Key Observations
* The 70B model generally exhibits higher Q-Anchored flip rates than the 8B model, particularly on PopQA and NQ.
* The difference between Q-Anchored and A-Anchored flip rates is substantial across all datasets, suggesting that anchoring the prediction to the question (Q-Anchored) leads to more frequent flips than anchoring to the answer (A-Anchored).
* The A-Anchored flip rates are relatively low across all datasets, indicating that the model is less likely to change its prediction when prompted with the answer.
### Interpretation
The data suggests that the Llama-3 models, particularly the larger 70B version, are sensitive to the way the prompt is constructed. Anchoring the prediction to the question itself (Q-Anchored) results in a significantly higher prediction flip rate compared to anchoring to the answer (A-Anchored). This implies that the models are more susceptible to subtle changes in the question phrasing or context.
The varying flip rates across different datasets may reflect the inherent difficulty and characteristics of each dataset. For example, TriviaQA, with its focus on factual knowledge, might be more prone to flips due to the model's uncertainty in recalling specific facts. HotpotQA, which requires multi-hop reasoning, might exhibit lower flip rates because the model needs to maintain consistency across multiple reasoning steps.
The higher performance of the 70B model suggests that increasing model size can improve robustness and reduce sensitivity to prompt variations, but the fundamental difference between Q-Anchored and A-Anchored flip rates remains consistent. This highlights the importance of careful prompt engineering and understanding the model's behavior when interpreting its predictions.
</details>
<details>
<summary>x38.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison
### Overview
This image presents a bar chart comparing the Prediction Flip Rate for two models, Mistral-7B-v0.1 and Mistral-7B-v0.3, across four datasets: PopQA, TriviaQA, HotpotQA, and NQ. The chart uses paired bars to represent two anchoring methods: Q-Anchored (exact_question) and A-Anchored (exact_question).
### Components/Axes
* **X-axis:** Dataset - PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** Prediction Flip Rate - Scale ranges from 0 to 80 (approximately).
* **Models:** Mistral-7B-v0.1 (left chart), Mistral-7B-v0.3 (right chart).
* **Legend:**
* Red: Q-Anchored (exact\_question)
* Gray: A-Anchored (exact\_question)
### Detailed Analysis
**Mistral-7B-v0.1 (Left Chart)**
* **PopQA:** Q-Anchored: ~68, A-Anchored: ~42
* **TriviaQA:** Q-Anchored: ~72, A-Anchored: ~52
* **HotpotQA:** Q-Anchored: ~56, A-Anchored: ~16
* **NQ:** Q-Anchored: ~64, A-Anchored: ~32
The Q-Anchored bars consistently show higher flip rates than the A-Anchored bars across all datasets. The highest flip rate for this model is observed on the TriviaQA dataset with Q-Anchoring. The lowest flip rate is observed on the HotpotQA dataset with A-Anchoring.
**Mistral-7B-v0.3 (Right Chart)**
* **PopQA:** Q-Anchored: ~60, A-Anchored: ~44
* **TriviaQA:** Q-Anchored: ~76, A-Anchored: ~52
* **HotpotQA:** Q-Anchored: ~60, A-Anchored: ~24
* **NQ:** Q-Anchored: ~68, A-Anchored: ~36
Similar to the v0.1 model, the Q-Anchored flip rates consistently exceed the A-Anchored flip rates. The highest flip rate for this model is observed on the TriviaQA dataset with Q-Anchoring. The lowest flip rate is observed on the HotpotQA dataset with A-Anchoring.
### Key Observations
* Q-Anchoring consistently results in higher prediction flip rates than A-Anchoring for both models across all datasets.
* TriviaQA consistently shows the highest flip rates for both models and both anchoring methods.
* HotpotQA consistently shows the lowest flip rates for both models and both anchoring methods.
* The difference in flip rate between Q-Anchored and A-Anchored is more pronounced for the HotpotQA dataset.
* The flip rates for Mistral-7B-v0.3 are generally higher than those for Mistral-7B-v0.1, particularly for Q-Anchoring.
### Interpretation
The data suggests that anchoring predictions using the exact question (Q-Anchored) leads to a higher rate of prediction flips compared to anchoring with the exact answer (A-Anchored). This could indicate that the question provides more informative cues for identifying potential errors in the model's predictions. The consistently high flip rates on TriviaQA might suggest that this dataset presents more challenging or ambiguous questions, while HotpotQA might contain more straightforward or well-defined questions. The generally higher flip rates for v0.3 relative to v0.1 suggest that the model update made the predictions more sensitive to the patched activations, particularly under Q-Anchoring. The difference in flip rates between anchoring methods could serve as a metric for evaluating the robustness of the model's predictions.
</details>
Figure 17: Prediction flip rate under token patching, probing attention activations of the final token.
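The flip-rate metric reported in Figures 17-19 can be computed as the percentage of probe predictions that change after patching. The sketch below is illustrative only; the function name and toy labels are assumptions, not the paper's code.

```python
import numpy as np

def prediction_flip_rate(preds_before, preds_after):
    """Percentage of probe predictions that change after token patching."""
    before = np.asarray(preds_before)
    after = np.asarray(preds_after)
    return 100.0 * np.mean(before != after)

# Toy example: 3 of 8 binary probe predictions flip after patching.
before = [1, 0, 1, 1, 0, 0, 1, 0]
after  = [0, 0, 1, 0, 0, 1, 1, 0]
print(prediction_flip_rate(before, after))  # 37.5
```

A higher flip rate for Q-Anchored samples then indicates that the patched question tokens carry more of the signal the probe relies on.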
<details>
<summary>x39.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama-3.2-1B & Llama-3.2-3B
### Overview
This image presents two side-by-side bar charts comparing the "Prediction Flip Rate" for two language models, Llama-3.2-1B and Llama-3.2-3B, across four datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured on the Y-axis, while the datasets are displayed on the X-axis. Each dataset has two bars representing "Q-Anchored (exact_question)" and "A-Anchored (exact_question)".
### Components/Axes
* **Title (Left Chart):** Llama-3.2-1B
* **Title (Right Chart):** Llama-3.2-3B
* **X-axis Label:** Dataset
* **Y-axis Label:** Prediction Flip Rate
* **X-axis Markers:** PopQA, TriviaQA, HotpotQA, NQ
* **Y-axis Scale:** 0 to 40 (approximately), with increments of 10.
* **Legend:**
* Color: Light Reddish-Brown
* Label: Q-Anchored (exact\_question)
* Color: Gray
* Label: A-Anchored (exact\_question)
### Detailed Analysis
**Left Chart: Llama-3.2-1B**
* **PopQA:** The Q-Anchored bar is approximately 45, while the A-Anchored bar is approximately 8.
* **TriviaQA:** The Q-Anchored bar is approximately 30, while the A-Anchored bar is approximately 15.
* **HotpotQA:** The Q-Anchored bar is approximately 40, while the A-Anchored bar is approximately 10.
* **NQ:** The Q-Anchored bar is approximately 20, while the A-Anchored bar is approximately 5.
**Right Chart: Llama-3.2-3B**
* **PopQA:** The Q-Anchored bar is approximately 25, while the A-Anchored bar is approximately 5.
* **TriviaQA:** The Q-Anchored bar is approximately 45, while the A-Anchored bar is approximately 20.
* **HotpotQA:** The Q-Anchored bar is approximately 40, while the A-Anchored bar is approximately 10.
* **NQ:** The Q-Anchored bar is approximately 40, while the A-Anchored bar is approximately 25.
**Trends:**
* In both charts, the Q-Anchored bars are consistently higher than the A-Anchored bars across all datasets.
* For Llama-3.2-1B, the highest flip rate is observed for PopQA, followed by HotpotQA, TriviaQA, and NQ.
* For Llama-3.2-3B, the highest flip rate is observed for TriviaQA, followed by NQ, HotpotQA, and PopQA.
### Key Observations
* The Q-Anchored flip rate is significantly higher than the A-Anchored flip rate for both models across all datasets.
* Llama-3.2-1B shows a higher flip rate on PopQA and HotpotQA compared to Llama-3.2-3B.
* Llama-3.2-3B shows a higher flip rate on TriviaQA and NQ compared to Llama-3.2-1B.
### Interpretation
The data suggests that anchoring predictions based on the exact question (Q-Anchored) leads to a higher prediction flip rate compared to anchoring based on the exact answer (A-Anchored) for both Llama-3.2-1B and Llama-3.2-3B. This indicates that the models are more sensitive to changes in the question phrasing than changes in the answer phrasing.
The differences in flip rates between the two models across different datasets suggest that the models perform differently depending on the nature of the dataset. Llama-3.2-1B appears to be more robust on PopQA and HotpotQA, while Llama-3.2-3B performs better on TriviaQA and NQ. This could be due to differences in the training data or the complexity of the questions in each dataset.
The high flip rates observed in general suggest that the models are not very confident in their predictions and are easily influenced by small changes in the input. This could be a limitation of the models and an area for future improvement. The difference between Q-Anchored and A-Anchored could also indicate a bias in the model towards question-based reasoning.
</details>
<details>
<summary>x40.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama-3 Models
### Overview
The image presents a comparative bar chart illustrating the prediction flip rate for two Llama-3 models (8B and 70B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured as a percentage and is shown for both "Q-Anchored" (exact question) and "A-Anchored" (exact question) scenarios.
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 60, with increments of 10)
* **Models:** Two separate charts, one for Llama-3-8B and one for Llama-3-70B.
* **Legend:**
* Red: Q-Anchored (exact\_question)
* Gray: A-Anchored (exact\_question)
### Detailed Analysis
**Llama-3-8B Chart:**
* **PopQA:** Q-Anchored is approximately 42%, A-Anchored is approximately 8%.
* **TriviaQA:** Q-Anchored is approximately 58%, A-Anchored is approximately 48%.
* **HotpotQA:** Q-Anchored is approximately 42%, A-Anchored is approximately 10%.
* **NQ:** Q-Anchored is approximately 42%, A-Anchored is approximately 16%.
**Llama-3-70B Chart:**
* **PopQA:** Q-Anchored is approximately 42%, A-Anchored is approximately 42%.
* **TriviaQA:** Q-Anchored is approximately 56%, A-Anchored is approximately 48%.
* **HotpotQA:** Q-Anchored is approximately 48%, A-Anchored is approximately 12%.
* **NQ:** Q-Anchored is approximately 42%, A-Anchored is approximately 16%.
**Trends:**
* In both models, the Q-Anchored flip rate is generally higher than the A-Anchored flip rate for all datasets.
* TriviaQA consistently shows the highest Q-Anchored flip rate for both models.
* HotpotQA consistently shows the lowest A-Anchored flip rate for both models.
* The 70B model shows a more consistent A-Anchored flip rate across datasets compared to the 8B model.
### Key Observations
* The difference between Q-Anchored and A-Anchored flip rates is most pronounced in the 8B model, particularly for TriviaQA and HotpotQA.
* The 70B model exhibits a more balanced flip rate between Q-Anchored and A-Anchored scenarios.
* The 70B model shows a slight increase in Q-Anchored flip rate for HotpotQA compared to the 8B model.
### Interpretation
The data suggests that the Llama-3 models exhibit a tendency to "flip" their predictions when prompted with the question directly (Q-Anchored) versus when prompted with the answer (A-Anchored). This difference in flip rate may indicate sensitivity to the phrasing of the prompt or a potential bias in the model's training data. The higher flip rates observed in TriviaQA could be due to the complexity or ambiguity of the questions in that dataset. The 70B model's more consistent performance across datasets suggests that increasing model size may improve robustness and reduce sensitivity to prompt variations. The fact that the A-Anchored flip rates are generally lower suggests that providing the answer as context can help stabilize the model's predictions. This could be useful in applications where consistency and reliability are critical. The difference in flip rates between the two models suggests that the larger model (70B) is less susceptible to prompt engineering or adversarial attacks.
</details>
<details>
<summary>x41.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Mistral Models
### Overview
This image presents a comparative bar chart showing the prediction flip rate for two versions of the Mistral-7B model (v0.1 and v0.3) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured on the Y-axis, while the datasets are displayed on the X-axis. Each dataset has two bars representing "Q-Anchored" and "A-Anchored" predictions.
### Components/Axes
* **X-axis:** "Dataset" with categories: PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** "Prediction Flip Rate" with a scale from 0 to 60 (approximately).
* **Models:** Two separate charts, one for "Mistral-7B-v0.1" and one for "Mistral-7B-v0.3", positioned side-by-side.
* **Legend:** Located at the bottom-center of the image.
* Red bar: "Q-Anchored (exact\_question)"
* Gray bar: "A-Anchored (exact\_question)"
### Detailed Analysis
**Mistral-7B-v0.1 Chart:**
* **PopQA:**
* Q-Anchored: Approximately 55.
* A-Anchored: Approximately 15.
* **TriviaQA:**
* Q-Anchored: Approximately 55.
* A-Anchored: Approximately 30.
* **HotpotQA:**
* Q-Anchored: Approximately 45.
* A-Anchored: Approximately 10.
* **NQ:**
* Q-Anchored: Approximately 50.
* A-Anchored: Approximately 45.
**Mistral-7B-v0.3 Chart:**
* **PopQA:**
* Q-Anchored: Approximately 50.
* A-Anchored: Approximately 10.
* **TriviaQA:**
* Q-Anchored: Approximately 60.
* A-Anchored: Approximately 25.
* **HotpotQA:**
* Q-Anchored: Approximately 50.
* A-Anchored: Approximately 15.
* **NQ:**
* Q-Anchored: Approximately 50.
* A-Anchored: Approximately 45.
**Trends:**
* In both models, the Q-Anchored bars are consistently higher than the A-Anchored bars across all datasets, indicating a higher prediction flip rate when the question is anchored.
* For both models, TriviaQA generally shows the highest Q-Anchored flip rate.
* HotpotQA consistently shows the lowest A-Anchored flip rate.
### Key Observations
* The prediction flip rate is significantly higher for Q-Anchored predictions compared to A-Anchored predictions in all datasets for both models.
* The Mistral-7B-v0.3 model shows a slight decrease in Q-Anchored flip rate for PopQA compared to v0.1.
* The Mistral-7B-v0.3 model shows an increase in Q-Anchored flip rate for TriviaQA compared to v0.1.
* The A-Anchored flip rates are generally lower and more consistent across datasets.
### Interpretation
The data suggests that anchoring predictions to the question (Q-Anchored) leads to a higher rate of prediction flips compared to anchoring to the answer (A-Anchored). This could indicate that the model is more sensitive to variations in the question phrasing or that the question provides more informative cues for prediction. The differences between the two model versions (v0.1 and v0.3) suggest that model updates can influence the prediction flip rate, potentially due to changes in the model's architecture or training data. The varying flip rates across datasets may reflect the inherent difficulty and characteristics of each dataset. For example, TriviaQA, with its higher flip rate, might contain more ambiguous or challenging questions. The relatively low A-Anchored flip rates suggest that the model is more stable when relying on the answer context. This data is valuable for understanding the model's behavior and identifying areas for improvement, particularly in terms of robustness to question variations and sensitivity to different types of knowledge.
</details>
Figure 18: Prediction flip rate under token patching, probing attention activations of the token immediately preceding the exact answer tokens.
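Patching an activation at a single position, as in the setup above, amounts to a copy-and-replace from a donor run. The sketch below is a minimal illustration; the shapes, names, and toy values are assumptions rather than the paper's implementation.

```python
import numpy as np

def patch_position(activations, donor_activations, position):
    """Return a copy of `activations` (seq_len x hidden) in which the vector
    at `position` is replaced by the corresponding vector from a donor run."""
    patched = np.array(activations, copy=True)
    patched[position] = donor_activations[position]
    return patched

# Patch the token immediately preceding the answer span [ans_start, ans_end).
acts = np.zeros((6, 4))    # activations from the original run
donor = np.ones((6, 4))    # activations from the donor (patched-source) run
ans_start = 3
patched = patch_position(acts, donor, ans_start - 1)
```

Running the probe on the original versus the patched activations, and counting disagreements, yields the flip rates plotted in the figure.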
<details>
<summary>x42.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama Models
### Overview
The image presents a comparative bar chart illustrating the prediction flip rate for two Llama models (Llama-3.2-1B and Llama-3.2-3B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is the percentage of probe predictions that change under token patching, reported separately for Q-Anchored and A-Anchored samples.
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 60)
* **Models:** Two separate charts, one for Llama-3.2-1B and one for Llama-3.2-3B. Each chart displays the same datasets.
* **Legend:**
* Red: Q-Anchored (exact\_question)
* Gray: A-Anchored (exact\_question)
* **Chart Arrangement:** Two charts are positioned side-by-side.
### Detailed Analysis
**Llama-3.2-1B Chart:**
* **PopQA:** Q-Anchored: Approximately 48. A-Anchored: Approximately 32.
* **TriviaQA:** Q-Anchored: Approximately 56. A-Anchored: Approximately 28.
* **HotpotQA:** Q-Anchored: Approximately 60. A-Anchored: Approximately 8.
* **NQ:** Q-Anchored: Approximately 48. A-Anchored: Approximately 16.
The Q-Anchored bars consistently show higher flip rates than the A-Anchored bars across all datasets. The highest flip rate for this model is observed on the HotpotQA dataset with Q-Anchored prompts.
**Llama-3.2-3B Chart:**
* **PopQA:** Q-Anchored: Approximately 52. A-Anchored: Approximately 24.
* **TriviaQA:** Q-Anchored: Approximately 58. A-Anchored: Approximately 24.
* **HotpotQA:** Q-Anchored: Approximately 52. A-Anchored: Approximately 8.
* **NQ:** Q-Anchored: Approximately 48. A-Anchored: Approximately 16.
Similar to the 1B model, the 3B model also exhibits higher flip rates for Q-Anchored prompts. The highest flip rate for this model is observed on the TriviaQA dataset with Q-Anchored prompts.
### Key Observations
* **Q-Anchored vs. A-Anchored:** The prediction flip rate is significantly higher when the prompt is anchored to the question (Q-Anchored) compared to being anchored to the answer (A-Anchored) for both models and all datasets.
* **Dataset Variation:** The flip rate varies depending on the dataset. HotpotQA consistently shows the highest flip rate for the 1B model, while TriviaQA shows the highest flip rate for the 3B model.
* **Model Comparison:** The 3B model generally shows slightly higher flip rates than the 1B model, particularly for PopQA and TriviaQA.
### Interpretation
The data suggests that both Llama models are sensitive to the way the prompt is framed: specifically, whether it emphasizes the question or the answer. The higher flip rate for Q-Anchored prompts indicates that the models are more likely to change their predictions when the focus is shifted to the question itself. This could be due to the models relying on subtle cues in the question to generate their answers, and these cues are more prominent when the question is explicitly emphasized.
The variation in flip rates across datasets suggests that the models' sensitivity to prompt framing is influenced by the characteristics of the dataset. Datasets like HotpotQA and TriviaQA, which may require more complex reasoning or knowledge retrieval, might be more susceptible to changes in prompt framing.
The slightly higher flip rates observed for the 3B model could indicate that larger models are more sensitive to subtle changes in input, potentially due to their increased capacity to capture and process complex relationships in the data. This sensitivity could be a double-edged sword, as it might lead to more accurate predictions in some cases but also make the models more vulnerable to adversarial attacks or prompt engineering.
</details>
<details>
<summary>x43.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama Models
### Overview
This image presents a comparative bar chart illustrating the prediction flip rate for two Llama models (Llama-3-8B and Llama-3-70B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured as a percentage and is shown for both "Q-Anchored (exact_question)" and "A-Anchored (exact_question)" conditions. The chart consists of two sub-charts, one for each Llama model, arranged side-by-side.
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 80)
* **Models:** Llama-3-8B (left chart), Llama-3-70B (right chart)
* **Legend:**
* Red: Q-Anchored (exact\_question)
* Gray: A-Anchored (exact\_question)
### Detailed Analysis
**Llama-3-8B Chart (Left)**
* **PopQA:**
* Q-Anchored: Approximately 62%
* A-Anchored: Approximately 30%
* **TriviaQA:**
* Q-Anchored: Approximately 82%
* A-Anchored: Approximately 42%
* **HotpotQA:**
* Q-Anchored: Approximately 45%
* A-Anchored: Approximately 10%
* **NQ:**
* Q-Anchored: Approximately 68%
* A-Anchored: Approximately 20%
The Q-Anchored bars consistently show higher flip rates than the A-Anchored bars across all datasets. The highest flip rate for this model is observed on the TriviaQA dataset for Q-Anchored.
**Llama-3-70B Chart (Right)**
* **PopQA:**
* Q-Anchored: Approximately 75%
* A-Anchored: Approximately 35%
* **TriviaQA:**
* Q-Anchored: Approximately 88%
* A-Anchored: Approximately 45%
* **HotpotQA:**
* Q-Anchored: Approximately 55%
* A-Anchored: Approximately 15%
* **NQ:**
* Q-Anchored: Approximately 50%
* A-Anchored: Approximately 25%
Similar to the Llama-3-8B chart, the Q-Anchored bars consistently show higher flip rates than the A-Anchored bars across all datasets. The highest flip rate for this model is observed on the TriviaQA dataset for Q-Anchored.
### Key Observations
* The Llama-3-70B model generally exhibits higher prediction flip rates than the Llama-3-8B model across all datasets and anchoring conditions.
* TriviaQA consistently shows the highest prediction flip rates for both models and both anchoring conditions.
* HotpotQA consistently shows the lowest prediction flip rates for both models and both anchoring conditions.
* Q-Anchored consistently has a higher flip rate than A-Anchored.
### Interpretation
The data suggests that the Llama models are more prone to "flipping" their predictions when the question itself is used as the anchor (Q-Anchored) compared to when the answer is used as the anchor (A-Anchored). This could indicate that the models are more sensitive to variations in the question phrasing than variations in the answer. The higher flip rates observed on the TriviaQA dataset might suggest that this dataset presents more challenging or ambiguous questions. The larger Llama-3-70B model demonstrates a greater sensitivity to these anchoring conditions, as evidenced by its higher overall flip rates. This could be due to its increased capacity to model complex relationships within the data, but also potentially indicates a greater susceptibility to overfitting or noise. The consistent pattern across datasets suggests a systematic behavior of the models rather than random fluctuations. The difference in flip rates between Q-Anchored and A-Anchored could be a metric for evaluating the robustness of the models to adversarial attacks or subtle changes in input.
</details>
<details>
<summary>x44.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Mistral Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two versions of the Mistral language model (Mistral-7B-v0.1 and Mistral-7B-v0.3) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured for both "Q-Anchored (exact_question)" and "A-Anchored (exact_question)" scenarios.
### Components/Axes
* **X-axis:** Dataset - PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** Prediction Flip Rate - Scale ranges from 0 to 80.
* **Models:** Two models are compared: Mistral-7B-v0.1 and Mistral-7B-v0.3. Each model has its own chart.
* **Anchoring:** Two anchoring methods are compared within each dataset:
* Q-Anchored (exact\_question) - Represented by a reddish-brown color.
* A-Anchored (exact\_question) - Represented by a gray color.
* **Legend:** Located at the bottom-center of the image, it clearly defines the color coding for each anchoring method.
### Detailed Analysis
**Mistral-7B-v0.1 Chart:**
* **PopQA:**
* Q-Anchored: Approximately 72.
* A-Anchored: Approximately 16.
* **TriviaQA:**
* Q-Anchored: Approximately 68.
* A-Anchored: Approximately 44.
* **HotpotQA:**
* Q-Anchored: Approximately 78.
* A-Anchored: Approximately 10.
* **NQ:**
* Q-Anchored: Approximately 74.
* A-Anchored: Approximately 32.
**Mistral-7B-v0.3 Chart:**
* **PopQA:**
* Q-Anchored: Approximately 64.
* A-Anchored: Approximately 28.
* **TriviaQA:**
* Q-Anchored: Approximately 80.
* A-Anchored: Approximately 48.
* **HotpotQA:**
* Q-Anchored: Approximately 76.
* A-Anchored: Approximately 14.
* **NQ:**
* Q-Anchored: Approximately 70.
* A-Anchored: Approximately 36.
In both charts, the Q-Anchored bars are consistently higher than the A-Anchored bars across all datasets.
### Key Observations
* The Q-Anchored flip rate is significantly higher than the A-Anchored flip rate for all datasets and both models.
* The HotpotQA dataset consistently shows the highest Q-Anchored flip rate for both models.
* The A-Anchored flip rate is generally low across all datasets, but varies between datasets.
* Mistral-7B-v0.3 generally shows a slightly lower Q-Anchored flip rate compared to Mistral-7B-v0.1, but a higher A-Anchored flip rate in some datasets.
### Interpretation
The data suggests that anchoring the prediction flip rate calculation to the exact question (Q-Anchored) results in a much higher flip rate compared to anchoring it to the exact answer (A-Anchored). This indicates that the model is more sensitive to changes in the question phrasing than changes in the answer. The higher flip rates observed on the HotpotQA dataset might suggest that this dataset presents more challenging or ambiguous questions. The slight differences between the two model versions (v0.1 and v0.3) suggest that model updates have a subtle impact on prediction stability, potentially improving robustness to answer variations while maintaining sensitivity to question variations. The large difference between Q and A anchored rates suggests that the model is more likely to change its prediction when the question is altered, even if the correct answer remains the same. This could be due to the model's reliance on specific keywords or phrasing in the question.
</details>
Figure 19: Prediction flip rate under token patching, probing attention activations of the last exact answer token.
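As a concrete reading of the metric reported in these figures, the flip rate can be computed from probe scores obtained on clean versus patched activations. The sketch below is illustrative only; the 0.5 threshold and the helper name are assumptions, not the paper's actual code:

```python
import numpy as np

def flip_rate(scores_before, scores_after, threshold=0.5):
    """Percentage of examples whose binary probe prediction changes after patching.

    scores_before, scores_after: probe truthfulness scores for the same examples,
    computed on clean vs. patched activations.
    """
    before = np.asarray(scores_before) >= threshold
    after = np.asarray(scores_after) >= threshold
    return 100.0 * float(np.mean(before != after))

# Two of four predictions cross the threshold after patching -> 50% flip rate.
print(flip_rate([0.9, 0.2, 0.8, 0.1], [0.1, 0.3, 0.9, 0.6]))  # 50.0
```

A bar of height 80 in these charts would then mean that patching changed the probe's verdict on roughly 80% of examples.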
<details>
<summary>x45.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two Llama models (Llama-3.2-1B and Llama-3.2-3B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The chart compares the flip rate when the prediction is anchored to the question (Q-Anchored) versus when it's anchored to the answer (A-Anchored). The chart is split into two sections, one for each model.
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 80)
* **Models:** Llama-3.2-1B (left chart), Llama-3.2-3B (right chart)
* **Legend:**
* Q-Anchored (exact\_question) - represented by a reddish-brown color.
* A-Anchored (exact\_question) - represented by a gray color.
### Detailed Analysis
**Llama-3.2-1B (Left Chart)**
* **PopQA:** Q-Anchored: Approximately 52. A-Anchored: Approximately 30.
* **TriviaQA:** Q-Anchored: Approximately 65. A-Anchored: Approximately 32.
* **HotpotQA:** Q-Anchored: Approximately 48. A-Anchored: Approximately 10.
* **NQ:** Q-Anchored: Approximately 78. A-Anchored: Approximately 28.
The Q-Anchored bars consistently show higher flip rates than the A-Anchored bars across all datasets. The trend is that the Q-Anchored flip rate is significantly higher for NQ and TriviaQA, and relatively lower for HotpotQA.
**Llama-3.2-3B (Right Chart)**
* **PopQA:** Q-Anchored: Approximately 58. A-Anchored: Approximately 30.
* **TriviaQA:** Q-Anchored: Approximately 68. A-Anchored: Approximately 32.
* **HotpotQA:** Q-Anchored: Approximately 52. A-Anchored: Approximately 12.
* **NQ:** Q-Anchored: Approximately 80. A-Anchored: Approximately 30.
Similar to the 1B model, the Q-Anchored bars are consistently higher than the A-Anchored bars. The trend is that the Q-Anchored flip rate is significantly higher for NQ and TriviaQA, and relatively lower for HotpotQA.
### Key Observations
* The Q-Anchored flip rate is consistently higher than the A-Anchored flip rate for both models across all datasets.
* The NQ dataset consistently shows the highest Q-Anchored flip rate for both models.
* The HotpotQA dataset consistently shows the lowest Q-Anchored flip rate for both models.
* The 3B model generally exhibits slightly higher Q-Anchored flip rates compared to the 1B model, particularly for PopQA and TriviaQA.
### Interpretation
The data suggests that patching the question tokens (Q-Anchored) produces a higher prediction flip rate than patching the answer tokens (A-Anchored) for both Llama models, implying that the probe's prediction is more sensitive to question information than to the answer tokens. The differences across datasets indicate that this sensitivity depends on the nature of the questions: the higher flip rates on NQ and TriviaQA may reflect greater reliance on question context, while the lower rates on HotpotQA suggest its examples depend less on it. The slight increase in flip rates from the 1B to the 3B model suggests that scaling can increase sensitivity to input variations, but the fundamental pattern of Q-Anchored rates exceeding A-Anchored rates remains consistent. This could be a characteristic of the model architecture or training data.
</details>
<details>
<summary>x46.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two Llama models (Llama-3-8B and Llama-3-70B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured on the Y-axis, while the datasets are displayed on the X-axis. Two types of anchoring are compared: Q-Anchored (based on the exact question) and A-Anchored (based on the exact answer).
### Components/Axes
* **X-axis:** "Dataset" with categories: PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** "Prediction Flip Rate" with a scale ranging from 0 to 60 (approximately).
* **Models:** Two separate charts are presented side-by-side, one for "Llama-3-8B" and one for "Llama-3-70B".
* **Legend:** Located at the bottom-center of the image.
* Red bars: "Q-Anchored (exact\_question)"
* Gray bars: "A-Anchored (exact\_question)"
### Detailed Analysis
**Llama-3-8B Chart:**
* **PopQA:** Q-Anchored: approximately 55. A-Anchored: approximately 25.
* **TriviaQA:** Q-Anchored: approximately 95. A-Anchored: approximately 50.
* **HotpotQA:** Q-Anchored: approximately 45. A-Anchored: approximately 10.
* **NQ:** Q-Anchored: approximately 60. A-Anchored: approximately 20.
**Llama-3-70B Chart:**
* **PopQA:** Q-Anchored: approximately 75. A-Anchored: approximately 50.
* **TriviaQA:** Q-Anchored: approximately 60. A-Anchored: approximately 25.
* **HotpotQA:** Q-Anchored: approximately 50. A-Anchored: approximately 20.
* **NQ:** Q-Anchored: approximately 40. A-Anchored: approximately 20.
**Trends:**
* In both models, the Q-Anchored bars are consistently higher than the A-Anchored bars across all datasets, indicating a higher prediction flip rate when anchoring on the question.
* For Llama-3-8B, the highest flip rate is observed for TriviaQA (Q-Anchored), and the lowest for HotpotQA (A-Anchored).
* For Llama-3-70B, the highest flip rate is observed for PopQA (Q-Anchored), and the lowest for HotpotQA (A-Anchored).
### Key Observations
* The Llama-3-70B model shows higher prediction flip rates than Llama-3-8B on PopQA and HotpotQA, but lower rates on TriviaQA and NQ.
* The difference between Q-Anchored and A-Anchored flip rates is more pronounced for the Llama-3-8B model.
* HotpotQA shows the lowest A-Anchored flip rate for both models, and the lowest Q-Anchored rate for Llama-3-8B (for Llama-3-70B, NQ is lowest).
### Interpretation
The data suggests that patching the question tokens (Q-Anchored) leads to a higher rate of prediction flips than patching the answer tokens (A-Anchored), indicating that the probe's prediction is more sensitive to question information than to the answer itself. The two models differ by dataset rather than uniformly: the 70B model flips more on PopQA and HotpotQA, while the 8B model flips more on TriviaQA and NQ, which may reflect differences in which facts each model knows or in their training data. The comparatively low A-Anchored flip rates on HotpotQA could indicate that its examples rely less on the patched tokens, or that the probes behave more stably on this dataset. Further investigation would be needed to determine the underlying reasons for these patterns.
</details>
<details>
<summary>x47.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Mistral Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two versions of the Mistral-7B model (v0.1 and v0.3) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The chart compares the flip rates for questions anchored to the original question ("Q-Anchored") versus those anchored to the answer ("A-Anchored").
### Components/Axes
* **X-axis:** Dataset - PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** Prediction Flip Rate - Scale ranges from 0 to 80.
* **Two Charts:** Side-by-side bar charts, one for Mistral-7B-v0.1 and one for Mistral-7B-v0.3.
* **Legend:** Located at the bottom-center of the image.
* Red: Q-Anchored (exact\_question)
* Gray: A-Anchored (exact\_question)
* **Titles:** Each chart has a title indicating the model version: "Mistral-7B-v0.1" and "Mistral-7B-v0.3".
### Detailed Analysis
**Mistral-7B-v0.1 Chart:**
* **PopQA:** Q-Anchored: Approximately 72. A-Anchored: Approximately 32.
* **TriviaQA:** Q-Anchored: Approximately 80. A-Anchored: Approximately 52.
* **HotpotQA:** Q-Anchored: Approximately 72. A-Anchored: Approximately 24.
* **NQ:** Q-Anchored: Approximately 80. A-Anchored: Approximately 32.
**Mistral-7B-v0.3 Chart:**
* **PopQA:** Q-Anchored: Approximately 64. A-Anchored: Approximately 36.
* **TriviaQA:** Q-Anchored: Approximately 80. A-Anchored: Approximately 52.
* **HotpotQA:** Q-Anchored: Approximately 68. A-Anchored: Approximately 24.
* **NQ:** Q-Anchored: Approximately 76. A-Anchored: Approximately 32.
**Trends:**
* In both models, the Q-Anchored flip rate is consistently higher than the A-Anchored flip rate across all datasets.
* For both models, the highest Q-Anchored flip rates are observed for TriviaQA and NQ datasets, reaching approximately 80.
* The A-Anchored flip rates are generally lower, ranging from approximately 24 to 52.
* The v0.3 model shows a slight decrease in Q-Anchored flip rates compared to v0.1 for PopQA, HotpotQA, and NQ.
### Key Observations
* The difference between Q-Anchored and A-Anchored flip rates is substantial, suggesting that anchoring to the question significantly impacts prediction stability.
* TriviaQA and NQ datasets consistently elicit higher flip rates for Q-Anchored questions.
* HotpotQA consistently shows the lowest A-Anchored flip rate.
* The v0.3 model appears to be slightly more stable than v0.1 for some datasets (PopQA, HotpotQA, NQ) based on the lower Q-Anchored flip rates.
### Interpretation
The data suggests that the probes for the Mistral models are far more sensitive to patching the question tokens (Q-Anchored) than the answer tokens (A-Anchored), indicating that the truthfulness signal relies chiefly on question context. The higher flip rates on TriviaQA and NQ might be due to the characteristics of the questions in those datasets. The slight decrease in Q-Anchored flip rates in v0.3 for certain datasets suggests a modest stability gain in the newer version, although the difference is not drastic. The consistently lower A-Anchored flip rates indicate that the prediction is comparatively stable when only answer tokens are replaced, consistent with the answer tokens carrying a weaker, more self-contained signal at this position. The gap between Q-Anchored and A-Anchored flip rates could serve as a metric for how question-dependent the probed signal is.
</details>
Figure 20: Prediction flip rate under token patching, probing mlp activations of the final token.
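The patching operation measured in these figures can be sketched abstractly: activations at the targeted token positions (question tokens for Q-Anchored, answer tokens for A-Anchored, or random positions as a control) are replaced with activations from a donor forward pass before the probe is applied. A minimal illustration, where the array layout and helper name are assumptions rather than the paper's implementation:

```python
import numpy as np

def patch_activations(base, donor, positions):
    """Replace rows of `base` at `positions` with the corresponding rows of `donor`.

    base, donor: [seq_len, hidden] activation matrices from two forward passes.
    positions: token indices to patch (e.g., the question span for Q-Anchored).
    """
    patched = base.copy()
    patched[positions] = donor[positions]
    return patched

base = np.zeros((6, 4))
donor = np.ones((6, 4))
out = patch_activations(base, donor, [1, 2])  # patch tokens 1-2 (a "question" span)
print(out.sum())  # 8.0: two patched rows of four ones each
```

The probe is then re-run on the patched activations, and a flip is recorded whenever its binary prediction differs from the clean run.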
<details>
<summary>x48.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two Llama models (Llama-3.2-1B and Llama-3.2-3B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured for both Q-Anchored (exact question) and A-Anchored (exact question) scenarios. The chart consists of two sub-charts, one for each model, positioned side-by-side.
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 50)
* **Models:** Llama-3.2-1B (left chart), Llama-3.2-3B (right chart)
* **Legend:**
* Q-Anchored (exact\_question) - represented by a reddish-brown color.
* A-Anchored (exact\_question) - represented by a gray color.
### Detailed Analysis
**Llama-3.2-1B (Left Chart)**
* **PopQA:** Q-Anchored: approximately 52, A-Anchored: approximately 8.
* **TriviaQA:** Q-Anchored: approximately 45, A-Anchored: approximately 22.
* **HotpotQA:** Q-Anchored: approximately 32, A-Anchored: approximately 12.
* **NQ:** Q-Anchored: approximately 48, A-Anchored: approximately 16.
The Q-Anchored bars consistently show higher flip rates than the A-Anchored bars across all datasets. The highest flip rate for this model is observed on the PopQA dataset for Q-Anchored questions.
**Llama-3.2-3B (Right Chart)**
* **PopQA:** Q-Anchored: approximately 32, A-Anchored: approximately 10.
* **TriviaQA:** Q-Anchored: approximately 52, A-Anchored: approximately 18.
* **HotpotQA:** Q-Anchored: approximately 40, A-Anchored: approximately 12.
* **NQ:** Q-Anchored: approximately 48, A-Anchored: approximately 16.
Similar to the 1B model, the Q-Anchored bars exhibit higher flip rates than the A-Anchored bars. The highest flip rate for this model is observed on the TriviaQA dataset for Q-Anchored questions.
### Key Observations
* The Q-Anchored flip rate is consistently higher than the A-Anchored flip rate for both models across all datasets.
* The Llama-3.2-3B model shows a lower flip rate than Llama-3.2-1B on PopQA, but higher rates on TriviaQA and HotpotQA, with NQ roughly equal.
* PopQA shows the highest Q-Anchored flip rate for the 1B model, but the lowest for the 3B model.
### Interpretation
The data suggests that which tokens are patched significantly impacts the prediction flip rate: patching the question tokens (Q-Anchored) produces a substantially higher rate of flips than patching the answer tokens (A-Anchored), indicating that the probe's prediction is more sensitive to question information.
The differences between the 1B and 3B models across datasets suggest that model size and dataset characteristics interact: the larger model is more stable on PopQA but less so on TriviaQA and HotpotQA, possibly due to differences in training data or question complexity.
The high Q-Anchored flip rate on PopQA, particularly for the 1B model, might indicate that this dataset presents a distinct challenge, potentially due to the nature of its questions or the distribution of answers; further investigation into its characteristics is warranted. Overall, small changes to the patched tokens can substantially change the probe's prediction.
</details>
<details>
<summary>x49.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two Llama models (Llama-3-8B and Llama-3-70B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The chart compares the flip rates for questions anchored to the original question ("Q-Anchored") versus those anchored to the answer ("A-Anchored").
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 60, with increments of 10)
* **Models:** Two separate charts are presented side-by-side, one for Llama-3-8B and one for Llama-3-70B.
* **Legend:** Located at the bottom-center of the image.
* **Q-Anchored (exact_question):** Represented by a reddish-brown color.
* **A-Anchored (exact_question):** Represented by a gray color.
### Detailed Analysis
**Llama-3-8B Chart (Left)**
* **PopQA:** The Q-Anchored bar has a height of approximately 52. The A-Anchored bar has a height of approximately 8.
* **TriviaQA:** The Q-Anchored bar has a height of approximately 58. The A-Anchored bar has a height of approximately 42.
* **HotpotQA:** The Q-Anchored bar has a height of approximately 42. The A-Anchored bar has a height of approximately 10.
* **NQ:** The Q-Anchored bar has a height of approximately 56. The A-Anchored bar has a height of approximately 24.
**Llama-3-70B Chart (Right)**
* **PopQA:** The Q-Anchored bar has a height of approximately 60. The A-Anchored bar has a height of approximately 6.
* **TriviaQA:** The Q-Anchored bar has a height of approximately 54. The A-Anchored bar has a height of approximately 36.
* **HotpotQA:** The Q-Anchored bar has a height of approximately 52. The A-Anchored bar has a height of approximately 12.
* **NQ:** The Q-Anchored bar has a height of approximately 46. The A-Anchored bar has a height of approximately 26.
In both charts, the Q-Anchored bars are consistently higher than the A-Anchored bars across all datasets. The Q-Anchored bars generally exhibit a similar height across the datasets, while the A-Anchored bars show more variation.
### Key Observations
* The Prediction Flip Rate is significantly higher for Q-Anchored prompts compared to A-Anchored prompts for both models.
* The two models show broadly similar Q-Anchored flip rates, with the 70B model higher on PopQA and HotpotQA and the 8B model higher on TriviaQA and NQ.
* The A-Anchored flip rates are lower than the Q-Anchored rates throughout, though they vary considerably across datasets (from roughly 6 to 42).
* PopQA shows the largest gap between Q-Anchored and A-Anchored flip rates for both models, while TriviaQA shows the smallest.
### Interpretation
The data suggests that patching the original question tokens ("Q-Anchored") leads to a substantially higher rate of prediction flips than patching the answer tokens ("A-Anchored"), implying that the probe's prediction depends more on question information than on the answer. The especially large gap on PopQA might indicate that predictions for this dataset rely almost entirely on the question, whereas on TriviaQA the answer tokens carry comparatively more signal.
Differences between the 8B and 70B models are dataset-dependent rather than uniform, suggesting that scale alone does not determine sensitivity to the patched tokens.
The comparatively low A-Anchored flip rates suggest that the prediction is relatively stable when only the answer tokens are replaced, consistent with the answer contributing a weaker, more self-contained signal at this position.
</details>
<details>
<summary>x50.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Mistral Models
### Overview
This image presents a comparative bar chart illustrating the prediction flip rate for two versions of the Mistral-7B language model (v0.1 and v0.3) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured as a percentage and is shown for both "Q-Anchored" (exact question) and "A-Anchored" (exact question) scenarios.
### Components/Axes
* **X-axis:** Dataset - PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** Prediction Flip Rate - Scale ranges from 0 to 60 (approximately).
* **Models:** Mistral-7B-v0.1 (left chart), Mistral-7B-v0.3 (right chart).
* **Legend:**
* Red: Q-Anchored (exact\_question)
* Gray: A-Anchored (exact\_question)
### Detailed Analysis
**Mistral-7B-v0.1 (Left Chart)**
* **PopQA:** Q-Anchored flip rate is approximately 62%. A-Anchored flip rate is approximately 24%.
* **TriviaQA:** Q-Anchored flip rate is approximately 68%. A-Anchored flip rate is approximately 46%.
* **HotpotQA:** Q-Anchored flip rate is approximately 44%. A-Anchored flip rate is approximately 8%.
* **NQ:** Q-Anchored flip rate is approximately 66%. A-Anchored flip rate is approximately 42%.
**Mistral-7B-v0.3 (Right Chart)**
* **PopQA:** Q-Anchored flip rate is approximately 64%. A-Anchored flip rate is approximately 12%.
* **TriviaQA:** Q-Anchored flip rate is approximately 70%. A-Anchored flip rate is approximately 48%.
* **HotpotQA:** Q-Anchored flip rate is approximately 48%. A-Anchored flip rate is approximately 10%.
* **NQ:** Q-Anchored flip rate is approximately 68%. A-Anchored flip rate is approximately 44%.
**Trends:**
* For both models, the Q-Anchored flip rate is consistently higher than the A-Anchored flip rate across all datasets.
* The Q-Anchored flip rate is generally high (above 60%) for all datasets in both models.
* The A-Anchored flip rate varies more significantly across datasets.
### Key Observations
* The largest difference between Q-Anchored and A-Anchored flip rates is observed in the PopQA dataset for both models.
* The HotpotQA dataset consistently shows the lowest A-Anchored flip rate for both models.
* The Mistral-7B-v0.3 model shows a markedly lower A-Anchored flip rate than v0.1 on PopQA, but slightly higher A-Anchored rates on the other three datasets.
### Interpretation
The data suggests that the probes for the Mistral models are far more sensitive to patching the question tokens (Q-Anchored) than the answer tokens (A-Anchored), indicating that the prediction relies heavily on question context; even at this position, disrupting question information flips the majority of predictions on most datasets.
The variation in flip rates across datasets likely reflects their differing characteristics. For example, the very low A-Anchored flip rate on HotpotQA might indicate that answer tokens contribute little to the prediction there.
Between versions, v0.3 shows a markedly lower A-Anchored flip rate on PopQA but slightly higher A-Anchored rates on the other datasets, so no uniform robustness improvement can be claimed; further analysis would be needed to confirm any trend. The consistently high Q-Anchored flip rates across both versions suggest that the core sensitivity to question information remains.
</details>
Figure 21: Prediction flip rate under token patching, probing mlp activations of the token immediately preceding the exact answer tokens.
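Several of the interpretations above treat the gap between Q-Anchored and A-Anchored flip rates as a rough measure of how question-dependent the probed signal is. Computing it is trivial; as an illustration, using the approximate Mistral-7B-v0.3 values read off Figure 21 (chart estimates, not exact results):

```python
# Approximate Q-/A-Anchored flip rates (percent) read off Figure 21, Mistral-7B-v0.3.
q_rates = {"PopQA": 64, "TriviaQA": 70, "HotpotQA": 48, "NQ": 68}
a_rates = {"PopQA": 12, "TriviaQA": 48, "HotpotQA": 10, "NQ": 44}

# Gap in percentage points: larger values mean the probed signal depends
# more on question-token information than on answer-token information.
gap = {d: q_rates[d] - a_rates[d] for d in q_rates}
print(gap)  # {'PopQA': 52, 'TriviaQA': 22, 'HotpotQA': 38, 'NQ': 24}
```

On this reading, PopQA is the most question-dependent dataset at this probing position and TriviaQA the least.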
<details>
<summary>x51.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama-3.2-1B and Llama-3.2-3B
### Overview
This image presents two side-by-side bar charts comparing the Prediction Flip Rate for two language models, Llama-3.2-1B and Llama-3.2-3B, across four datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured for both "Q-Anchored" (based on the exact question) and "A-Anchored" (based on the exact answer) prompts, with both "exact_question" and "random" variations within each anchoring method.
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 80)
* **Models:** Llama-3.2-1B (left chart), Llama-3.2-3B (right chart)
* **Legend:**
* Q-Anchored (exact_question) - Light Red
* Q-Anchored (random) - Dark Red
* A-Anchored (exact_question) - Light Gray
* A-Anchored (random) - Dark Gray
* **Title:** "Llama-3.2-1B" (above left chart), "Llama-3.2-3B" (above right chart)
* **Legend Position:** Bottom-center, spanning both charts.
### Detailed Analysis or Content Details
**Llama-3.2-1B (Left Chart)**
* **PopQA:**
* Q-Anchored (exact_question): Approximately 45
* Q-Anchored (random): Approximately 5
* A-Anchored (exact_question): Approximately 10
* A-Anchored (random): Approximately 2
* **TriviaQA:**
* Q-Anchored (exact_question): Approximately 70
* Q-Anchored (random): Approximately 10
* A-Anchored (exact_question): Approximately 25
* A-Anchored (random): Approximately 5
* **HotpotQA:**
* Q-Anchored (exact_question): Approximately 75
* Q-Anchored (random): Approximately 10
* A-Anchored (exact_question): Approximately 10
* A-Anchored (random): Approximately 2
* **NQ:**
* Q-Anchored (exact_question): Approximately 30
* Q-Anchored (random): Approximately 5
* A-Anchored (exact_question): Approximately 10
* A-Anchored (random): Approximately 2
**Llama-3.2-3B (Right Chart)**
* **PopQA:**
* Q-Anchored (exact_question): Approximately 60
* Q-Anchored (random): Approximately 10
* A-Anchored (exact_question): Approximately 20
* A-Anchored (random): Approximately 5
* **TriviaQA:**
* Q-Anchored (exact_question): Approximately 75
* Q-Anchored (random): Approximately 15
* A-Anchored (exact_question): Approximately 30
* A-Anchored (random): Approximately 10
* **HotpotQA:**
* Q-Anchored (exact_question): Approximately 80
* Q-Anchored (random): Approximately 15
* A-Anchored (exact_question): Approximately 15
* A-Anchored (random): Approximately 5
* **NQ:**
* Q-Anchored (exact_question): Approximately 50
* Q-Anchored (random): Approximately 10
* A-Anchored (exact_question): Approximately 15
* A-Anchored (random): Approximately 5
### Key Observations
* **Q-Anchored (exact_question)** consistently shows the highest flip rates across all datasets for both models.
* **A-Anchored (random)** consistently shows the lowest flip rates across all datasets for both models.
* The Llama-3.2-3B model generally exhibits higher flip rates than the Llama-3.2-1B model across all datasets and anchoring methods.
* TriviaQA and HotpotQA datasets consistently show higher flip rates than PopQA and NQ datasets.
* The difference between "exact_question" and "random" variations is more pronounced for Q-Anchored prompts than for A-Anchored prompts.
### Interpretation
The data suggests that which tokens are patched strongly determines the prediction flip rate. Patching the exact question tokens (Q-Anchored, exact_question) flips far more predictions than patching random tokens or answer tokens, indicating that the probe's prediction is driven chiefly by question-specific information. The larger Llama-3.2-3B model shows generally higher flip rates, suggesting greater sensitivity to these interventions.
The higher flip rates on TriviaQA and HotpotQA might be attributed to the greater complexity of these datasets. The consistently low flip rates for A-Anchored (random) patching suggest the prediction is stable when only incidental answer-side tokens are disturbed.
The large gap between exact_question and random patching shows that the effect is specific to the question tokens rather than an artifact of perturbing activations in general.
</details>
<details>
<summary>x52.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Llama-3 Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two Llama-3 models (8B and 70B) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured for both Question-Anchored (Q-Anchored) and Answer-Anchored (A-Anchored) scenarios, with variations based on whether the anchoring is done using the exact question or a random question.
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 80)
* **Models:** Two separate charts, one for Llama-3-8B (left) and one for Llama-3-70B (right).
* **Legend:** Located at the bottom-center of the image.
* Q-Anchored (exact\_question) - Red
* Q-Anchored (random) - Dark Red
* A-Anchored (exact\_question) - Light Gray
* A-Anchored (random) - Dark Gray
### Detailed Analysis
**Llama-3-8B (Left Chart)**
* **PopQA:**
* Q-Anchored (exact\_question): Approximately 72.
* Q-Anchored (random): Approximately 10.
* A-Anchored (exact\_question): Approximately 32.
* A-Anchored (random): Approximately 12.
* **TriviaQA:**
* Q-Anchored (exact\_question): Approximately 76.
* Q-Anchored (random): Approximately 12.
* A-Anchored (exact\_question): Approximately 24.
* A-Anchored (random): Approximately 10.
* **HotpotQA:**
* Q-Anchored (exact\_question): Approximately 72.
* Q-Anchored (random): Approximately 16.
* A-Anchored (exact\_question): Approximately 16.
* A-Anchored (random): Approximately 8.
* **NQ:**
* Q-Anchored (exact\_question): Approximately 72.
* Q-Anchored (random): Approximately 16.
* A-Anchored (exact\_question): Approximately 16.
* A-Anchored (random): Approximately 8.
**Llama-3-70B (Right Chart)**
* **PopQA:**
* Q-Anchored (exact\_question): Approximately 72.
* Q-Anchored (random): Approximately 24.
* A-Anchored (exact\_question): Approximately 36.
* A-Anchored (random): Approximately 16.
* **TriviaQA:**
* Q-Anchored (exact\_question): Approximately 76.
* Q-Anchored (random): Approximately 20.
* A-Anchored (exact\_question): Approximately 28.
* A-Anchored (random): Approximately 12.
* **HotpotQA:**
* Q-Anchored (exact\_question): Approximately 72.
* Q-Anchored (random): Approximately 20.
* A-Anchored (exact\_question): Approximately 16.
* A-Anchored (random): Approximately 8.
* **NQ:**
* Q-Anchored (exact\_question): Approximately 72.
* Q-Anchored (random): Approximately 20.
* A-Anchored (exact\_question): Approximately 16.
* A-Anchored (random): Approximately 8.
**Trends:**
* In both models, Q-Anchored (exact\_question) consistently exhibits the highest prediction flip rate across all datasets.
* A-Anchored (random) generally shows the lowest prediction flip rate, with Q-Anchored (random) also low.
* A-Anchored (exact\_question) generally has a higher flip rate than A-Anchored (random).
* The 70B model generally shows higher flip rates for A-Anchored scenarios compared to the 8B model.
### Key Observations
* The difference between Q-Anchored (exact\_question) and Q-Anchored (random) is substantial, indicating that using the exact question for anchoring significantly impacts prediction flip rate.
* The gap between A-Anchored (exact\_question) and A-Anchored (random) is similar for the two models, with the 70B model's A-Anchored rates slightly higher overall.
* The prediction flip rate is relatively consistent across the datasets for Q-Anchored (exact\_question).
### Interpretation
The data suggests that patching the exact question tokens (Q-Anchored (exact\_question)) is by far the most effective intervention for flipping predictions, yielding the highest flip rates across all datasets and both models; the truthfulness signal at this position is thus highly dependent on question-specific information. The much lower flip rates under random-token patching show that the effect is not an artifact of perturbing activations in general.
The slightly higher A-Anchored flip rates for the 70B model suggest that the larger model draws somewhat more on answer-related information. The consistency of the Q-Anchored (exact\_question) flip rate across datasets implies that this dependence is robust and generalizable.
Here, the prediction flip rate measures how often the probe's truthfulness prediction changes after the specified tokens' activations are patched. This metric is valuable for understanding which parts of the input the truthfulness signal depends on, and the results highlight the importance of the patched positions when interpreting these probes.
</details>
<details>
<summary>x53.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate for Mistral Models
### Overview
This image presents a comparative bar chart illustrating the Prediction Flip Rate for two versions of the Mistral-7B model (v0.1 and v0.3) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The flip rate is measured for two anchoring methods: Q-Anchored (based on the exact question) and A-Anchored (based on the exact answer), each with both exact and random variations.
### Components/Axes
* **X-axis:** Dataset (PopQA, TriviaQA, HotpotQA, NQ)
* **Y-axis:** Prediction Flip Rate (ranging from 0 to 80, with increments of 10)
* **Models:** Two separate charts, one for Mistral-7B-v0.1 and one for Mistral-7B-v0.3. Each chart has the same X and Y axes.
* **Legend:**
* Q-Anchored (exact\_question) - Light Red
* Q-Anchored (random) - Dark Red
* A-Anchored (exact\_question) - Light Gray
* A-Anchored (random) - Dark Gray
### Detailed Analysis
**Mistral-7B-v0.1 Chart:**
* **PopQA:**
* Q-Anchored (exact\_question): Approximately 80.
* Q-Anchored (random): Approximately 10.
* A-Anchored (exact\_question): Approximately 30.
* A-Anchored (random): Approximately 0.
* **TriviaQA:**
* Q-Anchored (exact\_question): Approximately 75.
* Q-Anchored (random): Approximately 30.
* A-Anchored (exact\_question): Approximately 45.
* A-Anchored (random): Approximately 10.
* **HotpotQA:**
* Q-Anchored (exact\_question): Approximately 15.
* Q-Anchored (random): Approximately 10.
* A-Anchored (exact\_question): Approximately 10.
* A-Anchored (random): Approximately 5.
* **NQ:**
* Q-Anchored (exact\_question): Approximately 80.
* Q-Anchored (random): Approximately 10.
* A-Anchored (exact\_question): Approximately 20.
* A-Anchored (random): Approximately 10.
**Mistral-7B-v0.3 Chart:**
* **PopQA:**
* Q-Anchored (exact\_question): Approximately 80.
* Q-Anchored (random): Approximately 10.
* A-Anchored (exact\_question): Approximately 20.
* A-Anchored (random): Approximately 0.
* **TriviaQA:**
* Q-Anchored (exact\_question): Approximately 80.
* Q-Anchored (random): Approximately 20.
* A-Anchored (exact\_question): Approximately 40.
* A-Anchored (random): Approximately 10.
* **HotpotQA:**
* Q-Anchored (exact\_question): Approximately 20.
* Q-Anchored (random): Approximately 10.
* A-Anchored (exact\_question): Approximately 10.
* A-Anchored (random): Approximately 5.
* **NQ:**
* Q-Anchored (exact\_question): Approximately 80.
* Q-Anchored (random): Approximately 10.
* A-Anchored (exact\_question): Approximately 20.
* A-Anchored (random): Approximately 10.
### Key Observations
* For both models, Q-Anchored (exact\_question) consistently exhibits the highest prediction flip rate across all datasets, particularly on PopQA, TriviaQA, and NQ.
* Q-Anchored (random) consistently shows the lowest prediction flip rate.
* A-Anchored methods generally have lower flip rates than Q-Anchored methods.
* HotpotQA consistently shows the lowest flip rates across all anchoring methods and both models.
* The v0.3 model shows a slight decrease in flip rate for A-Anchored (exact\_question) compared to v0.1 on PopQA and TriviaQA.
### Interpretation
The data suggests that Q-Anchored instances are highly sensitive to interventions on the question tokens: patching in the exact question produces high prediction flip rates, whereas patching in a random question does not, consistent with the Question-Anchored pathway's reliance on question-answer information flow.
The lower flip rates for A-Anchored instances indicate that their truthfulness encoding is comparatively stable under question patching, suggesting that their evidence is anchored in the generated answer rather than in the question.
The consistently low flip rates on the HotpotQA dataset might indicate that this dataset is easier for the model to handle, or that the model has been specifically trained to perform well on this type of question.
The slight improvements in the v0.3 model compared to v0.1, particularly in the A-Anchored scenarios, suggest that the model updates have improved stability and reduced sensitivity to anchoring. The difference is subtle, however, indicating that the core behavior remains similar between the two versions.
</details>
Figure 22: Prediction flip rate under token patching, probing MLP activations of the last exact answer token.
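The flip rate reported in these figures can be sketched as follows. This is a minimal illustration, not the paper's actual code: `prediction_flip_rate` is a hypothetical helper comparing binary truthfulness-probe predictions collected before and after the token-patching intervention.

```python
import numpy as np

def prediction_flip_rate(preds_before, preds_after):
    """Percentage of instances whose probe prediction changes after patching.

    preds_before / preds_after: binary truthfulness-probe predictions for the
    same instances, collected before and after the token-patching intervention.
    """
    preds_before = np.asarray(preds_before)
    preds_after = np.asarray(preds_after)
    return float(np.mean(preds_before != preds_after)) * 100.0

# Three of four predictions flip after patching the question tokens.
rate = prediction_flip_rate([1, 1, 0, 0], [0, 0, 1, 0])  # 75.0
```

A high rate under exact-question patching and a low rate under random-question patching is the signature the figures above attribute to Q-Anchored instances.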
Appendix E Answer-Only Input
<details>
<summary>x54.png Details</summary>

### Visual Description
## Bar Chart: Performance Comparison of Llama Models on Question Answering Datasets
### Overview
The image presents a comparative bar chart illustrating the probability shift (ΔP) for two Llama models, Llama-3.2-1B and Llama-3.2-3B, across four different question answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. The metric ΔP represents the change in the probe's probability score. The chart uses two bars for each dataset, representing "Q-Anchored" and "A-Anchored" instances.
### Components/Axes
* **X-axis:** "Dataset" with categories: PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** "−ΔP" (negative ΔP), with a scale ranging from 0 to 60, incrementing by 10.
* **Legend:** Located at the bottom-center of the image.
* "Q-Anchored": represented by a light red color (approximately #F08080).
* "A-Anchored": represented by a gray color (approximately #808080).
* **Titles:** Two titles are present, one above each chart: "Llama-3.2-1B" and "Llama-3.2-3B".
### Detailed Analysis
The chart is divided into two sections, one for each Llama model.
**Llama-3.2-1B:**
* **PopQA:** Q-Anchored is approximately 44, A-Anchored is approximately 8.
* **TriviaQA:** Q-Anchored is approximately 55, A-Anchored is approximately 16.
* **HotpotQA:** Q-Anchored is approximately 62, A-Anchored is approximately 18.
* **NQ:** Q-Anchored is approximately 28, A-Anchored is approximately 10.
**Llama-3.2-3B:**
* **PopQA:** Q-Anchored is approximately 22, A-Anchored is approximately 6.
* **TriviaQA:** Q-Anchored is approximately 60, A-Anchored is approximately 12.
* **HotpotQA:** Q-Anchored is approximately 54, A-Anchored is approximately 16.
* **NQ:** Q-Anchored is approximately 30, A-Anchored is approximately 8.
**Trends:**
* For both models, the Q-Anchored bars are consistently higher than the A-Anchored bars across all datasets, indicating a much larger probability shift for Q-Anchored instances when only the answer is provided.
* The Q-Anchored shift is largest on the TriviaQA and HotpotQA datasets for both models.
* The A-Anchored shift is small and relatively consistent across all datasets for both models.
* The Llama-3.2-3B model shows lower Q-Anchored values than the Llama-3.2-1B model on PopQA and HotpotQA, but higher values on TriviaQA and NQ.
### Key Observations
* The gap between the Q-Anchored and A-Anchored shifts is substantial, suggesting that the two pathways respond very differently to removing the question.
* The magnitude of the shift varies considerably depending on the dataset.
* The Llama-3.2-3B model shows a different profile compared to the Llama-3.2-1B model, particularly on the PopQA dataset.
### Interpretation
The data suggests that removing the question sharply reduces the probe's truthfulness probability for Q-Anchored instances, while A-Anchored instances are barely affected. This is consistent with the Q-Anchored pathway depending on question-answer information flow, whereas the A-Anchored pathway derives self-contained evidence from the generated answer itself. The varying magnitudes across datasets suggest that the balance between the two pathways is dataset-dependent, potentially reflecting differences in question complexity, answer format, or required domain knowledge. The lower Q-Anchored shift of Llama-3.2-3B on PopQA indicates that the effect also varies with model scale, though the overall pattern is shared by both models.
</details>
<details>
<summary>x55.png Details</summary>

### Visual Description
## Bar Chart: Performance Comparison of Llama-3 Models
### Overview
This image presents a bar chart comparing two Llama-3 models (8B and 70B) across four datasets: PopQA, TriviaQA, HotpotQA, and NQ. The metric is "−ΔP", the drop in probe probability under answer-only input. The chart uses paired bars for each dataset, representing "Q-Anchored" and "A-Anchored" conditions.
### Components/Axes
* **X-axis:** "Dataset" with categories: PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** "−ΔP" with a scale ranging from 0 to 60, incrementing by 10.
* **Models:** Two separate charts are presented side-by-side, one for "Llama-3-8B" and one for "Llama-3-70B".
* **Legend:** Located at the bottom-center of the image.
* "Q-Anchored" (represented by a reddish-brown color)
* "A-Anchored" (represented by a gray color)
### Detailed Analysis
The chart consists of two sets of four paired bar graphs.
**Llama-3-8B:**
* **PopQA:** Q-Anchored is approximately 52, A-Anchored is approximately 8.
* **TriviaQA:** Q-Anchored is approximately 62, A-Anchored is approximately 12.
* **HotpotQA:** Q-Anchored is approximately 48, A-Anchored is approximately 22.
* **NQ:** Q-Anchored is approximately 28, A-Anchored is approximately 6.
**Llama-3-70B:**
* **PopQA:** Q-Anchored is approximately 48, A-Anchored is approximately 10.
* **TriviaQA:** Q-Anchored is approximately 60, A-Anchored is approximately 8.
* **HotpotQA:** Q-Anchored is approximately 44, A-Anchored is approximately 22.
* **NQ:** Q-Anchored is approximately 44, A-Anchored is approximately 8.
In both models, the Q-Anchored bars are consistently higher than the A-Anchored bars across all datasets. The Q-Anchored bars show a generally decreasing trend from PopQA to NQ, while the A-Anchored bars remain relatively low.
### Key Observations
* The "Q-Anchored" condition consistently shows a much larger −ΔP than the "A-Anchored" condition for both models across all datasets.
* The 70B model generally shows slightly lower values for Q-Anchored compared to the 8B model on PopQA and HotpotQA, but similar values on TriviaQA and NQ.
* The difference between Q-Anchored and A-Anchored is most pronounced on TriviaQA for both models.
* The A-Anchored values are relatively stable across all datasets for both models.
### Interpretation
The data suggests that removing the question causes a far larger drop in probe probability for Q-Anchored instances than for A-Anchored instances on these question answering datasets. Since −ΔP measures the loss of the truthfulness signal under answer-only input, the consistently high Q-Anchored values indicate that this pathway depends on question-answer information flow, while the stable A-Anchored values indicate that its evidence is self-contained in the generated answer.
The slight differences between the 8B and 70B models suggest that model size has a limited impact on this contrast. The relatively stable A-Anchored values indicate that this pathway is largely insensitive to the presence of the question, and the large gap on TriviaQA suggests that this dataset is particularly sensitive to removing the question.
Here, "Q-Anchored" and "A-Anchored" refer to the two truthfulness pathways: instances whose truthfulness encoding is anchored in the question-answer information flow versus those anchored in the generated answer itself.
</details>
<details>
<summary>x56.png Details</summary>

### Visual Description
## Bar Chart: ΔP Comparison of Mistral Models
### Overview
The image presents a bar chart comparing the probability shift (ΔP) of two Mistral language models (Mistral-7B-v0.1 and Mistral-7B-v0.3) across four different datasets: PopQA, TriviaQA, HotpotQA, and NQ. The shift is measured for two instance groups: Q-Anchored and A-Anchored.
### Components/Axes
* **X-axis:** "Dataset" with categories: PopQA, TriviaQA, HotpotQA, NQ.
* **Y-axis:** "ΔP" (Delta P), ranging from 0 to 80, with tick marks at 0, 20, 40, 60, and 80.
* **Legend:** Located at the bottom-center of the image.
* "Q-Anchored" - represented by a light red color.
* "A-Anchored" - represented by a gray color.
* **Titles:** Two titles are present, one above each set of bars:
* "Mistral-7B-v0.1"
* "Mistral-7B-v0.3"
### Detailed Analysis
**Mistral-7B-v0.1:**
* **PopQA:** Q-Anchored: approximately 78. A-Anchored: approximately 24.
* **TriviaQA:** Q-Anchored: approximately 75. A-Anchored: approximately 8.
* **HotpotQA:** Q-Anchored: approximately 50. A-Anchored: approximately 20.
* **NQ:** Q-Anchored: approximately 52. A-Anchored: approximately 28.
**Mistral-7B-v0.3:**
* **PopQA:** Q-Anchored: approximately 80. A-Anchored: approximately 20.
* **TriviaQA:** Q-Anchored: approximately 60. A-Anchored: approximately 6.
* **HotpotQA:** Q-Anchored: approximately 48. A-Anchored: approximately 24.
* **NQ:** Q-Anchored: approximately 62. A-Anchored: approximately 30.
For both models, the Q-Anchored bars are consistently higher than the A-Anchored bars across all datasets, indicating larger probability shifts for Q-Anchored instances under answer-only input.
### Key Observations
* The difference in ΔP between Q-Anchored and A-Anchored instances is substantial across all datasets for both models.
* Mistral-7B-v0.3 shows a slightly larger Q-Anchored shift than v0.1 on PopQA and NQ, but a smaller one on TriviaQA.
* PopQA yields the highest Q-Anchored ΔP values for both models.
* TriviaQA yields the lowest A-Anchored ΔP values for both models.
### Interpretation
The data suggests that, for both Mistral-7B-v0.1 and Mistral-7B-v0.3, removing the question produces a much larger drop in probe probability for Q-Anchored instances than for A-Anchored instances. This implies that the Q-Anchored truthfulness signal depends on the question, whereas the A-Anchored signal is largely self-contained in the generated answer. The varying magnitudes across datasets suggest that the balance between the two pathways is influenced by dataset characteristics (e.g., complexity, domain): the large Q-Anchored ΔP on PopQA indicates a strong reliance on the question there, while the very small A-Anchored ΔP on TriviaQA indicates an especially self-contained answer signal. The consistent trend across both model versions suggests a robust finding rather than a dataset-specific anomaly.
</details>
Figure 23: $-\Delta\mathrm{P}$ with only the LLM-generated answer. Q-Anchored instances exhibit substantial shifts, whereas A-Anchored instances remain stable, confirming that A-Anchored truthfulness encoding relies on information in the LLM-generated answer itself.
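The $-\Delta\mathrm{P}$ quantity can be sketched as the mean drop in the probe's truthfulness probability when the question is withheld. A minimal illustration with hypothetical numbers (`neg_delta_p` is not the paper's implementation):

```python
import numpy as np

def neg_delta_p(p_full_input, p_answer_only):
    """Mean drop in probe truthfulness probability (in points) when the
    question is removed and only the LLM-generated answer is provided.

    A large value means the truthfulness signal depends on the question;
    a value near zero means the signal is self-contained in the answer.
    """
    p_full_input = np.asarray(p_full_input, dtype=float)
    p_answer_only = np.asarray(p_answer_only, dtype=float)
    return float(np.mean(p_full_input - p_answer_only)) * 100.0

# Q-Anchored-style instances: probability collapses without the question.
q_shift = neg_delta_p([0.9, 0.8, 0.85], [0.4, 0.3, 0.35])
# A-Anchored-style instances: probability barely moves.
a_shift = neg_delta_p([0.9, 0.8, 0.85], [0.88, 0.78, 0.84])
```

Under this reading, the tall Q-Anchored bars and short A-Anchored bars in Figure 23 correspond to `q_shift`-like and `a_shift`-like values, respectively.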
Appendix F Answer Accuracy
<details>
<summary>x57.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of different question-answering (QA) datasets across layers of two Llama models: Llama-3.2-1B and Llama-3.2-3B. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each line represents a different QA dataset and anchoring method (Q-Anchored or A-Anchored). The charts are positioned side-by-side for direct comparison.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 15 for the 1B model and 0 to 25 for the 3B model).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3.2-1B
* **Right Chart Title:** Llama-3.2-3B
* **Legend:** Located at the bottom of the image, containing the following labels and corresponding colors:
* Q-Anchored (PopQA) - Blue
* A-Anchored (PopQA) - Orange
* Q-Anchored (TriviaQA) - Green
* A-Anchored (TriviaQA) - Pink
* Q-Anchored (HotpotQA) - Light Blue (dashed)
* A-Anchored (HotpotQA) - Purple (dashed)
* Q-Anchored (NQ) - Dark Blue
* A-Anchored (NQ) - Brown
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart (Left):**
* **Q-Anchored (PopQA) - Blue:** Starts at approximately 90% accuracy at layer 0, dips to around 50% at layer 2, then fluctuates between 60-80% for layers 3-15.
* **A-Anchored (PopQA) - Orange:** Starts at approximately 20% accuracy at layer 0, rises to around 40% at layer 3, and remains relatively stable between 30-50% for layers 4-15.
* **Q-Anchored (TriviaQA) - Green:** Starts at approximately 20% accuracy at layer 0, rises to around 80% at layer 5, then fluctuates between 60-90% for layers 6-15.
* **A-Anchored (TriviaQA) - Pink:** Starts at approximately 20% accuracy at layer 0, rises to around 60% at layer 5, then fluctuates between 40-70% for layers 6-15.
* **Q-Anchored (HotpotQA) - Light Blue (dashed):** Starts at approximately 60% accuracy at layer 0, dips to around 20% at layer 2, then fluctuates between 40-70% for layers 3-15.
* **A-Anchored (HotpotQA) - Purple (dashed):** Starts at approximately 40% accuracy at layer 0, dips to around 20% at layer 2, then fluctuates between 30-50% for layers 3-15.
* **Q-Anchored (NQ) - Dark Blue:** Starts at approximately 60% accuracy at layer 0, dips to around 30% at layer 2, then fluctuates between 40-60% for layers 3-15.
* **A-Anchored (NQ) - Brown:** Starts at approximately 20% accuracy at layer 0, rises to around 40% at layer 3, and remains relatively stable between 30-50% for layers 4-15.
**Llama-3.2-3B Chart (Right):**
* **Q-Anchored (PopQA) - Blue:** Starts at approximately 90% accuracy at layer 0, dips to around 50% at layer 2, then fluctuates between 60-90% for layers 3-25.
* **A-Anchored (PopQA) - Orange:** Starts at approximately 20% accuracy at layer 0, rises to around 40% at layer 3, and remains relatively stable between 30-50% for layers 4-25.
* **Q-Anchored (TriviaQA) - Green:** Starts at approximately 20% accuracy at layer 0, rises to around 90% at layer 5, then fluctuates between 60-90% for layers 6-25.
* **A-Anchored (TriviaQA) - Pink:** Starts at approximately 20% accuracy at layer 0, rises to around 60% at layer 5, then fluctuates between 40-70% for layers 6-25.
* **Q-Anchored (HotpotQA) - Light Blue (dashed):** Starts at approximately 60% accuracy at layer 0, dips to around 20% at layer 2, then fluctuates between 40-80% for layers 3-25.
* **A-Anchored (HotpotQA) - Purple (dashed):** Starts at approximately 40% accuracy at layer 0, dips to around 20% at layer 2, then fluctuates between 30-50% for layers 3-25.
* **Q-Anchored (NQ) - Dark Blue:** Starts at approximately 60% accuracy at layer 0, dips to around 30% at layer 2, then fluctuates between 40-60% for layers 3-25.
* **A-Anchored (NQ) - Brown:** Starts at approximately 20% accuracy at layer 0, rises to around 40% at layer 3, and remains relatively stable between 30-50% for layers 4-25.
### Key Observations
* **Q-Anchored generally outperforms A-Anchored:** Across all datasets, probes on Q-Anchored instances consistently achieve higher answer accuracy than on A-Anchored instances.
* **PopQA shows high initial accuracy:** The PopQA dataset, when Q-Anchored, starts with the highest accuracy in both models.
* **Accuracy fluctuates with layer:** Most datasets exhibit fluctuations in accuracy as the layer index increases, indicating that probe accuracy does not improve monotonically with depth.
* **3B model shows more sustained accuracy:** The Llama-3.2-3B model generally maintains higher accuracy levels across layers than the Llama-3.2-1B model.
* **Initial dip in accuracy:** Many lines dip around layer 2, suggesting an early-layer transition before the relevant features emerge.
### Interpretation
The data suggests that answers on Q-Anchored instances are recovered more accurately by layer-wise probes than answers on A-Anchored instances for these Llama models. The higher accuracy of the 3B model indicates that increased scale yields representations from which the answer is easier to decode. The fluctuations across layers show that answer information is not distributed uniformly with depth; the early dip may correspond to a transition from surface-level to more abstract features. The PopQA dataset's high initial accuracy suggests its answers may be more directly encoded, or more aligned with pre-training data. The gap between the 1B and 3B models highlights the role of model scale in how strongly answer information is represented.
</details>
<details>
<summary>x58.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of two Llama models (Llama-3-8B and Llama-3-70B) across different layers. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each chart displays multiple lines, each representing a different question-answering dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3-8B
* **Right Chart Title:** Llama-3-70B
* **Legend:**
* Q-Anchored (PopQA) - Blue line
* A-Anchored (PopQA) - Light Brown line
* Q-Anchored (TriviaQA) - Purple line
* A-Anchored (TriviaQA) - Light Purple line
* Q-Anchored (HotpotQA) - Green line
* A-Anchored (HotpotQA) - Light Green line
* Q-Anchored (NQ) - Red line
* A-Anchored (NQ) - Orange line
### Detailed Analysis or Content Details
**Llama-3-8B Chart (Left):**
* **Q-Anchored (PopQA):** The blue line starts at approximately 10, rises sharply to around 90 by layer 5, then fluctuates between 60 and 90 for the remainder of the layers, ending at approximately 85.
* **A-Anchored (PopQA):** The light brown line starts at approximately 10, rises to around 40 by layer 5, and remains relatively stable between 30 and 50 for the rest of the layers, ending at approximately 40.
* **Q-Anchored (TriviaQA):** The purple line starts at approximately 10, rises to around 80 by layer 5, and fluctuates between 60 and 90 for the remainder of the layers, ending at approximately 75.
* **A-Anchored (TriviaQA):** The light purple line starts at approximately 10, rises to around 40 by layer 5, and remains relatively stable between 30 and 50 for the rest of the layers, ending at approximately 40.
* **Q-Anchored (HotpotQA):** The green line starts at approximately 10, rises to around 85 by layer 5, and fluctuates between 60 and 90 for the remainder of the layers, ending at approximately 80.
* **A-Anchored (HotpotQA):** The light green line starts at approximately 10, rises to around 40 by layer 5, and remains relatively stable between 30 and 50 for the rest of the layers, ending at approximately 40.
* **Q-Anchored (NQ):** The red line starts at approximately 10, rises to around 60 by layer 5, and fluctuates between 40 and 70 for the remainder of the layers, ending at approximately 60.
* **A-Anchored (NQ):** The orange line starts at approximately 10, rises to around 40 by layer 5, and remains relatively stable between 30 and 50 for the rest of the layers, ending at approximately 40.
**Llama-3-70B Chart (Right):**
* **Q-Anchored (PopQA):** The blue line starts at approximately 10, rises sharply to around 90 by layer 5, then fluctuates between 60 and 90 for the remainder of the layers, ending at approximately 80.
* **A-Anchored (PopQA):** The light brown line starts at approximately 10, rises to around 40 by layer 5, and remains relatively stable between 30 and 50 for the rest of the layers, ending at approximately 40.
* **Q-Anchored (TriviaQA):** The purple line starts at approximately 10, rises to around 80 by layer 5, and fluctuates between 60 and 90 for the remainder of the layers, ending at approximately 75.
* **A-Anchored (TriviaQA):** The light purple line starts at approximately 10, rises to around 40 by layer 5, and remains relatively stable between 30 and 50 for the rest of the layers, ending at approximately 40.
* **Q-Anchored (HotpotQA):** The green line starts at approximately 10, rises to around 85 by layer 5, and fluctuates between 60 and 90 for the remainder of the layers, ending at approximately 80.
* **A-Anchored (HotpotQA):** The light green line starts at approximately 10, rises to around 40 by layer 5, and remains relatively stable between 30 and 50 for the rest of the layers, ending at approximately 40.
* **Q-Anchored (NQ):** The red line starts at approximately 10, rises to around 60 by layer 5, and fluctuates between 40 and 70 for the remainder of the layers, ending at approximately 60.
* **A-Anchored (NQ):** The orange line starts at approximately 10, rises to around 40 by layer 5, and remains relatively stable between 30 and 50 for the rest of the layers, ending at approximately 40.
### Key Observations
* For both models, the "Q-Anchored" lines consistently exhibit higher answer accuracy than the corresponding "A-Anchored" lines across all datasets.
* The answer accuracy generally increases rapidly in the initial layers (up to layer 5) for all datasets and anchoring methods.
* After the initial increase, the answer accuracy tends to plateau and fluctuate across the middle and later layers.
* The 70B model shows similar trends to the 8B model, but extends to a larger number of layers (80 vs 30).
* PopQA, TriviaQA, and HotpotQA datasets generally achieve higher accuracy than the NQ dataset.
### Interpretation
The data suggests that answers on Q-Anchored instances are decoded with consistently higher accuracy than answers on A-Anchored instances for both Llama models, indicating that Q-Anchored instances tend to lie within the model's parametric knowledge. The rapid rise over the first few layers suggests that the relevant knowledge is assembled early in the network, after which accuracy plateaus: deeper layers contribute little further answer information. Differences between datasets may reflect their complexity and their coverage in pre-training data. The 70B model spans far more layers, but the overall trends match the 8B model.
</details>
<details>
<summary>x59.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Mistral Models
### Overview
This image presents two line charts side-by-side, comparing the answer accuracy of the Mistral-7B-v0.1 and Mistral-7B-v0.3 models across different layers. The x-axis represents the layer number (from 0 to 30), and the y-axis represents the answer accuracy (from 0 to 100). Each chart displays multiple lines, each representing a different question-answering dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (0 to 30, with increments of approximately 2-3)
* **Y-axis:** Answer Accuracy (0 to 100, with increments of 10)
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom):**
* Blue Solid Line: Q-Anchored (PopQA)
* Orange Dotted Line: A-Anchored (PopQA)
* Green Solid Line: Q-Anchored (TriviaQA)
* Red Dotted Line: A-Anchored (TriviaQA)
* Purple Dashed Line: Q-Anchored (HotpotQA)
* Teal Dashed Line: A-Anchored (HotpotQA)
* Gray Solid Line: Q-Anchored (NQ)
* Brown Dotted Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 0% accuracy at layer 0, rises sharply to around 80-90% by layer 5, then fluctuates between 70-95% for the remainder of the layers.
* **A-Anchored (PopQA) - Orange Dotted Line:** Starts at approximately 0% accuracy at layer 0, rises to around 40-50% by layer 5, and remains relatively stable between 30-60% for the rest of the layers.
* **Q-Anchored (TriviaQA) - Green Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to around 80-90% by layer 5, and fluctuates between 70-95% for the remainder of the layers.
* **A-Anchored (TriviaQA) - Red Dotted Line:** Starts at approximately 0% accuracy at layer 0, rises to around 40-50% by layer 5, and remains relatively stable between 30-60% for the rest of the layers.
* **Q-Anchored (HotpotQA) - Purple Dashed Line:** Starts at approximately 0% accuracy at layer 0, rises to around 80-90% by layer 5, and fluctuates between 70-95% for the remainder of the layers.
* **A-Anchored (HotpotQA) - Teal Dashed Line:** Starts at approximately 0% accuracy at layer 0, rises to around 40-50% by layer 5, and remains relatively stable between 30-60% for the rest of the layers.
* **Q-Anchored (NQ) - Gray Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to around 80-90% by layer 5, and fluctuates between 70-95% for the remainder of the layers.
* **A-Anchored (NQ) - Brown Dotted Line:** Starts at approximately 0% accuracy at layer 0, rises to around 40-50% by layer 5, and remains relatively stable between 30-60% for the rest of the layers.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 0% accuracy at layer 0, rises sharply to around 80-90% by layer 5, then fluctuates between 70-95% for the remainder of the layers.
* **A-Anchored (PopQA) - Orange Dotted Line:** Starts at approximately 0% accuracy at layer 0, rises to around 40-50% by layer 5, and remains relatively stable between 30-60% for the rest of the layers.
* **Q-Anchored (TriviaQA) - Green Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to around 80-90% by layer 5, and fluctuates between 70-95% for the remainder of the layers.
* **A-Anchored (TriviaQA) - Red Dotted Line:** Starts at approximately 0% accuracy at layer 0, rises to around 40-50% by layer 5, and remains relatively stable between 30-60% for the rest of the layers.
* **Q-Anchored (HotpotQA) - Purple Dashed Line:** Starts at approximately 0% accuracy at layer 0, rises to around 80-90% by layer 5, and fluctuates between 70-95% for the remainder of the layers.
* **A-Anchored (HotpotQA) - Teal Dashed Line:** Starts at approximately 0% accuracy at layer 0, rises to around 40-50% by layer 5, and remains relatively stable between 30-60% for the rest of the layers.
* **Q-Anchored (NQ) - Gray Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to around 80-90% by layer 5, and fluctuates between 70-95% for the remainder of the layers.
* **A-Anchored (NQ) - Brown Dotted Line:** Starts at approximately 0% accuracy at layer 0, rises to around 40-50% by layer 5, and remains relatively stable between 30-60% for the rest of the layers.
### Key Observations
* The Q-Anchored lines consistently achieve significantly higher accuracy than the A-Anchored lines across all datasets and for both models.
* Accuracy generally increases rapidly in the initial layers (0-5) and then plateaus with some fluctuations.
* The two models (v0.1 and v0.3) exhibit very similar performance patterns.
* The accuracy ranges for the Q-Anchored lines are similar across different datasets (PopQA, TriviaQA, HotpotQA, NQ).
* The accuracy ranges for the A-Anchored lines are similar across different datasets (PopQA, TriviaQA, HotpotQA, NQ).
### Interpretation
The data suggests that answers on Q-Anchored instances are decoded far more accurately than answers on A-Anchored instances in both Mistral models. Both models show the same depth profile: a rapid rise in accuracy over the early layers followed by a plateau with some fluctuation, and this pattern holds across all four datasets, so the observed trends are not specific to a particular question-answering task. The persistently low accuracy for A-Anchored instances suggests that the model lacks reliable parametric knowledge for these instances, consistent with their truthfulness evidence being derived from the answer itself rather than from stored knowledge. The near-identical curves for v0.1 and v0.3 indicate that the two versions share very similar internal representations in this respect.
</details>
Figure 24: Comparisons of answer accuracy between pathways, probing attention activations of the final token.
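The pathway-versus-accuracy comparison in these figures can be illustrated with a small self-contained sketch: at each layer, fit a linear probe on (stand-in) attention activations to separate the two pathways, then measure answer accuracy within each predicted group. Everything below is hypothetical, not the paper's actual code: the activations are synthetic, the correctness rates (0.85 for Q-Anchored, 0.40 for A-Anchored) are assumed for illustration, and a simple nearest-centroid probe stands in for whatever probe the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_activations(n, d, is_q_anchored, layer):
    # Hypothetical stand-in for attention activations of the final token.
    # Q-Anchored examples carry a pathway-specific direction whose strength
    # grows over the first few layers, mimicking the early-layer rise in Fig. 24.
    signal = min(layer / 5.0, 1.0) * (2.0 if is_q_anchored else -2.0)
    x = rng.normal(0.0, 1.0, size=(n, d))
    x[:, 0] += signal
    return x

def pathway_accuracy_curve(n_layers=12, n=200, d=16):
    """For each layer: fit a nearest-centroid probe separating the two
    pathways, then report answer accuracy within each predicted group."""
    curve = []
    for layer in range(n_layers):
        xq = synthetic_activations(n, d, True, layer)
        xa = synthetic_activations(n, d, False, layer)
        # Assumed ground-truth correctness rates per pathway (illustrative only):
        # Q-Anchored answers are correct far more often, as in the figures.
        yq = rng.random(n) < 0.85
        ya = rng.random(n) < 0.40
        x = np.vstack([xq, xa])
        correct = np.concatenate([yq, ya])
        is_q = np.concatenate([np.ones(n, bool), np.zeros(n, bool)])
        # Nearest-centroid linear probe on the pathway label.
        cq, ca = x[is_q].mean(0), x[~is_q].mean(0)
        pred_q = np.linalg.norm(x - cq, axis=1) < np.linalg.norm(x - ca, axis=1)
        acc_q = correct[pred_q].mean() if pred_q.any() else 0.0
        acc_a = correct[~pred_q].mean() if (~pred_q).any() else 0.0
        curve.append((layer, acc_q, acc_a))
    return curve

curve = pathway_accuracy_curve()
```

Under these assumptions the curve reproduces the qualitative shape of the figures: near layer 0 the probe cannot separate the pathways and the two accuracies are close, while from around layer 5 onward the predicted Q-Anchored group sits well above the predicted A-Anchored group.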
<details>
<summary>x60.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of different question-answering datasets (PopQA, TriviaQA, HotpotQA, and NQ) across layers in two Llama models: Llama-3.2-1B and Llama-3.2-3B. The charts display accuracy as a function of layer number, with separate lines for question-anchored (Q-Anchored) and answer-anchored (A-Anchored) approaches within each dataset.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 15 for the 1B model and 0 to 25 for the 3B model).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3.2-1B
* **Right Chart Title:** Llama-3.2-3B
* **Legend:**
* Blue Line: Q-Anchored (PopQA)
* Orange Line: A-Anchored (PopQA)
* Green Line: Q-Anchored (TriviaQA)
* Purple Line: A-Anchored (TriviaQA)
* Brown Dashed Line: Q-Anchored (HotpotQA)
* Red Dashed Line: A-Anchored (HotpotQA)
* Light Blue Line: Q-Anchored (NQ)
* Light Orange Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart:**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 60, dips to around 20 at layer 3, then fluctuates between 40 and 80, ending around 70 at layer 15.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 55, decreases to around 30 at layer 3, remains relatively stable between 30 and 50, and ends around 40 at layer 15.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 70, fluctuates between 60 and 90, and ends around 80 at layer 15.
* **A-Anchored (TriviaQA) - Purple Line:** Starts at approximately 60, dips to around 30 at layer 3, then rises to around 70, and ends around 60 at layer 15.
* **Q-Anchored (HotpotQA) - Brown Dashed Line:** Starts at approximately 50, fluctuates between 40 and 70, and ends around 60 at layer 15.
* **A-Anchored (HotpotQA) - Red Dashed Line:** Starts at approximately 40, fluctuates between 20 and 50, and ends around 30 at layer 15.
* **Q-Anchored (NQ) - Light Blue Line:** Starts at approximately 40, fluctuates between 20 and 60, and ends around 50 at layer 15.
* **A-Anchored (NQ) - Light Orange Line:** Starts at approximately 30, fluctuates between 10 and 40, and ends around 30 at layer 15.
**Llama-3.2-3B Chart:**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 60, dips to around 20 at layer 3, rises to around 90 at layer 10, and ends around 70 at layer 25.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 55, decreases to around 30 at layer 3, remains relatively stable between 30 and 50, and ends around 40 at layer 25.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 70, fluctuates between 60 and 90, and ends around 80 at layer 25.
* **A-Anchored (TriviaQA) - Purple Line:** Starts at approximately 60, dips to around 30 at layer 3, then rises to around 70, and ends around 60 at layer 25.
* **Q-Anchored (HotpotQA) - Brown Dashed Line:** Starts at approximately 50, fluctuates between 40 and 70, and ends around 60 at layer 25.
* **A-Anchored (HotpotQA) - Red Dashed Line:** Starts at approximately 40, fluctuates between 20 and 50, and ends around 30 at layer 25.
* **Q-Anchored (NQ) - Light Blue Line:** Starts at approximately 40, fluctuates between 20 and 60, and ends around 50 at layer 25.
* **A-Anchored (NQ) - Light Orange Line:** Starts at approximately 30, fluctuates between 10 and 40, and ends around 30 at layer 25.
### Key Observations
* In both models, Q-Anchored (PopQA) generally exhibits higher accuracy than A-Anchored (PopQA).
* TriviaQA consistently shows high accuracy for both Q-Anchored and A-Anchored approaches.
* HotpotQA and NQ generally have lower accuracy compared to PopQA and TriviaQA.
* The 3B model shows a more pronounced accuracy increase around layer 10 for PopQA (Q-Anchored) compared to the 1B model.
* The A-Anchored lines are generally lower in accuracy than the Q-Anchored lines across all datasets.
### Interpretation
Across both Llama models, answers identified as Question-Anchored are generally more likely to be correct than Answer-Anchored ones, in line with the association between the Q-Anchored pathway and the model's knowledge boundary. Differences between datasets reflect their varying difficulty, with TriviaQA easiest and HotpotQA and NQ posing harder, more complex questions. Since the layer axis indexes where activations are probed in a frozen model, the fluctuations indicate that different depths separate the two pathways with varying reliability rather than that performance improves monotonically with depth; the sharper rise around layer 10 for PopQA in the 3B model suggests that larger models may encode the pathway distinction most strongly at intermediate layers. The broadly similar patterns across the 1B and 3B models indicate that the pathway-accuracy relationship is robust to model scale.
</details>
<details>
<summary>x61.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of different question-answering (QA) datasets across layers of two Llama models: Llama-3-8B and Llama-3-70B. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each line represents a specific QA dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3-8B
* **Right Chart Title:** Llama-3-70B
* **Legend:** Located at the bottom of the image. The legend identifies the following lines:
* Q-Anchored (PopQA) - Blue solid line
* A-Anchored (PopQA) - Orange dashed line
* Q-Anchored (TriviaQA) - Purple solid line
* A-Anchored (TriviaQA) - Brown dashed line
* Q-Anchored (HotpotQA) - Green dashed-dotted line
* A-Anchored (HotpotQA) - Red dashed line
* Q-Anchored (NQ) - Teal solid line
* A-Anchored (NQ) - Light-orange dashed line
### Detailed Analysis or Content Details
**Llama-3-8B (Left Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 10% accuracy at layer 0, rapidly increases to around 90% by layer 5, fluctuates between 80-100% for layers 5-30.
* **A-Anchored (PopQA):** Starts at approximately 20% accuracy at layer 0, increases to around 50% by layer 5, and remains relatively stable between 40-60% for layers 5-30.
* **Q-Anchored (TriviaQA):** Starts at approximately 20% accuracy at layer 0, increases to around 90% by layer 5, and fluctuates between 70-100% for layers 5-30.
* **A-Anchored (TriviaQA):** Starts at approximately 20% accuracy at layer 0, increases to around 40% by layer 5, and remains relatively stable between 30-50% for layers 5-30.
* **Q-Anchored (HotpotQA):** Starts at approximately 0% accuracy at layer 0, increases to around 60% by layer 5, and fluctuates between 40-80% for layers 5-30.
* **A-Anchored (HotpotQA):** Starts at approximately 0% accuracy at layer 0, increases to around 30% by layer 5, and remains relatively stable between 20-40% for layers 5-30.
* **Q-Anchored (NQ):** Starts at approximately 10% accuracy at layer 0, increases to around 80% by layer 5, and fluctuates between 60-90% for layers 5-30.
* **A-Anchored (NQ):** Starts at approximately 10% accuracy at layer 0, increases to around 40% by layer 5, and remains relatively stable between 30-50% for layers 5-30.
**Llama-3-70B (Right Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 10% accuracy at layer 0, rapidly increases to around 90% by layer 10, fluctuates between 70-100% for layers 10-80.
* **A-Anchored (PopQA):** Starts at approximately 20% accuracy at layer 0, increases to around 50% by layer 10, and remains relatively stable between 40-60% for layers 10-80.
* **Q-Anchored (TriviaQA):** Starts at approximately 20% accuracy at layer 0, increases to around 90% by layer 10, and fluctuates between 70-100% for layers 10-80.
* **A-Anchored (TriviaQA):** Starts at approximately 20% accuracy at layer 0, increases to around 40% by layer 10, and remains relatively stable between 30-50% for layers 10-80.
* **Q-Anchored (HotpotQA):** Starts at approximately 0% accuracy at layer 0, increases to around 60% by layer 10, and fluctuates between 40-80% for layers 10-80.
* **A-Anchored (HotpotQA):** Starts at approximately 0% accuracy at layer 0, increases to around 30% by layer 10, and remains relatively stable between 20-40% for layers 10-80.
* **Q-Anchored (NQ):** Starts at approximately 10% accuracy at layer 0, increases to around 80% by layer 10, and fluctuates between 60-90% for layers 10-80.
* **A-Anchored (NQ):** Starts at approximately 10% accuracy at layer 0, increases to around 40% by layer 10, and remains relatively stable between 30-50% for layers 10-80.
### Key Observations
* **Q-Anchored consistently outperforms A-Anchored** across all datasets and both models.
* **PopQA and TriviaQA generally achieve higher accuracy** than HotpotQA and NQ.
* **Llama-3-70B exhibits more pronounced fluctuations** in accuracy across layers compared to Llama-3-8B.
* The accuracy for most datasets plateaus after a certain number of layers (around 5-10 for Llama-3-8B and 10-20 for Llama-3-70B).
### Interpretation
These charts extend the pathway comparison to larger Llama models (8B and 70B parameters). The consistent gap in favor of Q-Anchored examples across all datasets and both scales supports the link between the Question-Anchored pathway and questions within the model's knowledge boundary. The higher accuracy on PopQA and TriviaQA likely reflects that these questions fall more often within the models' parametric knowledge, whereas HotpotQA and NQ pose harder questions. The plateau after the early layers (roughly layers 5-10 for the 8B model and 10-20 for the 70B model) suggests that the pathway distinction is established early and remains decodable throughout the network; since the layers of a frozen model are being probed rather than trained, the plateau and fluctuations reflect where pathway information is encoded, not overfitting. The larger fluctuations in the 70B model may simply reflect its much greater depth rather than a change in the underlying mechanism.
</details>
<details>
<summary>x62.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Mistral Models
### Overview
This image presents two line charts, side-by-side, comparing the answer accuracy of the Mistral-7B-v0.1 and Mistral-7B-v0.3 models across different layers. The x-axis represents the layer number (from 0 to 30), and the y-axis represents the answer accuracy (from 0 to 100). Each chart displays multiple lines, each representing a different question-answering dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (0 to 30, with tick marks at integer values)
* **Y-axis:** Answer Accuracy (0 to 100, with tick marks at integer multiples of 20)
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom):**
* Blue Solid Line: Q-Anchored (PopQA)
* Orange Dotted Line: A-Anchored (PopQA)
* Green Solid Line: Q-Anchored (TriviaQA)
* Purple Solid Line: A-Anchored (TriviaQA)
* Brown Dashed Line: Q-Anchored (HotpotQA)
* Teal Solid Line: A-Anchored (HotpotQA)
* Red Dotted Line: Q-Anchored (NQ)
* Yellow Solid Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 5% accuracy at layer 0, rises to a peak of around 95% at layer 6, fluctuates between 60% and 90% for layers 6-20, then gradually increases to approximately 90% at layer 30.
* **A-Anchored (PopQA) - Orange Dotted Line:** Starts at approximately 5% accuracy at layer 0, rises to a peak of around 65% at layer 4, then fluctuates between 30% and 60% for layers 4-30.
* **Q-Anchored (TriviaQA) - Green Solid Line:** Starts at approximately 10% accuracy at layer 0, rises to a peak of around 95% at layer 5, fluctuates between 60% and 90% for layers 5-20, then gradually increases to approximately 95% at layer 30.
* **A-Anchored (TriviaQA) - Purple Solid Line:** Starts at approximately 10% accuracy at layer 0, rises to a peak of around 70% at layer 4, then fluctuates between 30% and 60% for layers 4-30.
* **Q-Anchored (HotpotQA) - Brown Dashed Line:** Starts at approximately 0% accuracy at layer 0, rises to a peak of around 80% at layer 6, fluctuates between 40% and 80% for layers 6-20, then gradually increases to approximately 75% at layer 30.
* **A-Anchored (HotpotQA) - Teal Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to a peak of around 50% at layer 4, then fluctuates between 20% and 50% for layers 4-30.
* **Q-Anchored (NQ) - Red Dotted Line:** Starts at approximately 0% accuracy at layer 0, rises to a peak of around 60% at layer 6, fluctuates between 20% and 60% for layers 6-20, then gradually increases to approximately 50% at layer 30.
* **A-Anchored (NQ) - Yellow Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to a peak of around 40% at layer 4, then fluctuates between 20% and 40% for layers 4-30.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 5% accuracy at layer 0, rises to a peak of around 95% at layer 6, fluctuates between 60% and 90% for layers 6-20, then gradually increases to approximately 95% at layer 30.
* **A-Anchored (PopQA) - Orange Dotted Line:** Starts at approximately 5% accuracy at layer 0, rises to a peak of around 65% at layer 4, then fluctuates between 30% and 60% for layers 4-30.
* **Q-Anchored (TriviaQA) - Green Solid Line:** Starts at approximately 10% accuracy at layer 0, rises to a peak of around 95% at layer 5, fluctuates between 60% and 90% for layers 5-20, then gradually increases to approximately 95% at layer 30.
* **A-Anchored (TriviaQA) - Purple Solid Line:** Starts at approximately 10% accuracy at layer 0, rises to a peak of around 70% at layer 4, then fluctuates between 30% and 60% for layers 4-30.
* **Q-Anchored (HotpotQA) - Brown Dashed Line:** Starts at approximately 0% accuracy at layer 0, rises to a peak of around 80% at layer 6, fluctuates between 40% and 80% for layers 6-20, then gradually increases to approximately 75% at layer 30.
* **A-Anchored (HotpotQA) - Teal Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to a peak of around 50% at layer 4, then fluctuates between 20% and 50% for layers 4-30.
* **Q-Anchored (NQ) - Red Dotted Line:** Starts at approximately 0% accuracy at layer 0, rises to a peak of around 60% at layer 6, fluctuates between 20% and 60% for layers 6-20, then gradually increases to approximately 50% at layer 30.
* **A-Anchored (NQ) - Yellow Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to a peak of around 40% at layer 4, then fluctuates between 20% and 40% for layers 4-30.
### Key Observations
* For both models, the Q-Anchored lines generally exhibit higher accuracy than the A-Anchored lines across all datasets.
* The accuracy tends to peak in the early layers (around layers 5-6) and then fluctuates.
* PopQA and TriviaQA datasets show higher accuracy compared to HotpotQA and NQ datasets.
* The two charts (v0.1 and v0.3) are visually very similar, suggesting that the improvement from v0.1 to v0.3 is not dramatically reflected in these accuracy curves.
### Interpretation
As with the final-token probes, answers assigned to the Question-Anchored pathway are markedly more likely to be correct than Answer-Anchored ones, and the gap holds across all four datasets. The higher accuracy on PopQA and TriviaQA relative to HotpotQA and NQ reflects the varying difficulty of the tasks for these models. The near-identical curves for v0.1 and v0.3 suggest that the pathway structure is stable across Mistral versions rather than sensitive to a particular checkpoint, and the fluctuations after the early-layer peak indicate that the separability of the two pathways varies with probing depth rather than degrading systematically.
</details>
Figure 25: Comparisons of answer accuracy between pathways, probing attention activations of the token immediately preceding the exact answer tokens.
<details>
<summary>x63.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of different question-answering (QA) datasets across layers in two Llama models: Llama-3.2-1B and Llama-3.2-3B. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each line represents a different QA dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (ranging from 0 to 15 for Llama-3.2-1B and 0 to 25 for Llama-3.2-3B).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3.2-1B
* **Right Chart Title:** Llama-3.2-3B
* **Legend:**
* Blue Line: Q-Anchored (PopQA)
* Orange Line: A-Anchored (PopQA)
* Green Line: Q-Anchored (TriviaQA)
* Light Blue Line: A-Anchored (TriviaQA)
* Purple Dashed Line: Q-Anchored (HotpotQA)
* Red Dashed Line: A-Anchored (HotpotQA)
* Gray Line: Q-Anchored (NQ)
* Brown Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart:**
* **Q-Anchored (PopQA) (Blue Line):** Starts at approximately 20 accuracy at layer 0, rises sharply to a peak of around 95 accuracy at layer 5, fluctuates between 80 and 95 accuracy until layer 12, and then declines to approximately 70 accuracy at layer 15.
* **A-Anchored (PopQA) (Orange Line):** Starts at approximately 30 accuracy at layer 0, rises to a peak of around 55 accuracy at layer 2, remains relatively stable between 45 and 60 accuracy until layer 10, and then declines to approximately 40 accuracy at layer 15.
* **Q-Anchored (TriviaQA) (Green Line):** Starts at approximately 30 accuracy at layer 0, rises to a peak of around 90 accuracy at layer 5, fluctuates between 75 and 90 accuracy until layer 12, and then declines to approximately 75 accuracy at layer 15.
* **A-Anchored (TriviaQA) (Light Blue Line):** Starts at approximately 35 accuracy at layer 0, rises to a peak of around 65 accuracy at layer 2, remains relatively stable between 50 and 65 accuracy until layer 10, and then declines to approximately 50 accuracy at layer 15.
* **Q-Anchored (HotpotQA) (Purple Dashed Line):** Starts at approximately 40 accuracy at layer 0, rises to a peak of around 85 accuracy at layer 5, fluctuates between 70 and 85 accuracy until layer 12, and then declines to approximately 70 accuracy at layer 15.
* **A-Anchored (HotpotQA) (Red Dashed Line):** Starts at approximately 45 accuracy at layer 0, rises to a peak of around 60 accuracy at layer 2, remains relatively stable between 50 and 60 accuracy until layer 10, and then declines to approximately 45 accuracy at layer 15.
* **Q-Anchored (NQ) (Gray Line):** Starts at approximately 25 accuracy at layer 0, rises to a peak of around 70 accuracy at layer 5, fluctuates between 60 and 70 accuracy until layer 12, and then declines to approximately 60 accuracy at layer 15.
* **A-Anchored (NQ) (Brown Line):** Starts at approximately 35 accuracy at layer 0, rises to a peak of around 50 accuracy at layer 2, remains relatively stable between 40 and 50 accuracy until layer 10, and then declines to approximately 40 accuracy at layer 15.
**Llama-3.2-3B Chart:**
* **Q-Anchored (PopQA) (Blue Line):** Starts at approximately 20 accuracy at layer 0, rises sharply to a peak of around 95 accuracy at layer 5, fluctuates between 85 and 95 accuracy until layer 20, and then declines to approximately 80 accuracy at layer 25.
* **A-Anchored (PopQA) (Orange Line):** Starts at approximately 30 accuracy at layer 0, rises to a peak of around 55 accuracy at layer 2, remains relatively stable between 45 and 60 accuracy until layer 15, and then declines to approximately 40 accuracy at layer 25.
* **Q-Anchored (TriviaQA) (Green Line):** Starts at approximately 30 accuracy at layer 0, rises to a peak of around 90 accuracy at layer 5, fluctuates between 75 and 90 accuracy until layer 20, and then declines to approximately 75 accuracy at layer 25.
* **A-Anchored (TriviaQA) (Light Blue Line):** Starts at approximately 35 accuracy at layer 0, rises to a peak of around 65 accuracy at layer 2, remains relatively stable between 50 and 65 accuracy until layer 15, and then declines to approximately 50 accuracy at layer 25.
* **Q-Anchored (HotpotQA) (Purple Dashed Line):** Starts at approximately 40 accuracy at layer 0, rises to a peak of around 85 accuracy at layer 5, fluctuates between 70 and 85 accuracy until layer 20, and then declines to approximately 70 accuracy at layer 25.
* **A-Anchored (HotpotQA) (Red Dashed Line):** Starts at approximately 45 accuracy at layer 0, rises to a peak of around 60 accuracy at layer 2, remains relatively stable between 50 and 60 accuracy until layer 15, and then declines to approximately 45 accuracy at layer 25.
* **Q-Anchored (NQ) (Gray Line):** Starts at approximately 25 accuracy at layer 0, rises to a peak of around 70 accuracy at layer 5, fluctuates between 60 and 70 accuracy until layer 20, and then declines to approximately 60 accuracy at layer 25.
* **A-Anchored (NQ) (Brown Line):** Starts at approximately 35 accuracy at layer 0, rises to a peak of around 50 accuracy at layer 2, remains relatively stable between 40 and 50 accuracy until layer 15, and then declines to approximately 40 accuracy at layer 25.
### Key Observations
* Generally, the Q-Anchored lines exhibit higher accuracy than the A-Anchored lines across all datasets and models.
* Accuracy tends to peak around layer 5 for both models and then plateaus or slightly declines.
* The Llama-3.2-3B model generally maintains higher accuracy levels across all datasets and layers compared to the Llama-3.2-1B model.
* PopQA, TriviaQA, and HotpotQA datasets show higher accuracy compared to NQ.
### Interpretation
The charts show that the accuracy gap between Q-Anchored and A-Anchored examples persists when probing the token immediately preceding the answer, again consistent with the Question-Anchored pathway aligning with the model's knowledge boundary. The 3B model maintains somewhat higher accuracy than the 1B model across datasets and layers, and the ordering of datasets (PopQA, TriviaQA, and HotpotQA above NQ) reflects their relative difficulty. The rise to a peak around layer 5 followed by a plateau or mild decline suggests that the pathway distinction is encoded most strongly at intermediate layers of the frozen model rather than strengthening monotonically with depth; the remaining differences between datasets likely stem from variations in question complexity and the reasoning each requires.
</details>
<details>
<summary>x64.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of different question-answering (QA) datasets across layers of two Llama models: Llama-3-8B and Llama-3-70B. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each line represents a specific QA dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3-8B
* **Right Chart Title:** Llama-3-70B
* **Legend:**
* Q-Anchored (PopQA) - Blue line
* A-Anchored (PopQA) - Light Brown line
* Q-Anchored (TriviaQA) - Green line
* A-Anchored (TriviaQA) - Purple line
* Q-Anchored (HotpotQA) - Dashed Purple line
* A-Anchored (HotpotQA) - Dashed Brown line
* Q-Anchored (NQ) - Light Blue line
* A-Anchored (NQ) - Orange line
### Detailed Analysis or Content Details
**Llama-3-8B Chart:**
* **Q-Anchored (PopQA):** The blue line starts at approximately 10% accuracy at layer 0, rapidly increases to around 90% by layer 5, fluctuates between 80% and 95% for layers 5-25, and then decreases to around 85% by layer 30.
* **A-Anchored (PopQA):** The light brown line starts at approximately 20% accuracy at layer 0, increases to around 40% by layer 5, and remains relatively stable between 30% and 50% for the rest of the layers.
* **Q-Anchored (TriviaQA):** The green line starts at approximately 20% accuracy at layer 0, increases to around 90% by layer 5, and fluctuates between 80% and 95% for layers 5-25, and then decreases to around 80% by layer 30.
* **A-Anchored (TriviaQA):** The purple line starts at approximately 20% accuracy at layer 0, increases to around 60% by layer 5, and remains relatively stable between 50% and 70% for the rest of the layers.
* **Q-Anchored (HotpotQA):** The dashed purple line starts at approximately 10% accuracy at layer 0, increases to around 80% by layer 5, and fluctuates between 70% and 90% for layers 5-25, and then decreases to around 75% by layer 30.
* **A-Anchored (HotpotQA):** The dashed brown line starts at approximately 10% accuracy at layer 0, increases to around 40% by layer 5, and remains relatively stable between 30% and 50% for the rest of the layers.
* **Q-Anchored (NQ):** The light blue line starts at approximately 10% accuracy at layer 0, increases to around 70% by layer 5, and fluctuates between 60% and 80% for layers 5-25, and then decreases to around 65% by layer 30.
* **A-Anchored (NQ):** The orange line starts at approximately 10% accuracy at layer 0, increases to around 30% by layer 5, and remains relatively stable between 20% and 40% for the rest of the layers.
**Llama-3-70B Chart:**
The trends are similar to the Llama-3-8B chart, but the fluctuations are more pronounced and the layer range is extended to 80. All lines exhibit similar oscillatory behavior, peaking around 80-100% accuracy at various points and dipping to lower values. The A-Anchored lines consistently remain lower in accuracy than the Q-Anchored lines across all datasets.
### Key Observations
* **Q-Anchored consistently outperforms A-Anchored:** Across all datasets and both models, the Q-Anchored lines generally exhibit higher answer accuracy than the A-Anchored lines.
* **Initial Accuracy Increase:** All lines show a significant increase in accuracy within the first 5 layers.
* **Fluctuations:** The accuracy fluctuates significantly across layers, particularly in the Llama-3-70B model.
* **Model Size Impact:** The Llama-3-70B model exhibits more pronounced fluctuations in accuracy compared to the Llama-3-8B model.
### Interpretation
The data again shows that examples routed through the Question-Anchored pathway achieve substantially higher answer accuracy than Answer-Anchored ones in both Llama-3 models. The rapid rise within the first five layers indicates that the activations separating the two pathways emerge early in the network, while the subsequent fluctuations, more pronounced in the 70B model, suggest that pathway information is encoded with varying strength across its many layers; because the layers of a frozen model are being probed, these oscillations cannot be attributed to overfitting or training dynamics. The consistently lower accuracy of A-Anchored examples supports the interpretation that this pathway is associated with questions outside the model's knowledge boundary, where hallucination is more likely.
</details>
<details>
<summary>x65.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Mistral Models
### Overview
This image presents two line charts, side-by-side, comparing the answer accuracy of the Mistral-7B-v0.1 and Mistral-7B-v0.3 models across different layers. The x-axis represents the layer number (from 0 to 30), and the y-axis represents the answer accuracy (from 0 to 100). Each chart displays multiple lines, each representing a different question-answering dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (0 to 30, with tick marks at intervals of 5)
* **Y-axis:** Answer Accuracy (0 to 100, with tick marks at intervals of 20)
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom):**
* Blue Line: Q-Anchored (PopQA)
* Orange Line: A-Anchored (PopQA)
* Green Line: Q-Anchored (TriviaQA)
* Purple Line: A-Anchored (TriviaQA)
* Gray Dashed Line: Q-Anchored (HotpotQA)
* Red Dashed Line: A-Anchored (HotpotQA)
* Light Blue Line: Q-Anchored (NQ)
* Brown Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 80, dips to around 20 at layer 2, fluctuates between 60-90 for layers 3-25, then decreases to around 60 at layer 30.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 40, remains relatively stable between 30-50 for layers 0-25, then decreases to around 30 at layer 30.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 90, dips to around 50 at layer 2, fluctuates between 60-90 for layers 3-25, then decreases to around 60 at layer 30.
* **A-Anchored (TriviaQA) - Purple Line:** Starts at approximately 70, dips to around 30 at layer 2, fluctuates between 40-70 for layers 3-25, then decreases to around 40 at layer 30.
* **Q-Anchored (HotpotQA) - Gray Dashed Line:** Starts at approximately 90, dips to around 40 at layer 2, fluctuates between 60-90 for layers 3-25, then decreases to around 60 at layer 30.
* **A-Anchored (HotpotQA) - Red Dashed Line:** Starts at approximately 50, dips to around 20 at layer 2, fluctuates between 30-50 for layers 3-25, then decreases to around 30 at layer 30.
* **Q-Anchored (NQ) - Light Blue Line:** Starts at approximately 90, dips to around 40 at layer 2, fluctuates between 60-90 for layers 3-25, then decreases to around 60 at layer 30.
* **A-Anchored (NQ) - Brown Line:** Starts at approximately 40, dips to around 20 at layer 2, fluctuates between 30-50 for layers 3-25, then decreases to around 30 at layer 30.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 90, dips to around 30 at layer 2, fluctuates between 60-90 for layers 3-25, then decreases to around 60 at layer 30.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 40, remains relatively stable between 30-50 for layers 0-25, then decreases to around 30 at layer 30.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 90, dips to around 50 at layer 2, fluctuates between 60-90 for layers 3-25, then decreases to around 60 at layer 30.
* **A-Anchored (TriviaQA) - Purple Line:** Starts at approximately 70, dips to around 30 at layer 2, fluctuates between 40-70 for layers 3-25, then decreases to around 40 at layer 30.
* **Q-Anchored (HotpotQA) - Gray Dashed Line:** Starts at approximately 90, dips to around 40 at layer 2, fluctuates between 60-90 for layers 3-25, then decreases to around 60 at layer 30.
* **A-Anchored (HotpotQA) - Red Dashed Line:** Starts at approximately 50, dips to around 20 at layer 2, fluctuates between 30-50 for layers 3-25, then decreases to around 30 at layer 30.
* **Q-Anchored (NQ) - Light Blue Line:** Starts at approximately 90, dips to around 40 at layer 2, fluctuates between 60-90 for layers 3-25, then decreases to around 60 at layer 30.
* **A-Anchored (NQ) - Brown Line:** Starts at approximately 40, dips to around 20 at layer 2, fluctuates between 30-50 for layers 3-25, then decreases to around 30 at layer 30.
### Key Observations
* Both models exhibit a significant dip in accuracy around layer 2 across all datasets and both pathways.
* Examples on the Q-Anchored pathway consistently show higher answer accuracy than those on the A-Anchored pathway, across all datasets for both models.
* After the initial dip, Q-Anchored accuracy generally fluctuates between 60% and 90%.
* A-Anchored accuracy is generally lower, fluctuating between 30% and 50%.
* Mistral-7B-v0.3 generally shows higher initial accuracy than Mistral-7B-v0.1.
### Interpretation
These charts plot, for each layer whose activations are probed, the answer accuracy of examples assigned to the Question-Anchored and Answer-Anchored pathways. Note that the x-axis indexes probing depth within a fixed model, not model size or training progress. Across all four datasets and both Mistral versions, Q-Anchored examples are answered correctly far more often than A-Anchored ones, and the gap persists through the middle and late layers. The shared dip around layer 2 affects both pathways, suggesting that the pathway signals are not yet well formed in the earliest layers. The relative stability of the A-Anchored curves indicates that their (lower) accuracy varies little with probing depth. Overall, the pattern is consistent with the paper's finding that the two pathways are closely tied to the model's knowledge boundaries: examples whose truthfulness cues flow through the question tend to be ones the model answers correctly.
</details>
Figure 26: Comparisons of answer accuracy between pathways, probing attention activations of the last exact answer token.
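The layer-wise probing that these figures summarize can be sketched minimally: fit an independent linear probe on the activations from each layer (here, at a single token position) and record its accuracy per layer. The snippet below is a simplified illustration on synthetic data; the activation shapes, layer count, and plain gradient-descent probe are assumptions for the sketch, not the paper's exact configuration.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad = p - y                            # dL/dlogits for logistic loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy_per_layer(acts, labels):
    """acts: (n_layers, n_examples, d) activations at a chosen token position."""
    accs = []
    for layer_acts in acts:
        w, b = train_linear_probe(layer_acts, labels)
        preds = (layer_acts @ w + b) > 0
        accs.append(float((preds == labels).mean()))
    return accs

# Synthetic stand-in: 8 "layers" of 256 examples with a weakly separable signal.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=256)
acts = rng.normal(size=(8, 256, 16))
acts[:, :, 0] += 2.0 * (labels - 0.5)  # inject a linearly decodable direction
accs = probe_accuracy_per_layer(acts, labels)
```

Plotting `accs` against the layer index for each example group would produce curves of the kind shown in these figures.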
<details>
<summary>x66.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of different question-answering (QA) datasets across layers of two Llama models: Llama-3.2-1B and Llama-3.2-3B. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each chart displays multiple lines, each representing a different QA dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 15 for the 1B model and 0 to 25 for the 3B model).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3.2-1B
* **Right Chart Title:** Llama-3.2-3B
* **Legend:**
* Q-Anchored (PopQA) - Blue solid line
* A-Anchored (PopQA) - Brown solid line
* Q-Anchored (TriviaQA) - Purple solid line
* A-Anchored (TriviaQA) - Green solid line
* Q-Anchored (HotpotQA) - Gray dashed line
* A-Anchored (HotpotQA) - Light Blue dashed line
* Q-Anchored (NQ) - Pink dashed line
* A-Anchored (NQ) - Orange dashed line
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart:**
* **Q-Anchored (PopQA):** Starts at approximately 20, peaks around 95 at layer 2, then fluctuates between 60 and 90, ending around 70 at layer 15.
* **A-Anchored (PopQA):** Remains relatively stable around 40-50 throughout all layers.
* **Q-Anchored (TriviaQA):** Starts at approximately 20, rises to around 80 at layer 3, then fluctuates between 40 and 70, ending around 60 at layer 15.
* **A-Anchored (TriviaQA):** Starts at approximately 10, rises to around 50 at layer 3, then fluctuates between 30 and 50, ending around 40 at layer 15.
* **Q-Anchored (HotpotQA):** Starts at approximately 20, rises to around 60 at layer 3, then fluctuates between 30 and 60, ending around 40 at layer 15.
* **A-Anchored (HotpotQA):** Starts at approximately 10, rises to around 40 at layer 3, then fluctuates between 20 and 40, ending around 30 at layer 15.
* **Q-Anchored (NQ):** Starts at approximately 10, rises to around 50 at layer 3, then fluctuates between 20 and 50, ending around 30 at layer 15.
* **A-Anchored (NQ):** Remains relatively stable around 30-40 throughout all layers.
**Llama-3.2-3B Chart:**
* **Q-Anchored (PopQA):** Starts at approximately 20, peaks around 90 at layer 2, then fluctuates between 60 and 90, ending around 75 at layer 25.
* **A-Anchored (PopQA):** Remains relatively stable around 40-50 throughout all layers.
* **Q-Anchored (TriviaQA):** Starts at approximately 20, rises to around 80 at layer 3, then fluctuates between 40 and 70, ending around 65 at layer 25.
* **A-Anchored (TriviaQA):** Starts at approximately 10, rises to around 50 at layer 3, then fluctuates between 30 and 50, ending around 40 at layer 25.
* **Q-Anchored (HotpotQA):** Starts at approximately 20, rises to around 60 at layer 3, then fluctuates between 30 and 60, ending around 50 at layer 25.
* **A-Anchored (HotpotQA):** Starts at approximately 10, rises to around 40 at layer 3, then fluctuates between 20 and 40, ending around 30 at layer 25.
* **Q-Anchored (NQ):** Starts at approximately 10, rises to around 50 at layer 3, then fluctuates between 20 and 50, ending around 40 at layer 25.
* **A-Anchored (NQ):** Remains relatively stable around 30-40 throughout all layers.
### Key Observations
* **Q-Anchored examples generally show higher answer accuracy than A-Anchored examples** across all QA datasets and both models.
* **PopQA consistently shows the highest accuracy** among the datasets, particularly for Q-Anchored examples.
* **The 3B model generally exhibits slightly higher accuracy** than the 1B model, especially at later layers.
* **Accuracy fluctuates** after the initial rise in the first few layers; probing deeper layers does not yield consistently better results.
* **A-Anchored (PopQA) and A-Anchored (NQ) remain relatively flat** across layers, varying little with probing depth.
### Interpretation
These charts compare the two pathways across probing layers for two model sizes. Q-Anchored examples show consistently higher answer accuracy than A-Anchored examples, and the larger 3B model separates the pathways somewhat more cleanly than the 1B model. The fluctuations after the first few layers indicate that no single probing depth is uniformly best for reading out the signal, and the flat A-Anchored curves for PopQA and NQ show that accuracy for that pathway varies little with layer. As in the other figures, the persistent accuracy gap between pathways is consistent with the association between the Question-Anchored pathway and questions within the model's knowledge boundary.
</details>
<details>
<summary>x67.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
This image presents two line charts comparing the answer accuracy of different question-answering (QA) datasets across layers of two Llama models: Llama-3-8B and Llama-3-70B. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each line represents a specific QA dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3-8B
* **Right Chart Title:** Llama-3-70B
* **Legend:** Located at the bottom of the image, containing the following labels and corresponding line colors:
* Q-Anchored (PopQA) - Blue
* A-Anchored (PopQA) - Orange
* Q-Anchored (TriviaQA) - Green
* A-Anchored (TriviaQA) - Purple
* Q-Anchored (HotpotQA) - Dashed Red
* A-Anchored (HotpotQA) - Dashed Brown
* Q-Anchored (NQ) - Gray
* A-Anchored (NQ) - Light Orange
### Detailed Analysis or Content Details
**Llama-3-8B Chart (Left):**
* **Q-Anchored (PopQA) - Blue:** Starts at approximately 0 accuracy at layer 0, rapidly increases to around 95-100 accuracy by layer 5, and then fluctuates between approximately 70-100 accuracy for the remaining layers.
* **A-Anchored (PopQA) - Orange:** Starts at approximately 0 accuracy at layer 0, gradually increases to around 40 accuracy by layer 5, and then remains relatively stable between approximately 20-40 accuracy for the remaining layers.
* **Q-Anchored (TriviaQA) - Green:** Starts at approximately 0 accuracy at layer 0, increases to around 90-100 accuracy by layer 5, and then fluctuates between approximately 60-100 accuracy for the remaining layers.
* **A-Anchored (TriviaQA) - Purple:** Starts at approximately 0 accuracy at layer 0, increases to around 60 accuracy by layer 5, and then fluctuates between approximately 40-70 accuracy for the remaining layers.
* **Q-Anchored (HotpotQA) - Dashed Red:** Starts at approximately 0 accuracy at layer 0, increases to around 80-90 accuracy by layer 5, and then fluctuates between approximately 50-90 accuracy for the remaining layers.
* **A-Anchored (HotpotQA) - Dashed Brown:** Starts at approximately 0 accuracy at layer 0, increases to around 30 accuracy by layer 5, and then remains relatively stable between approximately 20-40 accuracy for the remaining layers.
* **Q-Anchored (NQ) - Gray:** Starts at approximately 0 accuracy at layer 0, increases to around 80-90 accuracy by layer 5, and then fluctuates between approximately 50-90 accuracy for the remaining layers.
* **A-Anchored (NQ) - Light Orange:** Starts at approximately 0 accuracy at layer 0, gradually increases to around 40 accuracy by layer 5, and then remains relatively stable between approximately 20-40 accuracy for the remaining layers.
**Llama-3-70B Chart (Right):**
* **Q-Anchored (PopQA) - Blue:** Starts at approximately 0 accuracy at layer 0, rapidly increases to around 95-100 accuracy by layer 5, and then fluctuates between approximately 70-100 accuracy for the remaining layers. The pattern is similar to the 8B model, but extends to layer 80.
* **A-Anchored (PopQA) - Orange:** Starts at approximately 0 accuracy at layer 0, gradually increases to around 40 accuracy by layer 5, and then remains relatively stable between approximately 20-40 accuracy for the remaining layers. The pattern is similar to the 8B model, but extends to layer 80.
* **Q-Anchored (TriviaQA) - Green:** Starts at approximately 0 accuracy at layer 0, increases to around 90-100 accuracy by layer 5, and then fluctuates between approximately 60-100 accuracy for the remaining layers. The pattern is similar to the 8B model, but extends to layer 80.
* **A-Anchored (TriviaQA) - Purple:** Starts at approximately 0 accuracy at layer 0, increases to around 60 accuracy by layer 5, and then fluctuates between approximately 40-70 accuracy for the remaining layers. The pattern is similar to the 8B model, but extends to layer 80.
* **Q-Anchored (HotpotQA) - Dashed Red:** Starts at approximately 0 accuracy at layer 0, increases to around 80-90 accuracy by layer 5, and then fluctuates between approximately 50-90 accuracy for the remaining layers. The pattern is similar to the 8B model, but extends to layer 80.
* **A-Anchored (HotpotQA) - Dashed Brown:** Starts at approximately 0 accuracy at layer 0, increases to around 30 accuracy by layer 5, and then remains relatively stable between approximately 20-40 accuracy for the remaining layers. The pattern is similar to the 8B model, but extends to layer 80.
* **Q-Anchored (NQ) - Gray:** Starts at approximately 0 accuracy at layer 0, increases to around 80-90 accuracy by layer 5, and then fluctuates between approximately 50-90 accuracy for the remaining layers. The pattern is similar to the 8B model, but extends to layer 80.
* **A-Anchored (NQ) - Light Orange:** Starts at approximately 0 accuracy at layer 0, gradually increases to around 40 accuracy by layer 5, and then remains relatively stable between approximately 20-40 accuracy for the remaining layers. The pattern is similar to the 8B model, but extends to layer 80.
### Key Observations
* **Q-Anchored consistently outperforms A-Anchored** across all datasets and models.
* **PopQA, TriviaQA, HotpotQA, and NQ all show a similar initial rapid increase in accuracy** up to layer 5.
* **After layer 5, accuracy fluctuates**, indicating that probing deeper layers does not consistently improve the readout of the pathway signal.
* **The 70B model exhibits similar trends to the 8B model**, but extends to a higher layer count (80).
* **A-Anchored accuracy remains relatively low** compared to Q-Anchored, consistently below 40.
### Interpretation
The consistent gap between Q-Anchored and A-Anchored curves shows that examples relying on question–answer information flow are answered correctly far more often than those relying on answer-internal evidence, for both Llama-3-8B and Llama-3-70B. The sharp rise over the first five layers indicates that the pathway signal emerges early and remains readable throughout the network; the subsequent fluctuations reflect layer-to-layer variation in how cleanly the signal can be read out, not changes in the model itself. The 70B chart spans more layers only because that model is deeper, and its trends mirror the 8B model, suggesting the underlying mechanism is consistent across scales. The persistently low A-Anchored accuracy across all layers again points to the link between the Question-Anchored pathway and the model's knowledge boundaries.
</details>
<details>
<summary>x68.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Mistral Models
### Overview
This image presents two line charts, side-by-side, comparing the answer accuracy of the Mistral-7B-v0.1 and Mistral-7B-v0.3 models across different layers. The charts display accuracy as a function of layer number, with separate lines representing different question-answering datasets and anchoring methods. Each chart has a similar structure, with the x-axis representing the layer number and the y-axis representing answer accuracy.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 32).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom):**
* Blue Line: Q-Anchored (PopQA)
* Orange Line: A-Anchored (PopQA)
* Green Line: Q-Anchored (TriviaQA)
* Purple Line: A-Anchored (TriviaQA)
* Gray Dashed Line: Q-Anchored (HotpotQA)
* Gray Line: A-Anchored (HotpotQA)
* Light Blue Line: Q-Anchored (NQ)
* Light Orange Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 0% accuracy, rapidly increases to around 90-95% by layer 5, then fluctuates between 80-95% for the remainder of the layers.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 0% accuracy, increases to around 50% by layer 5, then fluctuates between 20-50% for the remainder of the layers.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 0% accuracy, rapidly increases to around 90-95% by layer 5, then fluctuates between 80-95% for the remainder of the layers.
* **A-Anchored (TriviaQA) - Purple Line:** Starts at approximately 0% accuracy, increases to around 60% by layer 5, then fluctuates between 30-60% for the remainder of the layers.
* **Q-Anchored (HotpotQA) - Gray Dashed Line:** Starts at approximately 0% accuracy, increases to around 60% by layer 5, then fluctuates between 30-60% for the remainder of the layers.
* **A-Anchored (HotpotQA) - Gray Line:** Starts at approximately 0% accuracy, increases to around 40% by layer 5, then fluctuates between 20-40% for the remainder of the layers.
* **Q-Anchored (NQ) - Light Blue Line:** Starts at approximately 0% accuracy, increases to around 80% by layer 5, then fluctuates between 60-80% for the remainder of the layers.
* **A-Anchored (NQ) - Light Orange Line:** Starts at approximately 0% accuracy, increases to around 40% by layer 5, then fluctuates between 20-40% for the remainder of the layers.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 0% accuracy, rapidly increases to around 95-100% by layer 5, then remains consistently high (85-100%) for the remainder of the layers.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 0% accuracy, increases to around 50% by layer 5, then fluctuates between 30-50% for the remainder of the layers.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 0% accuracy, rapidly increases to around 95-100% by layer 5, then remains consistently high (85-100%) for the remainder of the layers.
* **A-Anchored (TriviaQA) - Purple Line:** Starts at approximately 0% accuracy, increases to around 60% by layer 5, then fluctuates between 40-60% for the remainder of the layers.
* **Q-Anchored (HotpotQA) - Gray Dashed Line:** Starts at approximately 0% accuracy, increases to around 60% by layer 5, then fluctuates between 40-60% for the remainder of the layers.
* **A-Anchored (HotpotQA) - Gray Line:** Starts at approximately 0% accuracy, increases to around 40% by layer 5, then fluctuates between 20-40% for the remainder of the layers.
* **Q-Anchored (NQ) - Light Blue Line:** Starts at approximately 0% accuracy, increases to around 85% by layer 5, then fluctuates between 65-85% for the remainder of the layers.
* **A-Anchored (NQ) - Light Orange Line:** Starts at approximately 0% accuracy, increases to around 40% by layer 5, then fluctuates between 20-40% for the remainder of the layers.
### Key Observations
* Both models show a significant increase in accuracy within the first 5 layers.
* Q-Anchored examples consistently show higher answer accuracy than A-Anchored examples across all datasets for both models.
* Mistral-7B-v0.3 generally achieves higher and more stable accuracy than Mistral-7B-v0.1, particularly for the PopQA and TriviaQA datasets.
* The HotpotQA and NQ datasets exhibit lower overall accuracy compared to PopQA and TriviaQA.
* The accuracy curves tend to stabilize after layer 10, with fluctuations around a certain level.
### Interpretation
These charts compare the two pathways across probing layers for two Mistral versions. The rapid rise within the first five layers indicates that the pathway signal becomes readable early in the network, and the consistently higher Q-Anchored curves show that examples on the Question-Anchored pathway are answered correctly far more often than those on the Answer-Anchored pathway.
Mistral-7B-v0.3 yields higher and more stable curves than v0.1, particularly on PopQA and TriviaQA, while HotpotQA and NQ remain harder for both versions. The stabilization after roughly layer 10 suggests that probing deeper layers adds little beyond the mid-network signal.
Overall, the persistent gap between pathways across versions and datasets is again consistent with the association between the Question-Anchored pathway and the model's knowledge boundaries.
</details>
Figure 27: Comparisons of answer accuracy between pathways, probing MLP activations of the final token.
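The "answer accuracy between pathways" curves in these figures can be reproduced from two ingredients: a per-example answer-correctness flag and a per-layer pathway assignment. Below is a hedged sketch of that aggregation; the random assignments and accuracy rates are purely illustrative stand-ins, not the paper's actual procedure.

```python
import numpy as np

def accuracy_by_pathway(pathway, correct):
    """pathway: (n_layers, n_examples) array of 'Q'/'A' pathway assignments.
    correct: (n_examples,) boolean answer-correctness flags.
    Returns per-layer answer accuracy for each pathway group."""
    curves = {"Q": [], "A": []}
    for layer_assign in pathway:
        for tag in ("Q", "A"):
            mask = layer_assign == tag
            # Guard against a pathway group being empty at some layer.
            acc = float(correct[mask].mean()) if mask.any() else float("nan")
            curves[tag].append(acc)
    return curves

# Illustrative inputs: 6 layers, 200 examples; examples that are Q-anchored at
# layer 0 are made correct more often, mimicking the gap seen in the figures.
rng = np.random.default_rng(1)
pathway = rng.choice(["Q", "A"], size=(6, 200))
correct = np.where(pathway[0] == "Q",
                   rng.random(200) < 0.8,   # Q-anchored: ~80% correct
                   rng.random(200) < 0.35)  # A-anchored: ~35% correct
curves = accuracy_by_pathway(pathway, correct)
```

Each list in `curves` corresponds to one line in a chart, plotted against the layer index.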
<details>
<summary>x69.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of different question-answering datasets (PopQA, TriviaQA, HotpotQA, and NQ) across layers of two Llama models: Llama-3.2-1B and Llama-3.2-3B. Each chart displays the accuracy of both Q-Anchored and A-Anchored approaches for each dataset. The charts use shaded areas to represent the variance around the mean accuracy.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 1 to 15 for the 1B model and 1 to 25 for the 3B model).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3.2-1B
* **Right Chart Title:** Llama-3.2-3B
* **Legend:** Located at the bottom of the image.
* Blue Line: Q-Anchored (PopQA)
* Orange Line: A-Anchored (PopQA)
* Purple Line: Q-Anchored (TriviaQA)
* Light Blue Line: A-Anchored (TriviaQA)
* Red Dashed Line: Q-Anchored (HotpotQA)
* Brown Line: A-Anchored (HotpotQA)
* Green Line: Q-Anchored (NQ)
* Light Green Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart (Left)**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 95% accuracy at layer 1, rapidly decreases to around 20% by layer 3, then fluctuates between 30% and 60% for the remaining layers.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 55% accuracy at layer 1, decreases to around 40% by layer 3, and remains relatively stable between 40% and 60% for the rest of the layers.
* **Q-Anchored (TriviaQA) - Purple Line:** Starts at approximately 60% accuracy at layer 1, increases to around 80% by layer 5, then fluctuates between 60% and 90% for the remaining layers.
* **A-Anchored (TriviaQA) - Light Blue Line:** Starts at approximately 45% accuracy at layer 1, increases to around 60% by layer 5, and remains relatively stable between 40% and 70% for the rest of the layers.
* **Q-Anchored (HotpotQA) - Red Dashed Line:** Starts at approximately 40% accuracy at layer 1, fluctuates significantly between 20% and 70% for the rest of the layers.
* **A-Anchored (HotpotQA) - Brown Line:** Starts at approximately 30% accuracy at layer 1, fluctuates significantly between 20% and 50% for the rest of the layers.
* **Q-Anchored (NQ) - Green Line:** Starts at approximately 65% accuracy at layer 1, increases to around 90% by layer 5, then fluctuates between 70% and 95% for the remaining layers.
* **A-Anchored (NQ) - Light Green Line:** Starts at approximately 35% accuracy at layer 1, increases to around 50% by layer 5, and remains relatively stable between 30% and 60% for the rest of the layers.
**Llama-3.2-3B Chart (Right)**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 85% accuracy at layer 1, decreases to around 30% by layer 5, then fluctuates between 40% and 70% for the remaining layers.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 60% accuracy at layer 1, decreases to around 40% by layer 5, and remains relatively stable between 40% and 60% for the rest of the layers.
* **Q-Anchored (TriviaQA) - Purple Line:** Starts at approximately 80% accuracy at layer 1, increases to around 95% by layer 5, then fluctuates between 70% and 90% for the remaining layers.
* **A-Anchored (TriviaQA) - Light Blue Line:** Starts at approximately 50% accuracy at layer 1, increases to around 70% by layer 5, and remains relatively stable between 50% and 80% for the rest of the layers.
* **Q-Anchored (HotpotQA) - Red Dashed Line:** Starts at approximately 50% accuracy at layer 1, fluctuates significantly between 30% and 80% for the rest of the layers.
* **A-Anchored (HotpotQA) - Brown Line:** Starts at approximately 40% accuracy at layer 1, fluctuates significantly between 20% and 60% for the rest of the layers.
* **Q-Anchored (NQ) - Green Line:** Starts at approximately 75% accuracy at layer 1, increases to around 90% by layer 5, then fluctuates between 70% and 95% for the remaining layers.
* **A-Anchored (NQ) - Light Green Line:** Starts at approximately 40% accuracy at layer 1, increases to around 60% by layer 5, and remains relatively stable between 40% and 70% for the rest of the layers.
### Key Observations
* The 3B model generally exhibits higher initial accuracy across all datasets compared to the 1B model.
* The Q-Anchored pathway consistently shows higher answer accuracy than the A-Anchored pathway for most datasets, particularly TriviaQA and NQ.
* HotpotQA shows the most significant fluctuations in accuracy for both models and both anchoring methods.
* PopQA shows a sharp initial drop in accuracy for both models, followed by stabilization.
* The shaded areas indicate a considerable variance in accuracy, suggesting that the performance is not consistently stable across different samples or runs.
### Interpretation
These charts compare the two pathways across probing layers for two Llama-3.2 model sizes. The larger 3B model generally yields higher accuracy, and the Q-Anchored pathway shows higher answer accuracy than the A-Anchored pathway for most datasets, consistent with the other figures. The wide shaded bands indicate substantial variance across samples, so individual layer-to-layer differences should be read cautiously. HotpotQA fluctuates most, plausibly because its multi-hop questions make the pathway signal noisier, while the sharp early drop on PopQA suggests that the signal there is concentrated in the very first layers. Overall, the gap between pathways remains consistent with their association with the model's knowledge boundaries, though the variance shows there is still considerable instability in reading the signal out at any single layer.
</details>
<details>
<summary>x70.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the answer accuracy of two Llama models (Llama-3-8B and Llama-3-70B) across different layers. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each chart displays multiple lines, each representing a different question-answering dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (ranging from 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3-8B
* **Right Chart Title:** Llama-3-70B
* **Legend:**
* Q-Anchored (PopQA) - Blue line
* A-Anchored (PopQA) - Light Brown/Orange dashed line
* Q-Anchored (TriviaQA) - Purple line
* A-Anchored (TriviaQA) - Green line
* Q-Anchored (HotpotQA) - Gray dashed line
* A-Anchored (HotpotQA) - Yellow/Beige line
* Q-Anchored (NQ) - Teal line
* A-Anchored (NQ) - Light Brown/Orange line
### Detailed Analysis or Content Details
**Llama-3-8B Chart (Left):**
* **Q-Anchored (PopQA):** The blue line starts at approximately 5% accuracy at layer 0, rises sharply to around 90% by layer 5, fluctuates between 70% and 95% for layers 5-25, and then declines to around 75% by layer 30.
* **A-Anchored (PopQA):** The light brown dashed line starts at approximately 60% accuracy at layer 0, decreases steadily to around 30% by layer 10, and remains relatively stable around 30-40% for the remaining layers.
* **Q-Anchored (TriviaQA):** The purple line starts at approximately 10% accuracy at layer 0, rises rapidly to around 95% by layer 5, and fluctuates between 80% and 95% for layers 5-30.
* **A-Anchored (TriviaQA):** The green line starts at approximately 20% accuracy at layer 0, rises to around 60% by layer 5, and remains relatively stable around 60-70% for the remaining layers.
* **Q-Anchored (HotpotQA):** The gray dashed line starts at approximately 5% accuracy at layer 0, rises to around 85% by layer 5, and fluctuates between 70% and 90% for layers 5-30.
* **A-Anchored (HotpotQA):** The yellow line starts at approximately 30% accuracy at layer 0, decreases to around 20% by layer 5, and remains relatively stable around 20-30% for the remaining layers.
* **Q-Anchored (NQ):** The teal line starts at approximately 10% accuracy at layer 0, rises to around 90% by layer 5, and fluctuates between 70% and 95% for layers 5-30.
* **A-Anchored (NQ):** The light brown line starts at approximately 40% accuracy at layer 0, decreases to around 30% by layer 5, and remains relatively stable around 30-40% for the remaining layers.
**Llama-3-70B Chart (Right):**
* **Q-Anchored (PopQA):** The blue line starts at approximately 5% accuracy at layer 0, rises sharply to around 90% by layer 5, fluctuates between 70% and 95% for layers 5-60, and then declines to around 75% by layer 80.
* **A-Anchored (PopQA):** The light brown dashed line starts at approximately 60% accuracy at layer 0, decreases steadily to around 30% by layer 10, and remains relatively stable around 30-40% for the remaining layers.
* **Q-Anchored (TriviaQA):** The purple line starts at approximately 10% accuracy at layer 0, rises rapidly to around 95% by layer 5, and fluctuates between 80% and 95% for layers 5-80.
* **A-Anchored (TriviaQA):** The green line starts at approximately 20% accuracy at layer 0, rises to around 60% by layer 5, and remains relatively stable around 60-70% for the remaining layers.
* **Q-Anchored (HotpotQA):** The gray dashed line starts at approximately 5% accuracy at layer 0, rises to around 85% by layer 5, and fluctuates between 70% and 90% for layers 5-80.
* **A-Anchored (HotpotQA):** The yellow line starts at approximately 30% accuracy at layer 0, decreases to around 20% by layer 5, and remains relatively stable around 20-30% for the remaining layers.
* **Q-Anchored (NQ):** The teal line starts at approximately 10% accuracy at layer 0, rises to around 90% by layer 5, and fluctuates between 70% and 95% for layers 5-80.
* **A-Anchored (NQ):** The light brown line starts at approximately 40% accuracy at layer 0, decreases to around 30% by layer 5, and remains relatively stable around 30-40% for the remaining layers.
### Key Observations
* For both models, the "Q-Anchored" lines consistently exhibit higher accuracy than the "A-Anchored" lines across all datasets.
* The accuracy of the "Q-Anchored" lines generally peaks around layer 5 and remains relatively high for subsequent layers.
* The "A-Anchored" lines show a decreasing trend in accuracy after layer 0, stabilizing at a lower level.
* The Llama-3-70B chart covers a longer layer range (80 vs. 30 layers) because the model is deeper; its high-accuracy region accordingly spans more layers.
* The datasets (PopQA, TriviaQA, HotpotQA, NQ) show similar accuracy trends for both anchoring methods within each model.
### Interpretation
The data shows that, for both Llama-3-8B and Llama-3-70B, examples on the Question-Anchored pathway are answered correctly far more often than those on the Answer-Anchored pathway across all four datasets. Q-Anchored accuracy rises sharply over the first five layers and stays high, indicating that the signal is readable from the early-middle layers onward, while A-Anchored accuracy drifts downward and stabilizes at a low level. The 70B chart spans more layers simply because that model is deeper, and its trends mirror the 8B model, suggesting consistent dynamics across scales. The stable ordering across datasets indicates that dataset difficulty shifts absolute accuracy but not the relative behavior of the two pathways, which again fits the paper's link between the Question-Anchored pathway and the model's knowledge boundaries.
</details>
<details>
<summary>x71.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Mistral Models
### Overview
The image presents two line charts, side-by-side, comparing the answer accuracy of the Mistral-7B-v0.1 and Mistral-7B-v0.3 models across different layers. The x-axis represents the layer number (from 0 to 30), and the y-axis represents the answer accuracy (from 0 to 100). Each chart displays multiple lines, each representing a different question-answering dataset and anchoring method.
### Components/Axes
* **X-axis:** Layer (0 to 30, with tick marks at integer values)
* **Y-axis:** Answer Accuracy (0 to 100, with tick marks at integer multiples of 20)
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom-Left):**
* Blue Solid Line: Q-Anchored (PopQA)
* Orange Dashed Line: A-Anchored (PopQA)
* Green Solid Line: Q-Anchored (TriviaQA)
* Purple Solid Line: A-Anchored (TriviaQA)
* Brown Dashed Line: Q-Anchored (HotpotQA)
* Red Dashed Line: A-Anchored (HotpotQA)
* Teal Solid Line: Q-Anchored (NQ)
* Grey Solid Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 5% accuracy at layer 0, rises to a peak of around 95% at layer 6, then fluctuates between 50% and 90% for the remainder of the layers.
* **A-Anchored (PopQA) - Orange Dashed Line:** Starts at approximately 55% accuracy at layer 0, decreases to around 30% by layer 5, and remains relatively stable between 20% and 40% for the rest of the layers.
* **Q-Anchored (TriviaQA) - Green Solid Line:** Starts at approximately 0% accuracy at layer 0, rises rapidly to around 90% by layer 5, and fluctuates between 60% and 95% for the remaining layers.
* **A-Anchored (TriviaQA) - Purple Solid Line:** Starts at approximately 20% accuracy at layer 0, rises to around 70% by layer 5, and fluctuates between 40% and 80% for the remaining layers.
* **Q-Anchored (HotpotQA) - Brown Dashed Line:** Starts at approximately 0% accuracy at layer 0, rises to around 60% by layer 5, and fluctuates between 30% and 70% for the remaining layers.
* **A-Anchored (HotpotQA) - Red Dashed Line:** Starts at approximately 20% accuracy at layer 0, rises to around 40% by layer 5, and remains relatively stable between 20% and 50% for the rest of the layers.
* **Q-Anchored (NQ) - Teal Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to around 80% by layer 5, and fluctuates between 50% and 90% for the remaining layers.
* **A-Anchored (NQ) - Grey Solid Line:** Starts at approximately 20% accuracy at layer 0, rises to around 50% by layer 5, and fluctuates between 30% and 60% for the remaining layers.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA) - Blue Solid Line:** Starts at approximately 5% accuracy at layer 0, rises to a peak of around 95% at layer 6, then fluctuates between 50% and 90% for the remainder of the layers.
* **A-Anchored (PopQA) - Orange Dashed Line:** Starts at approximately 55% accuracy at layer 0, decreases to around 30% by layer 5, and remains relatively stable between 20% and 40% for the rest of the layers.
* **Q-Anchored (TriviaQA) - Green Solid Line:** Starts at approximately 0% accuracy at layer 0, rises rapidly to around 90% by layer 5, and fluctuates between 60% and 95% for the remaining layers.
* **A-Anchored (TriviaQA) - Purple Solid Line:** Starts at approximately 20% accuracy at layer 0, rises to around 70% by layer 5, and fluctuates between 40% and 80% for the remaining layers.
* **Q-Anchored (HotpotQA) - Brown Dashed Line:** Starts at approximately 0% accuracy at layer 0, rises to around 60% by layer 5, and fluctuates between 30% and 70% for the remaining layers.
* **A-Anchored (HotpotQA) - Red Dashed Line:** Starts at approximately 20% accuracy at layer 0, rises to around 40% by layer 5, and remains relatively stable between 20% and 50% for the rest of the layers.
* **Q-Anchored (NQ) - Teal Solid Line:** Starts at approximately 0% accuracy at layer 0, rises to around 80% by layer 5, and fluctuates between 50% and 90% for the remaining layers.
* **A-Anchored (NQ) - Grey Solid Line:** Starts at approximately 20% accuracy at layer 0, rises to around 50% by layer 5, and fluctuates between 30% and 60% for the remaining layers.
### Key Observations
* The Q-Anchored lines generally exhibit higher accuracy than the A-Anchored lines across all datasets and models.
* Accuracy tends to increase rapidly in the initial layers (0-5) for most datasets.
* After layer 5, the accuracy fluctuates significantly, suggesting instability or diminishing returns with increasing layers.
* The two charts (v0.1 and v0.3) are nearly identical, indicating that the model update did not significantly alter the accuracy trends across layers and datasets.
* PopQA and TriviaQA consistently show the highest accuracy, while HotpotQA and NQ show lower accuracy.
### Interpretation
The data suggest that, for both Mistral variants, answers are far more recoverable from Q-Anchored representations than from A-Anchored ones. Probing accuracy rises steeply over roughly the first five layers, indicating that the relevant information becomes linearly accessible early in the forward pass; beyond that point it fluctuates rather than improving, so deeper layers appear to reshuffle rather than add recoverable signal. The near-identical curves for v0.1 and v0.3 indicate that the version update did not change where or how strongly this information is encoded. The gaps between datasets (PopQA and TriviaQA highest, HotpotQA and NQ lower) track task difficulty, but the ordering of the two pathways is stable across all of them.
</details>
Figure 28: Comparisons of answer accuracy between pathways, probing MLP activations of the token immediately preceding the exact answer tokens.
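The curves in these figures come from training one linear probe per layer on cached activations and scoring it on held-out examples. The following is a minimal sketch of that layer-wise probing recipe; the synthetic activations, dimensions, and the binary labels are stand-ins, not the paper's actual setup.

```python
# Hypothetical sketch: train one linear probe per layer on cached MLP
# activations and report per-layer accuracy curves like those in the figure.
# Activations and labels here are synthetic stand-ins for a real cache.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_layers, n_samples, d_model = 16, 400, 64

# Synthetic cache: activations[l][i] is the MLP activation at layer l for
# example i, taken at a single token position (e.g. the token just before
# the answer span).
labels = rng.integers(0, 2, size=n_samples)
activations = rng.normal(size=(n_layers, n_samples, d_model))
# Make middle and late layers weakly predictive so the curve has a shape.
for l in range(4, n_layers):
    activations[l, :, 0] += 2.0 * labels

accuracy_per_layer = []
for l in range(n_layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations[l], labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracy_per_layer.append(probe.score(X_te, y_te))

# Early layers sit near chance; layers carrying the signal score well above it.
print([round(a, 2) for a in accuracy_per_layer])
```

Plotting `accuracy_per_layer` against the layer index reproduces the kind of rise-then-plateau curve described above; comparing caches built under the two anchoring conditions gives the Q-Anchored versus A-Anchored contrast.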
<details>
<summary>x72.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
This image presents two line charts comparing the answer accuracy of different question-answering datasets (PopQA, TriviaQA, HotpotQA, and NQ) across layers in two Llama models: Llama-3.2-1B and Llama-3.2-3B. The charts display accuracy as a function of layer number, with shaded areas representing confidence intervals.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 15 for the 1B model and 0 to 25 for the 3B model).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Llama-3.2-1B
* **Right Chart Title:** Llama-3.2-3B
* **Legend:** Located at the bottom of the image. The legend contains the following entries:
* Blue Solid Line: Q-Anchored (PopQA)
* Orange Solid Line: A-Anchored (PopQA)
* Green Solid Line: Q-Anchored (TriviaQA)
* Light Blue Dashed Line: Q-Anchored (HotpotQA)
* Brown Solid Line: A-Anchored (TriviaQA)
* Gray Solid Line: A-Anchored (HotpotQA)
* Purple Solid Line: Q-Anchored (NQ)
* Red Solid Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart (Left)**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 90% accuracy at layer 0, rises to a peak of around 98% between layers 6 and 10, then declines to approximately 85% at layer 15.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 30% accuracy at layer 0, rises to a peak of around 45% between layers 6 and 8, then declines to approximately 30% at layer 15.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 70% accuracy at layer 0, rises to a peak of around 95% between layers 6 and 10, then declines to approximately 80% at layer 15.
* **Q-Anchored (HotpotQA) - Light Blue Dashed Line:** Starts at approximately 40% accuracy at layer 0, rises to a peak of around 70% between layers 6 and 10, then declines to approximately 50% at layer 15.
* **A-Anchored (TriviaQA) - Brown Line:** Starts at approximately 30% accuracy at layer 0, rises to a peak of around 40% between layers 6 and 8, then declines to approximately 30% at layer 15.
* **A-Anchored (HotpotQA) - Gray Line:** Starts at approximately 20% accuracy at layer 0, rises to a peak of around 35% between layers 6 and 8, then declines to approximately 25% at layer 15.
* **Q-Anchored (NQ) - Purple Line:** Starts at approximately 80% accuracy at layer 0, rises to a peak of around 95% between layers 6 and 10, then declines to approximately 85% at layer 15.
* **A-Anchored (NQ) - Red Line:** Starts at approximately 30% accuracy at layer 0, rises to a peak of around 45% between layers 6 and 8, then declines to approximately 30% at layer 15.
**Llama-3.2-3B Chart (Right)**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 90% accuracy at layer 0, fluctuates between 80% and 95% with peaks around layers 5, 10, 15, and 20, then declines to approximately 80% at layer 25.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 30% accuracy at layer 0, rises to a peak of around 45% between layers 6 and 8, then fluctuates between 30% and 50% and declines to approximately 35% at layer 25.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 70% accuracy at layer 0, rises to a peak of around 95% between layers 6 and 10, then fluctuates between 70% and 90% and declines to approximately 75% at layer 25.
* **Q-Anchored (HotpotQA) - Light Blue Dashed Line:** Starts at approximately 40% accuracy at layer 0, rises to a peak of around 70% between layers 6 and 10, then fluctuates between 50% and 70% and declines to approximately 60% at layer 25.
* **A-Anchored (TriviaQA) - Brown Line:** Starts at approximately 30% accuracy at layer 0, rises to a peak of around 40% between layers 6 and 8, then fluctuates between 30% and 45% and declines to approximately 35% at layer 25.
* **A-Anchored (HotpotQA) - Gray Line:** Starts at approximately 20% accuracy at layer 0, rises to a peak of around 35% between layers 6 and 8, then fluctuates between 20% and 35% and declines to approximately 25% at layer 25.
* **Q-Anchored (NQ) - Purple Line:** Starts at approximately 80% accuracy at layer 0, rises to a peak of around 95% between layers 6 and 10, then fluctuates between 70% and 90% and declines to approximately 75% at layer 25.
* **A-Anchored (NQ) - Red Line:** Starts at approximately 30% accuracy at layer 0, rises to a peak of around 45% between layers 6 and 8, then fluctuates between 30% and 50% and declines to approximately 35% at layer 25.
### Key Observations
* The "Q-Anchored" lines consistently outperform the "A-Anchored" lines across all datasets and models.
* Accuracy generally increases with layer number up to a certain point (around layers 6-10), after which it plateaus or declines.
* The 3B model exhibits more fluctuation in accuracy across layers compared to the 1B model.
* PopQA and NQ datasets generally have higher accuracy scores than TriviaQA and HotpotQA.
### Interpretation
The charts show how recoverable the answer is from MLP activations at each layer. The consistent advantage of the Q-Anchored condition indicates that the question-anchored pathway carries substantially more answer evidence than the answer-anchored one. Probing accuracy peaks in the early-to-middle layers (roughly 6-10) and then declines, suggesting the answer information is most linearly accessible at those depths and is partially transformed away later in the forward pass. The 3B model's noisier curves may reflect its greater depth and capacity, and the dataset gaps (PopQA and NQ above TriviaQA and HotpotQA here) likely track question difficulty. In every case, however, the relative ordering of the two pathways is unchanged, underscoring that the anchoring condition, not the dataset, determines how much evidence the probe can recover.
</details>
<details>
<summary>x73.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Llama Models
### Overview
This image presents two line charts comparing the answer accuracy of different question-answering (QA) methods across layers in two Llama models: Llama-3-8B and Llama-3-70B. The x-axis represents the layer number, and the y-axis represents the answer accuracy, ranging from 0 to 100. Each chart displays multiple lines, each representing a different QA method and anchoring strategy.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Models:** Llama-3-8B (left chart), Llama-3-70B (right chart).
* **QA Methods/Anchoring Strategies (Legend):**
* Q-Anchored (PopQA) - Blue line
* A-Anchored (PopQA) - Orange line
* Q-Anchored (TriviaQA) - Purple line
* A-Anchored (TriviaQA) - Brown line
* Q-Anchored (HotpotQA) - Light Green dashed line
* A-Anchored (HotpotQA) - Yellow dashed line
* Q-Anchored (NQ) - Teal line
* A-Anchored (NQ) - Gray line
### Detailed Analysis or Content Details
**Llama-3-8B (Left Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 0% accuracy at layer 0, rapidly increases to around 90-95% accuracy by layer 10, and remains relatively stable around 90-95% for the rest of the layers.
* **A-Anchored (PopQA):** Starts at approximately 0% accuracy at layer 0, increases to around 40-50% accuracy by layer 10, and plateaus around 40-50% for the remaining layers.
* **Q-Anchored (TriviaQA):** Starts at approximately 0% accuracy at layer 0, increases to around 80-90% accuracy by layer 10, and remains relatively stable around 80-90% for the rest of the layers.
* **A-Anchored (TriviaQA):** Starts at approximately 0% accuracy at layer 0, increases to around 50-60% accuracy by layer 10, and plateaus around 50-60% for the remaining layers.
* **Q-Anchored (HotpotQA):** Starts at approximately 0% accuracy at layer 0, increases to around 80-90% accuracy by layer 10, and remains relatively stable around 80-90% for the rest of the layers.
* **A-Anchored (HotpotQA):** Starts at approximately 0% accuracy at layer 0, increases to around 40-50% accuracy by layer 10, and plateaus around 40-50% for the remaining layers.
* **Q-Anchored (NQ):** Starts at approximately 0% accuracy at layer 0, increases to around 80-90% accuracy by layer 10, and remains relatively stable around 80-90% for the rest of the layers.
* **A-Anchored (NQ):** Starts at approximately 0% accuracy at layer 0, increases to around 40-50% accuracy by layer 10, and plateaus around 40-50% for the remaining layers.
**Llama-3-70B (Right Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 0% accuracy at layer 0, rapidly increases to around 90-95% accuracy by layer 10, and remains relatively stable around 90-95% for the rest of the layers.
* **A-Anchored (PopQA):** Starts at approximately 0% accuracy at layer 0, increases to around 40-50% accuracy by layer 10, and plateaus around 40-50% for the remaining layers.
* **Q-Anchored (TriviaQA):** Starts at approximately 0% accuracy at layer 0, increases to around 80-90% accuracy by layer 10, and remains relatively stable around 80-90% for the rest of the layers.
* **A-Anchored (TriviaQA):** Starts at approximately 0% accuracy at layer 0, increases to around 50-60% accuracy by layer 10, and plateaus around 50-60% for the remaining layers.
* **Q-Anchored (HotpotQA):** Starts at approximately 0% accuracy at layer 0, increases to around 80-90% accuracy by layer 10, and remains relatively stable around 80-90% for the rest of the layers.
* **A-Anchored (HotpotQA):** Starts at approximately 0% accuracy at layer 0, increases to around 40-50% accuracy by layer 10, and plateaus around 40-50% for the remaining layers.
* **Q-Anchored (NQ):** Starts at approximately 0% accuracy at layer 0, increases to around 80-90% accuracy by layer 10, and remains relatively stable around 80-90% for the rest of the layers.
* **A-Anchored (NQ):** Starts at approximately 0% accuracy at layer 0, increases to around 40-50% accuracy by layer 10, and plateaus around 40-50% for the remaining layers.
### Key Observations
* **Q-Anchored methods consistently outperform A-Anchored methods** across all QA datasets and both model sizes.
* **Accuracy generally increases rapidly in the initial layers (0-10)** and then plateaus.
* **The 70B model shows similar trends to the 8B model**, but the x-axis extends to layer 80, indicating a deeper model.
* **PopQA, TriviaQA, HotpotQA, and NQ datasets all exhibit similar accuracy curves** for the Q-Anchored methods.
* **A-Anchored methods consistently plateau at a lower accuracy level** (around 40-60%) compared to Q-Anchored methods (around 80-95%).
### Interpretation
The data indicate that answers are far easier to decode from Q-Anchored activations than from A-Anchored ones, in both model sizes and on every dataset. The rapid rise over the first ten layers suggests that the question-to-answer information flow resolves the answer early in the forward pass, after which the recoverable signal plateaus. The 70B model reproduces the same trends over its greater depth, so scale changes the number of layers but not the shape of the curves. The stable 30-50-point gap between the two conditions across all datasets underscores how much of the probe-recoverable evidence travels through the question context rather than residing in the answer tokens alone.
</details>
<details>
<summary>x74.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy vs. Layer for Mistral Models
### Overview
This image presents two line charts, side-by-side, comparing the answer accuracy of the Mistral-7B-v0.1 and Mistral-7B-v0.3 models across different layers. The charts display accuracy as a function of layer number, with different lines representing different question-answering datasets and anchoring methods.
### Components/Axes
* **X-axis:** Layer (ranging from approximately 0 to 30).
* **Y-axis:** Answer Accuracy (ranging from 0 to 100).
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend:** Located at the bottom of the image, containing the following data series:
* Q-Anchored (PopQA) - Blue solid line
* Q-Anchored (TriviaQA) - Purple solid line
* A-Anchored (PopQA) - Orange dashed line
* A-Anchored (TriviaQA) - Green dashed line
* Q-Anchored (HotpotQA) - Brown dashed-dotted line
* A-Anchored (HotpotQA) - Light Blue dashed-dotted line
* Q-Anchored (NQ) - Teal solid line
* A-Anchored (NQ) - Red dashed line
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 90% accuracy, dips to around 30% at layer 2, then rises and plateaus around 85-95% from layer 8 onwards.
* **Q-Anchored (TriviaQA):** Starts at approximately 90% accuracy, dips to around 40% at layer 2, then rises and plateaus around 80-90% from layer 8 onwards.
* **A-Anchored (PopQA):** Starts at approximately 40% accuracy, remains relatively stable around 40-50% throughout all layers.
* **A-Anchored (TriviaQA):** Starts at approximately 40% accuracy, remains relatively stable around 40-50% throughout all layers.
* **Q-Anchored (HotpotQA):** Starts at approximately 90% accuracy, dips to around 30% at layer 2, then rises and plateaus around 80-90% from layer 8 onwards.
* **A-Anchored (HotpotQA):** Starts at approximately 40% accuracy, remains relatively stable around 40-50% throughout all layers.
* **Q-Anchored (NQ):** Starts at approximately 90% accuracy, dips to around 30% at layer 2, then rises and plateaus around 85-95% from layer 8 onwards.
* **A-Anchored (NQ):** Starts at approximately 40% accuracy, remains relatively stable around 40-50% throughout all layers.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 95% accuracy, dips to around 35% at layer 2, then rises and plateaus around 90-100% from layer 8 onwards.
* **Q-Anchored (TriviaQA):** Starts at approximately 95% accuracy, dips to around 45% at layer 2, then rises and plateaus around 85-95% from layer 8 onwards.
* **A-Anchored (PopQA):** Starts at approximately 40% accuracy, remains relatively stable around 40-50% throughout all layers.
* **A-Anchored (TriviaQA):** Starts at approximately 40% accuracy, remains relatively stable around 40-50% throughout all layers.
* **Q-Anchored (HotpotQA):** Starts at approximately 95% accuracy, dips to around 35% at layer 2, then rises and plateaus around 85-95% from layer 8 onwards.
* **A-Anchored (HotpotQA):** Starts at approximately 40% accuracy, remains relatively stable around 40-50% throughout all layers.
* **Q-Anchored (NQ):** Starts at approximately 95% accuracy, dips to around 35% at layer 2, then rises and plateaus around 90-100% from layer 8 onwards.
* **A-Anchored (NQ):** Starts at approximately 40% accuracy, remains relatively stable around 40-50% throughout all layers.
### Key Observations
* All "Q-Anchored" lines exhibit a similar initial drop in accuracy at the beginning layers (0-2), followed by a recovery and plateauing at higher accuracy levels.
* "A-Anchored" lines consistently show lower and more stable accuracy across all layers, remaining around 40-50%.
* Mistral-7B-v0.3 generally achieves higher accuracy than Mistral-7B-v0.1 across all datasets and anchoring methods.
* The accuracy difference between Q-Anchored and A-Anchored methods is significant, with Q-Anchored consistently outperforming A-Anchored.
### Interpretation
The data show the familiar pattern: Q-Anchored activations support far higher probing accuracy than A-Anchored ones, which stay flat around 40-50% at every depth. The Q-Anchored curves dip sharply around layer 2 before recovering and plateauing from layer 8 onwards, suggesting the earliest layers briefly obscure the linearly accessible signal before the question-to-answer flow re-establishes it. Mistral-7B-v0.3's slightly higher plateau indicates the version update modestly improved how cleanly this information is encoded. The consistency across datasets again suggests that the anchoring pathway, rather than the specific dataset, is the dominant factor in what the probe can recover.
</details>
Figure 29: Comparisons of answer accuracy between pathways, probing MLP activations of the last exact answer token.
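Figures 28 and 29 differ only in which token position is probed: the token immediately preceding the exact-answer span versus the last exact-answer token. A minimal indexing sketch of those two positions, with an invented activation cache and made-up span offsets:

```python
# Hypothetical sketch of the two probed token positions named in the
# captions. The cache contents and answer-span offsets are made up.
import numpy as np

n_layers, seq_len, d_model = 4, 12, 8
# cache[l, t] = MLP activation at layer l, token position t.
cache = np.arange(n_layers * seq_len * d_model, dtype=float).reshape(
    n_layers, seq_len, d_model)

# Suppose the exact-answer tokens occupy positions 7..9 inclusive.
answer_start, answer_end = 7, 9

preceding = cache[:, answer_start - 1]   # token just before the answer span
last_answer = cache[:, answer_end]       # last exact-answer token

# Either selection yields one activation vector per layer to feed a probe.
assert preceding.shape == (n_layers, d_model)
assert last_answer.shape == (n_layers, d_model)
```

Each selection yields a `(n_layers, d_model)` matrix per example, so the same per-layer probing loop applies to both figures unchanged.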
Appendix G I-Don't-Know Rate
<details>
<summary>x75.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
The image presents two line charts comparing the "I-Don't-Know Rate" across different layers of two Llama models: Llama-3.2-1B and Llama-3.2-3B. The charts display the rate for different question-answering datasets (PopQA, TriviaQA, HotpotQA, and NQ) and anchoring methods (Q-Anchored and A-Anchored). The x-axis represents the "Layer" number, and the y-axis represents the "I-Don't-Know Rate" in percentage.
### Components/Axes
* **X-axis:** "Layer" - Ranges from 0 to 15 for the Llama-3.2-1B chart and 0 to 25 for the Llama-3.2-3B chart.
* **Y-axis:** "I-Don't-Know Rate" - Ranges from 0 to 100, representing percentage.
* **Title (Left Chart):** "Llama-3.2-1B"
* **Title (Right Chart):** "Llama-3.2-3B"
* **Legend:** Located at the bottom of each chart. The legend entries are:
* Q-Anchored (PopQA) - Solid Blue Line
* A-Anchored (PopQA) - Dashed Orange Line
* Q-Anchored (TriviaQA) - Solid Purple Line
* A-Anchored (TriviaQA) - Dashed Brown Line
* Q-Anchored (HotpotQA) - Dashed Gray Line
* A-Anchored (HotpotQA) - Solid Gray Line
* Q-Anchored (NQ) - Dashed Teal Line
* A-Anchored (NQ) - Solid Teal Line
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart:**
* **Q-Anchored (PopQA):** Starts at approximately 90%, rapidly decreases to a minimum of around 20% at layer 4, then fluctuates between 30% and 60% until layer 15.
* **A-Anchored (PopQA):** Starts at approximately 70%, decreases to around 40% at layer 3, then remains relatively stable between 40% and 60% until layer 15.
* **Q-Anchored (TriviaQA):** Starts at approximately 80%, decreases to a minimum of around 25% at layer 4, then increases to around 50% at layer 10, and fluctuates between 40% and 60% until layer 15.
* **A-Anchored (TriviaQA):** Starts at approximately 60%, decreases to around 35% at layer 3, then remains relatively stable between 40% and 60% until layer 15.
* **Q-Anchored (HotpotQA):** Starts at approximately 60%, decreases to around 30% at layer 3, then fluctuates between 30% and 50% until layer 15.
* **A-Anchored (HotpotQA):** Starts at approximately 50%, decreases to around 30% at layer 3, then remains relatively stable between 30% and 50% until layer 15.
* **Q-Anchored (NQ):** Starts at approximately 70%, decreases to around 30% at layer 3, then fluctuates between 30% and 50% until layer 15.
* **A-Anchored (NQ):** Starts at approximately 50%, decreases to around 30% at layer 3, then remains relatively stable between 30% and 50% until layer 15.
**Llama-3.2-3B Chart:**
* **Q-Anchored (PopQA):** Starts at approximately 90%, decreases to a minimum of around 20% at layer 4, then fluctuates between 30% and 60% until layer 25.
* **A-Anchored (PopQA):** Starts at approximately 70%, decreases to around 40% at layer 3, then remains relatively stable between 40% and 60% until layer 25.
* **Q-Anchored (TriviaQA):** Starts at approximately 80%, decreases to a minimum of around 25% at layer 4, then increases to around 50% at layer 10, and fluctuates between 40% and 60% until layer 25.
* **A-Anchored (TriviaQA):** Starts at approximately 60%, decreases to around 35% at layer 3, then remains relatively stable between 40% and 60% until layer 25.
* **Q-Anchored (HotpotQA):** Starts at approximately 60%, decreases to around 30% at layer 3, then fluctuates between 30% and 50% until layer 25.
* **A-Anchored (HotpotQA):** Starts at approximately 50%, decreases to around 30% at layer 3, then remains relatively stable between 30% and 50% until layer 25.
* **Q-Anchored (NQ):** Starts at approximately 70%, decreases to around 30% at layer 3, then fluctuates between 30% and 50% until layer 25.
* **A-Anchored (NQ):** Starts at approximately 50%, decreases to around 30% at layer 3, then remains relatively stable between 30% and 50% until layer 25.
### Key Observations
* All lines in both charts exhibit a significant initial drop in "I-Don't-Know Rate" within the first few layers (0-4).
* The "I-Don't-Know Rate" generally stabilizes after layer 5 for the Llama-3.2-1B model and after layer 10 for the Llama-3.2-3B model.
* Q-Anchored methods consistently show higher "I-Don't-Know Rates" compared to A-Anchored methods across all datasets.
* PopQA and TriviaQA datasets generally have higher "I-Don't-Know Rates" than HotpotQA and NQ.
* The Llama-3.2-3B model shows a more prolonged period of fluctuation in the "I-Don't-Know Rate" compared to the Llama-3.2-1B model.
### Interpretation
The charts show how the I-Don't-Know rate evolves as the probe reads progressively deeper layers. The sharp initial drop suggests that the first few layers already resolve much of the uncertainty, after which the rate stabilizes rather than continuing to fall.
The gap between the two conditions indicates that the anchoring pathway affects how confidently an answer can be decoded; per the observations above, the Q-Anchored condition abstains more often here, possibly because the question-anchored signal takes longer to consolidate at these scales.
The varying rates across datasets likely track the inherent difficulty of each dataset's questions, with PopQA and TriviaQA eliciting more abstention than HotpotQA and NQ.
The longer fluctuation period in Llama-3.2-3B is consistent with its greater depth: a deeper model simply takes more layers to settle into a stable representation.
</details>
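The I-Don't-Know rate plotted in this appendix is just the per-layer fraction of examples whose decoded output is an abstention. A minimal sketch, with invented decoded strings standing in for real per-layer probe outputs:

```python
# Hypothetical sketch: compute an I-Don't-Know rate per layer as the
# percentage of decoded outputs that are an abstention string.
# The decoded answers below are invented placeholders.
decoded = {
    0: ["I don't know", "Paris", "I don't know", "1912"],
    1: ["Paris", "Paris", "I don't know", "1912"],
    2: ["Paris", "London", "Berlin", "1912"],
}

IDK = "I don't know"

def idk_rate(answers):
    """Percentage of answers that are the abstention string."""
    return 100.0 * sum(a == IDK for a in answers) / len(answers)

rates = {layer: idk_rate(ans) for layer, ans in decoded.items()}
print(rates)  # the rate falls with depth, as in the figures
```

Repeating this per layer and per anchoring condition yields the curves shown in the figures below.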
<details>
<summary>x76.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
The image presents two side-by-side line charts plotting the I-don't-know rate (y-axis, 0 to 100) against the probing layer (x-axis; roughly 0 to 30 for Llama-3-8B, left, and 0 to 80 for Llama-3-70B, right). Each chart contains eight series: the Q-Anchored (solid) and A-Anchored (dotted) pathways on PopQA, TriviaQA, HotpotQA, and NQ, identified in a shared legend at the bottom.
### Detailed Analysis or Content Details
* In Llama-3-8B, the Q-Anchored series mostly dip to around 10-30 by layer 5 and then fluctuate between roughly 10 and 60 (HotpotQA highest, NQ lowest), while the A-Anchored series settle higher, mostly between 50 and 80.
* In Llama-3-70B, the same ordering holds, but every series fluctuates more strongly from layer to layer.
### Key Observations
* In both charts, each A-Anchored series consistently shows a higher I-don't-know rate than its Q-Anchored counterpart.
* Rates generally decrease over the earliest layers (roughly 0-5), then stabilize or fluctuate.
* Llama-3-70B exhibits more pronounced layer-to-layer fluctuations than Llama-3-8B.
* Among the Q-Anchored series, NQ consistently has the lowest rate.
### Interpretation
Across all four datasets and both model scales, probes reading the A-Anchored pathway abstain ("I don't know") substantially more often than probes reading the Q-Anchored pathway. This gap emerges within the first few layers and persists through the rest of the network, and the qualitative pattern is unchanged by model scale.
</details>
<details>
<summary>x77.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Mistral Models
### Overview
Two side-by-side line charts compare the I-don't-know rate (y-axis, 0 to 100) across probing layers (x-axis, roughly 0 to 32) of Mistral-7B-v0.1 (left) and Mistral-7B-v0.3 (right). Each chart shows eight series, identified in a shared legend at the bottom: the Q-Anchored and A-Anchored pathways on PopQA, TriviaQA, HotpotQA, and NQ.
### Detailed Analysis or Content Details
* In both charts, every series starts high (roughly 60-90), dips sharply to roughly 10-40 around layer 8, and then fluctuates; the HotpotQA series end highest (around 70-80) and the NQ series end lowest (around 30).
* The per-series curves of v0.1 and v0.3 are nearly indistinguishable.
### Key Observations
* All series dip markedly around layer 8 and largely stabilize after layer 16.
* HotpotQA consistently shows the highest rates and NQ the lowest.
* The v0.1 and v0.3 charts are nearly identical in both trend and magnitude.
### Interpretation
The sharp drop around layer 8 indicates that the probes' predictions become far more decisive once the early layers are passed. Dataset differences dominate at this token position: the multi-hop HotpotQA questions leave the probes uncertain far more often than the more direct NQ questions, while the pathway (Q- vs. A-Anchored) shifts the level comparatively little. The near-identical v0.1 and v0.3 curves suggest the layer-wise behavior is stable across these model versions.
</details>
Figure 30: Comparisons of I-don't-know rate between pathways, probing attention activations of the final token.
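The figures above report an I-don't-know rate per probing layer. As a rough illustration of how such a per-layer rate could be computed, the sketch below counts probe predictions that fall inside an uncertainty band; the function name, input format, and the 0.4-0.6 band are assumptions for illustration, not the paper's actual procedure.

```python
# Hypothetical sketch: per-layer "I-don't-know" rate from probe outputs.
# `probs_by_layer[l]` holds one probe's truthfulness probabilities for all
# examples at layer l; predictions inside the (low, high) band are treated
# as "I don't know". The 0.4-0.6 band is an illustrative choice.
def idk_rate_per_layer(probs_by_layer, low=0.4, high=0.6):
    rates = []
    for probs in probs_by_layer:
        idk = sum(1 for p in probs if low < p < high) / len(probs)
        rates.append(100.0 * idk)  # percentage, matching the figures' y-axis
    return rates

# Example: one decisive layer, then one uncertain layer.
print(idk_rate_per_layer([[0.05, 0.95, 0.9, 0.1], [0.45, 0.55, 0.9, 0.5]]))
# prints [0.0, 75.0]
```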
<details>
<summary>x78.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
Two side-by-side line charts plot the I-don't-know rate (y-axis, 0 to 100) against the probing layer (x-axis; 0 to 15 for Llama-3.2-1B, left, and 0 to 25 for Llama-3.2-3B, right). Each chart shows eight series, identified in a shared legend at the bottom: the Q-Anchored and A-Anchored pathways on PopQA, TriviaQA, HotpotQA, and NQ.
### Detailed Analysis or Content Details
* In both charts, all series start high (roughly 60-90) and drop sharply by layer 5; the Q-Anchored series then fluctuate between roughly 10 and 50, while the A-Anchored series remain higher and more stable, roughly between 40 and 70.
### Key Observations
* Both charts show a steep decrease in the rate over the initial layers (0-5), after which it largely stabilizes.
* The Q-Anchored series generally sit below the A-Anchored series, especially for Llama-3.2-3B.
* Llama-3.2-3B reaches lower rates than Llama-3.2-1B across datasets and pathways.
* PopQA tends to show the lowest rates among the datasets.
### Interpretation
Probing deeper layers initially lowers the rate sharply before it plateaus, indicating that truthfulness signals become decodable within the first few layers and that additional depth yields diminishing returns. The persistently higher rates of the A-Anchored pathway show that its probes abstain more often than Q-Anchored probes, and the lower rates of the larger 3B model suggest that scale makes these signals easier to decode.
</details>
<details>
<summary>x79.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
Two side-by-side line charts compare the I-don't-know rate (y-axis, 0 to 100) across probing layers of Llama-3-8B (left, layers 0 to 30) and Llama-3-70B (right, layers 0 to 80). Each chart shows eight series, identified in a shared legend: the Q-Anchored (solid) and A-Anchored (dotted) pathways on PopQA, TriviaQA, HotpotQA, and NQ.
### Detailed Analysis or Content Details
* All series start very high (roughly 75-95), drop steeply within the first layers (by layer 5 for the 8B model, layer 10 for the 70B model), fluctuate through the middle layers, and rise again toward the final layers.
* The Q-Anchored series start higher at layer 0 but drop further, so through the middle layers the A-Anchored series generally show the higher rates (e.g., roughly 50-80 versus 20-60 for PopQA).
### Key Observations
* All series decrease sharply in the initial layers, fluctuate within a band through the middle layers, and trend upward again toward the final layers.
* Through the middle layers, each A-Anchored series sits above its Q-Anchored counterpart.
* HotpotQA consistently shows the lowest rates in both models.
### Interpretation
The early drop and late rise form a U-like profile: the probes are most decisive at intermediate depths and become less so again in the very last layers. The two models show qualitatively similar behavior; the 70B model simply stretches the same profile over more layers, indicating the pattern is consistent across scale.
</details>
<details>
<summary>x80.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Mistral Models
### Overview
Two side-by-side line charts compare the I-don't-know rate (y-axis, 0 to 100) across probing layers (x-axis, 0 to 30) of Mistral-7B-v0.1 (left) and Mistral-7B-v0.3 (right). Each chart shows eight series, identified in a shared legend: the Q-Anchored (solid) and A-Anchored (dashed) pathways on PopQA, TriviaQA, HotpotQA, and NQ.
### Detailed Analysis or Content Details
* In both charts, the series start at roughly 40-80, dip to roughly 10-40 around layer 5, and then fluctuate; the Q-Anchored PopQA and HotpotQA series end highest (around 60-70), while the NQ series end lowest (around 30-50).
* The per-series curves of v0.1 and v0.3 are nearly indistinguishable.
### Key Observations
* All series dip sharply around layer 5 before fluctuating through the remaining layers.
* The ordering of the two pathways varies by dataset: the Q-Anchored series sit higher for PopQA and HotpotQA, while the A-Anchored series sit higher for TriviaQA and NQ.
* HotpotQA tends to show the highest rates and NQ the lowest.
* The v0.1 and v0.3 charts show nearly identical trends and values.
### Interpretation
The dip around layer 5 marks the depth at which the probes become decisive, and the close match between v0.1 and v0.3 indicates the layer-wise behavior is unchanged between these model versions. At this token position the relative ordering of the Q-Anchored and A-Anchored pathways depends on the dataset, while HotpotQA's consistently high rates point to the added difficulty of its multi-hop questions.
</details>
Figure 31: Comparisons of I-don't-know rate between pathways, probing attention activations of the token immediately preceding the exact answer tokens.
<details>
<summary>x81.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
Two side-by-side line charts plot the I-don't-know rate (y-axis, 0 to 100) against the probing layer (x-axis; roughly 0 to 15 for Llama-3.2-1B, left, and 0 to 25 for Llama-3.2-3B, right). Each chart shows eight series, identified in a shared legend at the bottom: the Q-Anchored and A-Anchored pathways on PopQA, TriviaQA, HotpotQA, and NQ.
### Detailed Analysis or Content Details
* In both charts, all series start at roughly 60-85 and drop steeply by layer 5; the Q-Anchored series then fluctuate between roughly 10 and 50, while the A-Anchored series remain more stable between roughly 40 and 60.
### Key Observations
* Every series decreases sharply between layers 0 and 5, then stabilizes with fluctuations between roughly 20 and 60.
* The Q-Anchored series generally sit below the A-Anchored series.
* HotpotQA consistently shows somewhat higher rates than the other datasets.
* The two models show very similar trends, with Llama-3.2-3B starting slightly higher at layer 0.
### Interpretation
Truthfulness signals become decodable within the first few layers, after which the rate plateaus and deeper probing yields little further reduction. The A-Anchored pathway's persistently higher rates indicate that its probes abstain more often than Q-Anchored probes at this token position, and HotpotQA's elevated rates likely reflect the greater ambiguity of its multi-hop questions. The similarity across the two model sizes suggests the pattern stems from the shared architecture and training rather than from scale.
</details>
<details>
<summary>x82.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
The image presents two line charts, side-by-side, displaying the "I-Don't-Know Rate" against the "Layer" number for two different Llama models: Llama-3-8B and Llama-3-70B. Each chart contains multiple lines representing different question-answering datasets and anchoring methods. The charts aim to visualize how the rate of the model responding with "I-Don't-Know" changes as the model's layer depth increases.
### Components/Axes
* **X-axis:** "Layer" - Ranges from 0 to 30 for Llama-3-8B and 0 to 80 for Llama-3-70B. The axis is linearly scaled with gridlines.
* **Y-axis:** "I-Don't-Know Rate" - Ranges from 0 to 100, representing the percentage of times the model responds with "I-Don't-Know". The axis is linearly scaled with gridlines.
* **Title (Left Chart):** "Llama-3-8B"
* **Title (Right Chart):** "Llama-3-70B"
* **Legend:** Located at the bottom of the image, spanning both charts. It identifies the different lines by dataset and anchoring method.
* Q-Anchored (PopQA) - Blue solid line
* A-Anchored (PopQA) - Orange dashed line
* Q-Anchored (TriviaQA) - Purple solid line
* A-Anchored (TriviaQA) - Brown dashed line
* Q-Anchored (HotpotQA) - Green solid line
* A-Anchored (HotpotQA) - Red dashed line
* Q-Anchored (NQ) - Cyan solid line
* A-Anchored (NQ) - Magenta dashed line
### Detailed Analysis or Content Details
**Llama-3-8B (Left Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 80, rapidly decreases to around 10-15 by layer 5, then fluctuates between 10 and 25 until layer 30.
* **A-Anchored (PopQA):** Starts at approximately 80, decreases to around 50 by layer 5, then fluctuates between 50 and 70 until layer 30.
* **Q-Anchored (TriviaQA):** Starts at approximately 80, decreases to around 20 by layer 5, then fluctuates between 20 and 35 until layer 30.
* **A-Anchored (TriviaQA):** Starts at approximately 80, decreases to around 60 by layer 5, then fluctuates between 60 and 75 until layer 30.
* **Q-Anchored (HotpotQA):** Starts at approximately 80, decreases to around 10 by layer 5, then fluctuates between 10 and 20 until layer 30.
* **A-Anchored (HotpotQA):** Starts at approximately 80, decreases to around 50 by layer 5, then fluctuates between 50 and 70 until layer 30.
* **Q-Anchored (NQ):** Starts at approximately 80, decreases to around 10 by layer 5, then fluctuates between 10 and 20 until layer 30.
* **A-Anchored (NQ):** Starts at approximately 80, decreases to around 50 by layer 5, then fluctuates between 50 and 70 until layer 30.
**Llama-3-70B (Right Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 80, decreases to around 20 by layer 10, then fluctuates between 20 and 40 until layer 80.
* **A-Anchored (PopQA):** Starts at approximately 80, decreases to around 50 by layer 10, then fluctuates between 50 and 70 until layer 80.
* **Q-Anchored (TriviaQA):** Starts at approximately 80, decreases to around 30 by layer 10, then fluctuates between 30 and 50 until layer 80.
* **A-Anchored (TriviaQA):** Starts at approximately 80, decreases to around 60 by layer 10, then fluctuates between 60 and 80 until layer 80.
* **Q-Anchored (HotpotQA):** Starts at approximately 80, decreases to around 20 by layer 10, then fluctuates between 20 and 40 until layer 80.
* **A-Anchored (HotpotQA):** Starts at approximately 80, decreases to around 50 by layer 10, then fluctuates between 50 and 70 until layer 80.
* **Q-Anchored (NQ):** Starts at approximately 80, decreases to around 20 by layer 10, then fluctuates between 20 and 40 until layer 80.
* **A-Anchored (NQ):** Starts at approximately 80, decreases to around 50 by layer 10, then fluctuates between 50 and 70 until layer 80.
### Key Observations
* Both models exhibit a significant drop in "I-Don't-Know Rate" in the initial layers (0-5).
* The "I-Don't-Know Rate" generally stabilizes after the initial drop, fluctuating within a certain range for the remaining layers.
* "A-Anchored" methods consistently show higher "I-Don't-Know Rates" than "Q-Anchored" methods across all datasets for both models.
* The 70B model generally exhibits a lower "I-Don't-Know Rate" than the 8B model, especially in the later layers.
* The PopQA, TriviaQA, HotpotQA, and NQ datasets all show similar trends, though the specific values differ.
### Interpretation
The charts demonstrate that probing successively deeper layers of the Llama models initially reduces the frequency of "I-Don't-Know" predictions. This suggests that the early layers are crucial for resolving basic knowledge and reducing uncertainty. Beyond a certain depth, however, probing deeper layers does not significantly decrease the "I-Don't-Know Rate," indicating that the decodable knowledge signal plateaus.
The consistent difference between the "Q-Anchored" and "A-Anchored" pathways suggests that whether the signal is anchored on the question or on the generated answer influences the probed confidence. The "A-Anchored" pathway, which derives evidence from the answer itself, yields a higher "I-Don't-Know Rate," potentially because it is more sensitive to ambiguous or incomplete information.
The lower "I-Don't-Know Rate" for the 70B model compared to the 8B model highlights the benefits of scaling model size. Larger models are better able to generalize and provide answers, even when faced with challenging or ambiguous questions. The fluctuations in the "I-Don't-Know Rate" after the initial drop could be due to the complexity of the datasets and the inherent limitations of the models. The datasets themselves may contain questions that are genuinely difficult or require specialized knowledge, leading to increased uncertainty.
</details>
<details>
<summary>x83.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Mistral Models
### Overview
The image presents two line charts, side-by-side, comparing the "I-Don't-Know Rate" across different layers of two Mistral language models: Mistral-7B-v0.1 and Mistral-7B-v0.3. The x-axis represents the "Layer" (ranging from 0 to 30), and the y-axis represents the "I-Don't-Know Rate" (ranging from 0 to 100). Each chart displays multiple lines, each representing a different question-answering dataset and anchoring method. Shaded areas around each line indicate the variance or confidence interval.
### Components/Axes
* **X-axis:** Layer (0 to 30)
* **Y-axis:** I-Don't-Know Rate (0 to 100)
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend:**
* Q-Anchored (PopQA) - Blue solid line
* A-Anchored (PopQA) - Orange dashed line
* Q-Anchored (TriviaQA) - Purple solid line
* A-Anchored (TriviaQA) - Red dashed line
* Q-Anchored (HotpotQA) - Brown dashed-dotted line
* A-Anchored (HotpotQA) - Green solid line
* Q-Anchored (NQ) - Teal dashed line
* A-Anchored (NQ) - Grey solid line
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 80, rapidly decreases to around 10 by layer 5, then fluctuates between 10 and 20 for the remainder of the layers.
* **A-Anchored (PopQA):** Starts at approximately 85, decreases to around 60 by layer 5, then remains relatively stable between 60 and 75 for the rest of the layers.
* **Q-Anchored (TriviaQA):** Starts at approximately 70, decreases to around 30 by layer 5, then fluctuates between 30 and 50 for the remainder of the layers.
* **A-Anchored (TriviaQA):** Starts at approximately 75, decreases to around 55 by layer 5, then remains relatively stable between 55 and 70 for the rest of the layers.
* **Q-Anchored (HotpotQA):** Starts at approximately 80, decreases to around 40 by layer 5, then fluctuates between 40 and 60 for the remainder of the layers.
* **A-Anchored (HotpotQA):** Starts at approximately 75, decreases to around 40 by layer 5, then remains relatively stable between 40 and 55 for the rest of the layers.
* **Q-Anchored (NQ):** Starts at approximately 60, decreases to around 20 by layer 5, then fluctuates between 20 and 30 for the remainder of the layers.
* **A-Anchored (NQ):** Starts at approximately 65, decreases to around 30 by layer 5, then remains relatively stable between 30 and 40 for the rest of the layers.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA):** Starts at approximately 80, rapidly decreases to around 10 by layer 5, then fluctuates between 10 and 20 for the remainder of the layers.
* **A-Anchored (PopQA):** Starts at approximately 85, decreases to around 60 by layer 5, then remains relatively stable between 60 and 75 for the rest of the layers.
* **Q-Anchored (TriviaQA):** Starts at approximately 70, decreases to around 30 by layer 5, then fluctuates between 30 and 50 for the remainder of the layers.
* **A-Anchored (TriviaQA):** Starts at approximately 75, decreases to around 55 by layer 5, then remains relatively stable between 55 and 70 for the rest of the layers.
* **Q-Anchored (HotpotQA):** Starts at approximately 80, decreases to around 40 by layer 5, then fluctuates between 40 and 60 for the remainder of the layers.
* **A-Anchored (HotpotQA):** Starts at approximately 75, decreases to around 40 by layer 5, then remains relatively stable between 40 and 55 for the rest of the layers.
* **Q-Anchored (NQ):** Starts at approximately 60, decreases to around 20 by layer 5, then fluctuates between 20 and 30 for the remainder of the layers.
* **A-Anchored (NQ):** Starts at approximately 65, decreases to around 30 by layer 5, then remains relatively stable between 30 and 40 for the rest of the layers.
### Key Observations
* All lines in both charts exhibit a steep decline in "I-Don't-Know Rate" from layer 0 to layer 5.
* After layer 5, the "I-Don't-Know Rate" stabilizes, with fluctuations generally within a range of 10-75.
* "A-Anchored" lines consistently show higher "I-Don't-Know Rates" than their corresponding "Q-Anchored" counterparts across all datasets.
* The two charts (v0.1 and v0.3) are remarkably similar in shape and trend, suggesting that the model updates between versions did not drastically alter the "I-Don't-Know Rate" behavior.
* PopQA consistently has the highest I-Don't-Know rate, while NQ has the lowest.
### Interpretation
The charts show how the probed confidence (or lack thereof) evolves across layers. The initial high "I-Don't-Know Rate" likely reflects uncertainty in the earliest representations of the input. The rapid decrease from layer 0 to 5 suggests that the model quickly resolves the relevant information needed to form an initial response. The stabilization after layer 5 indicates that later layers contribute less to reducing uncertainty.
The difference between "Q-Anchored" and "A-Anchored" lines suggests that the method of anchoring (question vs. answer) impacts the model's confidence. The higher "I-Don't-Know Rate" for "A-Anchored" lines could indicate that the model finds it more challenging to reason from answers than from questions.
The similarity between the two model versions (v0.1 and v0.3) suggests that the updates primarily focused on improving performance without fundamentally changing the model's confidence profile. The differences in I-Don't-Know rates across datasets (PopQA, TriviaQA, HotpotQA, NQ) likely reflect the inherent difficulty and complexity of each dataset. PopQA appears to be the most challenging, while NQ is the easiest for the model to answer.
</details>
Figure 32: Comparisons of i-don't-know rate between pathways, probing attention activations of the last exact answer token.
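The captions in this appendix refer to layer-wise linear probes over cached activations. As a rough illustration only (not the paper's implementation), the sketch below computes an I-don't-know rate per layer with a NumPy-only logistic probe on synthetic activations; all shapes, the synthetic data, and the `train_linear_probe` helper are hypothetical stand-ins:

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=300):
    # Minimal logistic-regression probe fit by gradient descent.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

rng = np.random.default_rng(0)
n, n_layers, d = 400, 6, 32

# Synthetic stand-in for cached activations: acts[i, l] is the hidden
# state of example i at layer l (in the figures these would be attention
# or MLP activations at a chosen token position).
acts = rng.normal(size=(n, n_layers, d))
labels = rng.integers(0, 2, size=n).astype(float)  # 1 = "doesn't know"

idk_rates = []
for l in range(n_layers):
    w, b = train_linear_probe(acts[:200, l], labels[:200])
    # I-don't-know rate at layer l: fraction of held-out examples the
    # probe classifies as "doesn't know".
    preds = (acts[200:, l] @ w + b) > 0
    idk_rates.append(float(np.mean(preds)))
```

Plotting `idk_rates` against the layer index, separately for question-anchored and answer-anchored activations, would reproduce the general shape of the curves described in these figures.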
<details>
<summary>x84.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
The image presents two line charts, side-by-side, displaying the "I-Don't-Know Rate" as a function of "Layer" for two different Llama models: Llama-3.2-1B and Llama-3.2-3B. Each chart shows multiple lines representing different question-answering datasets and anchoring methods. The charts are designed to compare how the rate of the model responding with "I-Don't-Know" changes as the model's layers increase.
### Components/Axes
* **X-axis:** "Layer" - Ranges from approximately 0 to 15 for the Llama-3.2-1B chart and from 0 to 25 for the Llama-3.2-3B chart.
* **Y-axis:** "I-Don't-Know Rate" - Ranges from 0 to 100.
* **Title (Left Chart):** "Llama-3.2-1B"
* **Title (Right Chart):** "Llama-3.2-3B"
* **Legend:** Located at the bottom of the image, contains the following labels and corresponding line styles/colors:
* Q-Anchored (PopQA) - Solid Blue Line
* A-Anchored (PopQA) - Solid Orange Line
* Q-Anchored (TriviaQA) - Solid Green Line
* A-Anchored (TriviaQA) - Solid Purple Line
* Q-Anchored (HotpotQA) - Dashed Blue Line
* A-Anchored (HotpotQA) - Dashed Orange Line
* Q-Anchored (NQ) - Dashed Green Line
* A-Anchored (NQ) - Dashed Purple Line
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart:**
* **Q-Anchored (PopQA):** Starts at approximately 15, drops to a minimum of around 10 at layer 3, then gradually increases to approximately 55 by layer 15.
* **A-Anchored (PopQA):** Starts at approximately 60, decreases to a minimum of around 40 at layer 3, then fluctuates between 50 and 65 until layer 15.
* **Q-Anchored (TriviaQA):** Starts at approximately 85, drops sharply to around 20 at layer 3, then increases to approximately 50 by layer 15.
* **A-Anchored (TriviaQA):** Starts at approximately 70, decreases to around 30 at layer 3, then increases to approximately 60 by layer 15.
* **Q-Anchored (HotpotQA):** Starts at approximately 60, decreases to around 25 at layer 3, then fluctuates between 40 and 60 until layer 15.
* **A-Anchored (HotpotQA):** Starts at approximately 65, decreases to around 35 at layer 3, then fluctuates between 45 and 65 until layer 15.
* **Q-Anchored (NQ):** Starts at approximately 75, drops to around 25 at layer 3, then increases to approximately 55 by layer 15.
* **A-Anchored (NQ):** Starts at approximately 70, decreases to around 30 at layer 3, then increases to approximately 60 by layer 15.
**Llama-3.2-3B Chart:**
* **Q-Anchored (PopQA):** Starts at approximately 15, drops to a minimum of around 10 at layer 3, then fluctuates between 30 and 60 until layer 25.
* **A-Anchored (PopQA):** Starts at approximately 60, decreases to a minimum of around 40 at layer 3, then fluctuates between 50 and 70 until layer 25.
* **Q-Anchored (TriviaQA):** Starts at approximately 85, drops sharply to around 20 at layer 3, then increases to approximately 50 by layer 25.
* **A-Anchored (TriviaQA):** Starts at approximately 70, decreases to around 30 at layer 3, then increases to approximately 60 by layer 25.
* **Q-Anchored (HotpotQA):** Starts at approximately 60, decreases to around 25 at layer 3, then fluctuates between 40 and 60 until layer 25.
* **A-Anchored (HotpotQA):** Starts at approximately 65, decreases to around 35 at layer 3, then fluctuates between 45 and 65 until layer 25.
* **Q-Anchored (NQ):** Starts at approximately 75, drops to around 25 at layer 3, then increases to approximately 55 by layer 25.
* **A-Anchored (NQ):** Starts at approximately 70, decreases to around 30 at layer 3, then increases to approximately 60 by layer 25.
### Key Observations
* All lines in both charts exhibit a significant drop in "I-Don't-Know Rate" within the first few layers (up to layer 3).
* After the initial drop, the lines generally stabilize or exhibit more gradual increases.
* The "Q-Anchored" lines tend to have lower "I-Don't-Know Rates" than the corresponding "A-Anchored" lines across all datasets.
* The "TriviaQA" dataset consistently shows a higher initial "I-Don't-Know Rate" compared to other datasets.
* The Llama-3.2-3B model generally exhibits a more stable "I-Don't-Know Rate" across layers compared to the Llama-3.2-1B model.
### Interpretation
The data suggest that probing deeper layers of the Llama models initially improves the decodability of an answer, as evidenced by the decrease in the "I-Don't-Know Rate." Beyond a certain depth (around layer 3), however, the improvement plateaus, and the rate may even increase slightly. The difference between the "Q-Anchored" and "A-Anchored" pathways indicates that question-anchored signals reduce uncertainty more effectively than answer-anchored ones. The higher initial "I-Don't-Know Rate" for the "TriviaQA" dataset suggests that it presents more challenging questions for these models, and the greater stability of the Llama-3.2-3B model suggests that a larger model yields more consistent behavior across layers.
</details>
<details>
<summary>x85.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
The image presents two line charts, side-by-side, displaying the "I-Don't-Know Rate" as a function of "Layer" for two different Llama models: Llama-3-8B and Llama-3-70B. Each chart contains multiple lines representing different question-answering datasets and anchoring methods. The charts aim to visualize how the rate of the model responding with "I-Don't-Know" changes across different layers of the neural network.
### Components/Axes
* **X-axis:** "Layer" - Ranges from 0 to approximately 30 for the Llama-3-8B chart and 0 to approximately 80 for the Llama-3-70B chart.
* **Y-axis:** "I-Don't-Know Rate" - Ranges from 0 to 100.
* **Models:** Llama-3-8B (left chart), Llama-3-70B (right chart).
* **Datasets/Anchoring Methods (Legend):**
* Q-Anchored (PopQA) - Blue solid line
* A-Anchored (PopQA) - Orange dashed line
* Q-Anchored (TriviaQA) - Purple solid line
* A-Anchored (TriviaQA) - Red dashed line
* Q-Anchored (HotpotQA) - Green dashed-dotted line
* A-Anchored (HotpotQA) - Gray solid line
* Q-Anchored (NQ) - Cyan solid line
* A-Anchored (NQ) - Magenta dashed line
### Detailed Analysis or Content Details
**Llama-3-8B Chart:**
* **Q-Anchored (PopQA):** Starts at approximately 90, rapidly decreases to around 20 by layer 5, then fluctuates between 20 and 40.
* **A-Anchored (PopQA):** Starts at approximately 80, decreases to around 50 by layer 5, then remains relatively stable between 50 and 70.
* **Q-Anchored (TriviaQA):** Starts at approximately 85, decreases to around 30 by layer 5, then fluctuates between 30 and 50.
* **A-Anchored (TriviaQA):** Starts at approximately 80, decreases to around 50 by layer 5, then remains relatively stable between 50 and 70.
* **Q-Anchored (HotpotQA):** Starts at approximately 80, decreases to around 20 by layer 5, then fluctuates between 20 and 40.
* **A-Anchored (HotpotQA):** Starts at approximately 75, decreases to around 40 by layer 5, then remains relatively stable between 40 and 60.
* **Q-Anchored (NQ):** Starts at approximately 70, decreases to around 15 by layer 5, then fluctuates between 15 and 30.
* **A-Anchored (NQ):** Starts at approximately 70, decreases to around 40 by layer 5, then remains relatively stable between 40 and 60.
**Llama-3-70B Chart:**
* **Q-Anchored (PopQA):** Starts at approximately 90, decreases to around 20 by layer 10, then fluctuates between 20 and 40.
* **A-Anchored (PopQA):** Starts at approximately 80, decreases to around 50 by layer 10, then remains relatively stable between 50 and 70.
* **Q-Anchored (TriviaQA):** Starts at approximately 85, decreases to around 30 by layer 10, then fluctuates between 30 and 50.
* **A-Anchored (TriviaQA):** Starts at approximately 80, decreases to around 50 by layer 10, then remains relatively stable between 50 and 70.
* **Q-Anchored (HotpotQA):** Starts at approximately 80, decreases to around 20 by layer 10, then fluctuates between 20 and 40.
* **A-Anchored (HotpotQA):** Starts at approximately 75, decreases to around 40 by layer 10, then remains relatively stable between 40 and 60.
* **Q-Anchored (NQ):** Starts at approximately 70, decreases to around 15 by layer 10, then fluctuates between 15 and 30.
* **A-Anchored (NQ):** Starts at approximately 70, decreases to around 40 by layer 10, then remains relatively stable between 40 and 60.
### Key Observations
* In both charts, all lines exhibit a steep initial decline in the "I-Don't-Know Rate" within the first 5-10 layers.
* After the initial decline, the lines tend to stabilize, fluctuating within a certain range.
* "A-Anchored" methods generally have higher "I-Don't-Know Rates" than "Q-Anchored" methods across all datasets.
* The "NQ" dataset consistently shows the lowest "I-Don't-Know Rate" among the "Q-Anchored" methods.
* The 70B model appears to exhibit similar trends to the 8B model, but extends to a larger number of layers.
### Interpretation
The charts demonstrate how the confidence of the Llama models (as measured by the "I-Don't-Know Rate") evolves as information propagates through the layers of the neural network. The initial steep decline suggests that the early layers are crucial for establishing a basic understanding of the input and reducing uncertainty. The subsequent stabilization indicates that further layers refine this understanding but do not significantly alter the overall confidence level.
The difference between the "Q-Anchored" and "A-Anchored" pathways suggests that the source of the signal influences the probed confidence. The "A-Anchored" pathway, which derives evidence from the generated answer itself, leads to higher "I-Don't-Know Rates," potentially because the model is more cautious when assessing a candidate answer.
The consistent performance of the "NQ" dataset with "Q-Anchored" methods could indicate that this dataset is particularly well-suited for the model's architecture or training process. The fact that the trends are similar between the 8B and 70B models suggests that increasing model size does not fundamentally change the underlying behavior, but rather extends the range over which this behavior is observed. The charts provide valuable insights into the internal workings of these language models and can inform strategies for improving their performance and reliability.
</details>
<details>
<summary>x86.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Mistral Models
### Overview
The image presents two line charts, side-by-side, comparing the "I-Don't-Know Rate" across different layers of two Mistral language models: Mistral-7B-v0.1 and Mistral-7B-v0.3. The x-axis represents the "Layer" (ranging from 0 to 30), and the y-axis represents the "I-Don't-Know Rate" (ranging from 0 to 100). Each chart displays multiple lines, each representing a different data series based on the anchoring method (Q-Anchored or A-Anchored) and the dataset used (PopQA, TriviaQA, HotpotQA, NQ).
### Components/Axes
* **X-axis:** Layer (0 to 30)
* **Y-axis:** I-Don't-Know Rate (0 to 100)
* **Left Chart Title:** Mistral-7B-v0.1
* **Right Chart Title:** Mistral-7B-v0.3
* **Legend (Bottom Center):**
* Blue Line: Q-Anchored (PopQA)
* Orange Line: A-Anchored (PopQA)
* Green Line: Q-Anchored (TriviaQA)
* Light Blue Line: A-Anchored (TriviaQA)
* Purple Line: Q-Anchored (HotpotQA)
* Red Line: A-Anchored (HotpotQA)
* Teal Line: Q-Anchored (NQ)
* Gray Line: A-Anchored (NQ)
### Detailed Analysis or Content Details
**Mistral-7B-v0.1 (Left Chart):**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 95, rapidly decreases to around 15 by layer 10, then fluctuates between 10 and 25 until layer 30.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 90, decreases to around 50 by layer 10, then gradually decreases to around 30-40, with some fluctuations, until layer 30.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 90, decreases to around 20 by layer 10, then fluctuates between 20 and 30 until layer 30.
* **A-Anchored (TriviaQA) - Light Blue Line:** Starts at approximately 95, decreases to around 60 by layer 10, then gradually decreases to around 40-50, with some fluctuations, until layer 30.
* **Q-Anchored (HotpotQA) - Purple Line:** Starts at approximately 85, decreases to around 40 by layer 10, then fluctuates between 30 and 50 until layer 30.
* **A-Anchored (HotpotQA) - Red Line:** Starts at approximately 90, decreases to around 60 by layer 10, then gradually decreases to around 50-60, with some fluctuations, until layer 30.
* **Q-Anchored (NQ) - Teal Line:** Starts at approximately 80, decreases to around 10 by layer 10, then fluctuates between 10 and 20 until layer 30.
* **A-Anchored (NQ) - Gray Line:** Starts at approximately 85, decreases to around 40 by layer 10, then gradually decreases to around 30-40, with some fluctuations, until layer 30.
**Mistral-7B-v0.3 (Right Chart):**
* **Q-Anchored (PopQA) - Blue Line:** Starts at approximately 95, rapidly decreases to around 10 by layer 10, then fluctuates between 10 and 20 until layer 30.
* **A-Anchored (PopQA) - Orange Line:** Starts at approximately 90, decreases to around 50 by layer 10, then gradually decreases to around 40-50, with some fluctuations, until layer 30.
* **Q-Anchored (TriviaQA) - Green Line:** Starts at approximately 90, decreases to around 20 by layer 10, then fluctuates between 20 and 30 until layer 30.
* **A-Anchored (TriviaQA) - Light Blue Line:** Starts at approximately 95, decreases to around 60 by layer 10, then gradually decreases to around 40-50, with some fluctuations, until layer 30.
* **Q-Anchored (HotpotQA) - Purple Line:** Starts at approximately 85, decreases to around 40 by layer 10, then fluctuates between 30 and 50 until layer 30.
* **A-Anchored (HotpotQA) - Red Line:** Starts at approximately 90, decreases to around 60 by layer 10, then gradually decreases to around 50-60, with some fluctuations, until layer 30.
* **Q-Anchored (NQ) - Teal Line:** Starts at approximately 80, decreases to around 10 by layer 10, then fluctuates between 10 and 20 until layer 30.
* **A-Anchored (NQ) - Gray Line:** Starts at approximately 85, decreases to around 40 by layer 10, then gradually decreases to around 30-40, with some fluctuations, until layer 30.
### Key Observations
* Both models (v0.1 and v0.3) exhibit a significant decrease in "I-Don't-Know Rate" in the initial layers (0-10).
* Q-Anchored data series generally have lower "I-Don't-Know Rates" than A-Anchored series, especially for PopQA and NQ datasets.
* The "I-Don't-Know Rate" tends to stabilize after layer 10 for most data series.
* Mistral-7B-v0.3 consistently shows lower "I-Don't-Know Rates" compared to Mistral-7B-v0.1 across all datasets and anchoring methods.
### Interpretation
The charts illustrate how probing depth (layer) and pathway shape the probed confidence. The initial steep decline in "I-Don't-Know Rate" suggests that the early layers of the model are crucial for resolving basic knowledge and reducing uncertainty. The difference between the Q-Anchored and A-Anchored series indicates that the pathway used to derive the signal matters: the question-anchored pathway appears more effective at eliciting confident responses.
The consistently lower rates for Mistral-7B-v0.3 relative to v0.1 suggest that the model updates produced a more knowledgeable and confident model, able to answer a wider range of questions with greater certainty. The stabilization of the "I-Don't-Know Rate" after layer 10 implies that probing deeper layers yields diminishing returns in reducing uncertainty. The datasets (PopQA, TriviaQA, HotpotQA, NQ) pose different knowledge and reasoning challenges, and the variation in rates across them highlights the model's relative strengths and weaknesses.
</details>
Figure 33: Comparisons of i-don't-know rate between pathways, probing MLP activations of the final token.
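A recurring pattern across these figures is that the A-anchored curve plateaus well above the Q-anchored one after the early-layer drop. A toy numerical restatement of that shape, with synthetic curves whose constants are purely illustrative (not values read from the figures):

```python
import numpy as np

# Synthetic per-layer I-don't-know rates (percent), mimicking the typical
# shapes described above: a steep early-layer drop followed by a plateau,
# with the A-anchored curve plateauing higher than the Q-anchored one.
layers = np.arange(32)
q_rate = 80 * np.exp(-layers / 2) + 15  # Q-anchored: drops to ~15
a_rate = 30 * np.exp(-layers / 2) + 55  # A-anchored: drops to ~55

# After the early drop (layer >= 5), the A-anchored pathway keeps a
# markedly higher I-don't-know rate than the Q-anchored one.
plateau_gap = float(np.mean(a_rate[5:] - q_rate[5:]))
```

Here `plateau_gap` comes out large and positive, mirroring the persistent Q-versus-A gap reported in the key observations.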
<details>
<summary>x87.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate vs. Layer for Llama Models
### Overview
The image presents two line charts, side-by-side, visualizing the "I-Don't-Know Rate" against the "Layer" number for two different Llama models: Llama-3.2-1B and Llama-3.2-3B. Each chart displays multiple lines representing different question-answering datasets and anchoring methods. The charts are designed to compare how the rate of the model failing to answer questions (I-Don't-Know Rate) changes as the model's layers increase.
### Components/Axes
* **X-axis:** "Layer" - Ranges from approximately 2 to 15 for the Llama-3.2-1B chart and from approximately 2 to 27 for the Llama-3.2-3B chart.
* **Y-axis:** "I-Don't-Know Rate" - Ranges from 0 to 80 for the Llama-3.2-1B chart and from 0 to 100 for the Llama-3.2-3B chart.
* **Legend:** Located at the bottom of the image, containing the following labels and corresponding line styles/colors:
* Q-Anchored (PopQA) - Solid Blue Line
* A-Anchored (PopQA) - Dashed Orange Line
* Q-Anchored (TriviaQA) - Solid Red Line
* A-Anchored (TriviaQA) - Dashed Green Line
* Q-Anchored (HotpotQA) - Dashed Blue Line
* A-Anchored (HotpotQA) - Dashed Orange Line
* Q-Anchored (NQ) - Solid Green Line
* A-Anchored (NQ) - Dashed Purple Line
* **Titles:**
* Left Chart: "Llama-3.2-1B"
* Right Chart: "Llama-3.2-3B"
### Detailed Analysis or Content Details
**Llama-3.2-1B Chart:**
* **Q-Anchored (PopQA):** The line starts at approximately 10 at Layer 2, peaks at approximately 80 at Layer 2.5, then declines to approximately 50 at Layer 15.
* **A-Anchored (PopQA):** The line starts at approximately 50 at Layer 2, fluctuates between approximately 50 and 70 until Layer 15.
* **Q-Anchored (TriviaQA):** The line starts at approximately 60 at Layer 2, peaks at approximately 75 at Layer 2.5, then declines to approximately 60 at Layer 15.
* **A-Anchored (TriviaQA):** The line starts at approximately 50 at Layer 2, fluctuates between approximately 50 and 65 until Layer 15.
* **Q-Anchored (HotpotQA):** The line starts at approximately 60 at Layer 2, fluctuates between approximately 50 and 70 until Layer 15.
* **A-Anchored (HotpotQA):** The line starts at approximately 50 at Layer 2, fluctuates between approximately 50 and 65 until Layer 15.
* **Q-Anchored (NQ):** The line starts at approximately 20 at Layer 2, increases to approximately 50 at Layer 7.5, then declines to approximately 30 at Layer 15.
* **A-Anchored (NQ):** The line starts at approximately 50 at Layer 2, fluctuates between approximately 40 and 60 until Layer 15.
**Llama-3.2-3B Chart:**
* **Q-Anchored (PopQA):** The line starts at approximately 80 at Layer 2, declines to approximately 20 at Layer 10, then fluctuates between approximately 20 and 40 until Layer 27.
* **A-Anchored (PopQA):** The line starts at approximately 60 at Layer 2, fluctuates between approximately 40 and 60 until Layer 27.
* **Q-Anchored (TriviaQA):** The line starts at approximately 70 at Layer 2, declines to approximately 40 at Layer 10, then fluctuates between approximately 40 and 60 until Layer 27.
* **A-Anchored (TriviaQA):** The line starts at approximately 50 at Layer 2, fluctuates between approximately 40 and 60 until Layer 27.
* **Q-Anchored (HotpotQA):** The line starts at approximately 70 at Layer 2, declines to approximately 40 at Layer 10, then fluctuates between approximately 40 and 60 until Layer 27.
* **A-Anchored (HotpotQA):** The line starts at approximately 50 at Layer 2, fluctuates between approximately 40 and 60 until Layer 27.
* **Q-Anchored (NQ):** The line starts at approximately 40 at Layer 2, declines to approximately 10 at Layer 10, then fluctuates between approximately 10 and 30 until Layer 27.
* **A-Anchored (NQ):** The line starts at approximately 50 at Layer 2, fluctuates between approximately 40 and 60 until Layer 27.
### Key Observations
* In both charts, the "Q-Anchored (PopQA)" line exhibits a significant initial drop in I-Don't-Know Rate as the layer number increases.
* The "A-Anchored" lines generally remain more stable than the "Q-Anchored" lines across all datasets.
* The Llama-3.2-3B model generally shows a lower I-Don't-Know Rate than the Llama-3.2-1B model, particularly after the initial layers.
* The I-Don't-Know Rate for the Llama-3.2-1B model appears to stabilize around 50-70 after Layer 7.5, while the Llama-3.2-3B model stabilizes around 40-60 after Layer 10.
### Interpretation
The charts illustrate the impact of model size (parameter count) and pathway on the probed ability to answer. The larger Llama-3.2-3B model consistently exhibits a lower I-Don't-Know Rate, indicating stronger knowledge and reasoning. The "Q-Anchored" pathway starts with a higher rate but improves across layers, suggesting that question understanding consolidates as information flows deeper into the network, whereas the "A-Anchored" pathway maintains a more stable rate, indicating a more consistent signal. The initial spike in the "Q-Anchored" lines may reflect that the earliest layers have not yet integrated the question, and the stabilization beyond a certain depth suggests diminishing returns from probing further layers. Differences across datasets (PopQA, TriviaQA, HotpotQA, NQ) likely reflect their varying difficulty and complexity.
</details>
<details>
<summary>x88.png Details</summary>

### Visual Description
Two line charts compare the i-don't-know rate across layers for Llama-3-8B (layers 0-30, left) and Llama-3-70B (layers 0-80, right). The y-axis is the i-don't-know rate (0-100); each chart plots eight series, Q-Anchored and A-Anchored, on PopQA, TriviaQA, HotpotQA, and NQ.
### Key Observations
* All series start high (roughly 85-95) and drop sharply over the early layers.
* Q-Anchored curves settle at low rates (roughly 10-50) and fluctuate there, while A-Anchored curves level off higher (roughly 45-75) and remain comparatively stable.
* The pattern holds across all four datasets and both model sizes.
</details>
<details>
<summary>x89.png Details</summary>

### Visual Description
Two line charts compare the i-don't-know rate across layers (0-30) for Mistral-7B-v0.1 (left) and Mistral-7B-v0.3 (right). The y-axis is the i-don't-know rate (0-100); each chart plots Q-Anchored and A-Anchored series on PopQA, TriviaQA, HotpotQA, and NQ.
### Key Observations
* All series start high (roughly 80-95) and decrease through roughly layer 10.
* Q-Anchored curves dip more sharply in the early layers; after layer 10 both pathways fluctuate, mostly between 40 and 80.
* Mistral-7B-v0.3 shows somewhat lower rates than v0.1 across most datasets.
</details>
Figure 34: Comparison of the i-don't-know rate between pathways, probing MLP activations of the token immediately preceding the exact answer tokens.
<details>
<summary>x90.png Details</summary>

### Visual Description
Two line charts compare the i-don't-know rate across layers for Llama-3.2-1B (layers 0-15, left) and Llama-3.2-3B (layers 0-25, right). The y-axis is the i-don't-know rate (0-100); each chart plots Q-Anchored and A-Anchored series on PopQA, TriviaQA, HotpotQA, and NQ.
### Key Observations
* All series drop sharply within the first five layers and then stabilize.
* Q-Anchored curves settle at low rates (roughly 10-50), while A-Anchored curves remain higher and more stable (roughly 40-60).
* The trends are closely similar for the two model sizes.
</details>
<details>
<summary>x91.png Details</summary>

### Visual Description
Two line charts compare the i-don't-know rate across layers for Llama-3-8B (layers 0-30, left) and Llama-3-70B (layers 0-80, right). The y-axis is the i-don't-know rate (0-100); each chart plots Q-Anchored and A-Anchored series on PopQA, TriviaQA, HotpotQA, and NQ.
### Key Observations
* All series decrease rapidly in the initial layers and then plateau.
* Q-Anchored curves consistently settle lower (roughly 10-40) than A-Anchored curves (roughly 30-70).
* The 70B model shows slightly lower rates overall than the 8B model, especially in later layers.
</details>
<details>
<summary>x92.png Details</summary>

### Visual Description
Two line charts compare the i-don't-know rate across layers (0-30) for Mistral-7B-v0.1 (left) and Mistral-7B-v0.3 (right). The y-axis is the i-don't-know rate (0-100); each chart plots Q-Anchored and A-Anchored series on PopQA, TriviaQA, HotpotQA, and NQ.
### Key Observations
* All series drop steeply from layer 0 to roughly layer 5 and then stabilize.
* Q-Anchored curves generally settle lower (roughly 10-50) than A-Anchored curves (roughly 30-70).
* Mistral-7B-v0.3 shows slightly lower rates than v0.1 across most datasets and anchoring methods.
</details>
Figure 35: Comparison of the i-don't-know rate between pathways, probing MLP activations of the last exact answer token.
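Both figure captions describe training per-layer probes on MLP activations at a fixed token position relative to the exact answer span. The following is a minimal sketch of that probing setup on fabricated data; the array shapes, the `answer_start` index, and the injected signal are all illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical cached MLP activations, one array per example,
# shaped [num_layers, seq_len, hidden]. Here we fabricate toy data.
num_examples, num_layers, seq_len, hidden = 200, 4, 12, 16
acts = rng.normal(size=(num_examples, num_layers, seq_len, hidden))
answer_start = np.full(num_examples, 8)          # index of the first exact-answer token
labels = rng.integers(0, 2, size=num_examples)   # 1 = truthful, 0 = hallucinated

# Inject a weak truthfulness signal at the probed position so the probe
# has something to find (a toy stand-in for a real model's internal cue).
for i in range(num_examples):
    acts[i, :, answer_start[i] - 1, 0] += 2.0 * labels[i]

# Probe the token immediately preceding the answer, one linear probe per layer.
aucs = []
for layer in range(num_layers):
    X = np.stack([acts[i, layer, answer_start[i] - 1] for i in range(num_examples)])
    train, test = slice(0, 150), slice(150, None)
    probe = LogisticRegression(max_iter=1000).fit(X[train], labels[train])
    scores = probe.predict_proba(X[test])[:, 1]
    aucs.append(roc_auc_score(labels[test], scores))

print([round(a, 2) for a in aucs])  # per-layer AUCs, well above chance given the injected signal
```

Sweeping `answer_start[i] - 1` to the last answer token instead reproduces the position probed in Figure 35.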
Appendix H Pathway-Aware Detection
| Method | PopQA (1B) | TriviaQA (1B) | HotpotQA (1B) | NQ (1B) | PopQA (3B) | TriviaQA (3B) | HotpotQA (3B) | NQ (3B) |
|---|---|---|---|---|---|---|---|---|
| P(True) | 60.00 | 49.65 | 43.34 | 52.83 | 54.58 | 51.76 | 47.73 | 53.78 |
| Logits-mean | 74.89 | 60.24 | 60.18 | 49.92 | 73.47 | 63.46 | 60.35 | 54.89 |
| Logits-max | 58.56 | 52.37 | 52.29 | 46.19 | 56.03 | 54.33 | 48.65 | 48.88 |
| Logits-min | 78.66 | 62.37 | 67.14 | 51.20 | 80.92 | 69.60 | 71.11 | 58.24 |
| Scores-mean | 72.91 | 61.13 | 62.16 | 64.67 | 67.99 | 61.96 | 64.91 | 61.71 |
| Scores-max | 69.33 | 59.74 | 61.29 | 64.08 | 63.34 | 61.92 | 61.09 | 57.56 |
| Scores-min | 64.84 | 55.93 | 59.28 | 55.81 | 61.51 | 56.76 | 63.95 | 57.43 |
| Probing Baseline | 94.25 | 77.17 | 90.25 | 74.83 | 90.96 | 76.61 | 86.54 | 74.20 |
| **MoP-RandomGate** | 83.69 | 69.20 | 84.11 | 68.76 | 79.69 | 72.38 | 75.13 | 67.11 |
| **MoP-VanillaExperts** | 93.86 | 78.63 | 90.91 | 75.73 | 90.98 | 77.68 | 86.41 | 75.30 |
| **MoP** | 95.85 | 80.07 | 91.51 | 79.19 | 92.74 | 78.72 | 88.16 | 78.14 |
| **PR** | 96.18 | 84.22 | 92.80 | 86.45 | 95.70 | 80.66 | 90.66 | 81.91 |
Table 8: Comparison of hallucination detection performance (AUC) on Llama-3.2-1B and Llama-3.2-3B. Proposed methods are shown in bold.
| Method | PopQA (8B) | TriviaQA (8B) | HotpotQA (8B) | NQ (8B) | PopQA (70B) | TriviaQA (70B) | HotpotQA (70B) | NQ (70B) |
|---|---|---|---|---|---|---|---|---|
| P(True) | 55.85 | 49.92 | 52.14 | 53.27 | 54.83 | 50.96 | 49.39 | 51.18 |
| Logits-mean | 74.52 | 60.39 | 51.94 | 52.63 | 67.81 | 52.40 | 50.45 | 48.28 |
| Logits-max | 58.08 | 52.20 | 46.40 | 47.89 | 56.21 | 48.16 | 43.42 | 45.33 |
| Logits-min | 85.36 | 70.89 | 61.28 | 56.50 | 79.96 | 61.53 | 62.63 | 52.16 |
| Scores-mean | 62.87 | 62.09 | 62.06 | 60.32 | 56.81 | 60.70 | 60.91 | 58.05 |
| Scores-max | 56.62 | 60.24 | 59.85 | 56.06 | 55.15 | 59.60 | 57.32 | 51.93 |
| Scores-min | 60.99 | 58.27 | 60.33 | 57.68 | 58.77 | 58.22 | 64.06 | 58.05 |
| Probing Baseline | 88.71 | 77.58 | 82.23 | 70.20 | 86.88 | 81.59 | 84.45 | 74.39 |
| **MoP-RandomGate** | 75.52 | 69.17 | 79.88 | 66.56 | 67.96 | 70.56 | 72.16 | 66.28 |
| **MoP-VanillaExperts** | 89.11 | 78.73 | 84.57 | 71.21 | 86.04 | 82.47 | 82.48 | 73.85 |
| **MoP** | 92.11 | 81.18 | 85.45 | 74.64 | 88.54 | 84.12 | 86.65 | 76.12 |
| **PR** | 94.01 | 83.13 | 87.81 | 79.10 | 90.08 | 84.21 | 87.69 | 78.24 |
Table 9: Comparison of hallucination detection performance (AUC) on Llama-3-8B and Llama-3-70B. Proposed methods are shown in bold.
| Method | PopQA (v0.1) | TriviaQA (v0.1) | HotpotQA (v0.1) | NQ (v0.1) | PopQA (v0.3) | TriviaQA (v0.3) | HotpotQA (v0.3) | NQ (v0.3) |
|---|---|---|---|---|---|---|---|---|
| P(True) | 48.78 | 50.43 | 51.94 | 55.52 | 45.49 | 47.61 | 57.87 | 52.79 |
| Logits-mean | 69.09 | 64.95 | 54.47 | 59.41 | 69.52 | 66.76 | 55.45 | 57.88 |
| Logits-max | 54.37 | 54.76 | 46.74 | 56.45 | 54.34 | 55.24 | 48.39 | 54.37 |
| Logits-min | 86.02 | 76.56 | 68.06 | 53.73 | 87.05 | 77.33 | 68.08 | 54.40 |
| Scores-mean | 59.00 | 59.61 | 64.18 | 57.60 | 58.84 | 60.22 | 63.28 | 60.05 |
| Scores-max | 51.71 | 56.58 | 63.29 | 55.82 | 53.00 | 55.55 | 63.13 | 57.73 |
| Scores-min | 60.00 | 57.48 | 61.17 | 48.51 | 60.59 | 57.84 | 59.85 | 50.76 |
| Probing Baseline | 89.61 | 78.43 | 83.76 | 74.10 | 87.39 | 81.74 | 83.19 | 73.60 |
| **MoP-RandomGate** | 80.50 | 68.27 | 74.51 | 68.05 | 79.81 | 70.88 | 72.23 | 61.19 |
| **MoP-VanillaExperts** | 89.82 | 79.51 | 83.54 | 74.78 | 88.53 | 80.93 | 82.93 | 73.77 |
| **MoP** | 92.44 | 84.03 | 84.63 | 76.38 | 91.66 | 83.57 | 85.82 | 76.87 |
| **PR** | 94.72 | 84.66 | 89.04 | 80.92 | 93.09 | 84.36 | 89.03 | 79.09 |
Table 10: Comparison of hallucination detection performance (AUC) on Mistral-7B-v0.1 and Mistral-7B-v0.3.
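The Logits-mean/max/min rows in the tables above reduce per-token statistics of the generated answer to a single score, which is then evaluated as a detector via ROC-AUC against truthfulness labels. A minimal sketch of that evaluation on synthetic data; `token_logprobs` and the score names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical per-example log-probabilities of the generated answer tokens,
# plus gold truthfulness labels (1 = truthful, 0 = hallucinated).
n = 300
labels = rng.integers(0, 2, size=n)
token_logprobs = [
    rng.normal(loc=-1.0 + 0.8 * y, scale=0.5, size=rng.integers(3, 10))
    for y in labels
]

# Logits-mean / -max / -min style baselines: aggregate the token statistics.
scores = {
    "mean": np.array([lp.mean() for lp in token_logprobs]),
    "max":  np.array([lp.max() for lp in token_logprobs]),
    "min":  np.array([lp.min() for lp in token_logprobs]),
}

# Each aggregate is scored as a hallucination detector via ROC-AUC.
aucs = {name: roc_auc_score(labels, s) for name, s in scores.items()}
print({name: round(v, 2) for name, v in aucs.items()})
```

Probing-based detectors are evaluated the same way, with the probe's predicted truthfulness probability taking the place of the aggregated log-probability score.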