# Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
Abstract
Despite their impressive capabilities, large language models (LLMs) frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question–answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. We first validate and disentangle these pathways through attention knockout and token patching, and then characterize the properties of the two mechanisms. Further experiments reveal that (1) the two mechanisms are closely associated with LLM knowledge boundaries; and (2) internal representations are aware of their distinctions. Finally, building on these findings, we propose two applications that enhance hallucination detection performance. Overall, our work provides new insight into how LLMs internally encode truthfulness, offering directions for more reliable and self-aware generative systems.
Wen Luo $\heartsuit$, Guangyue Peng $\heartsuit$, Wei Li $\heartsuit$, Shaohang Wei $\heartsuit$, Feifan Song $\heartsuit$, Liang Wang ♠, Nan Yang ♠, Xingxing Zhang ♠, Jing Jin $\heartsuit$, Furu Wei ♠, Houfeng Wang $\heartsuit$ (corresponding author)
$\heartsuit$ State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
♠ Microsoft Research Asia
1 Introduction
Despite their remarkable capabilities in natural language understanding and generation, large language models (LLMs) often produce hallucinations: outputs that appear plausible but are factually incorrect. This phenomenon poses a critical challenge for deploying LLMs in real-world applications where reliability and trustworthiness are paramount (Shi et al., 2024; Bai et al., 2024). One line of research tackles hallucination detection from an extrinsic perspective (Min et al., 2023; Hu et al., 2025; Huang et al., 2025), evaluating only the model’s outputs while disregarding its internal dynamics. Although such approaches can identify surface-level textual inconsistencies, their extrinsic focus limits the insight they offer into the underlying causes of hallucinations. Complementing these efforts, another line of work investigates the intrinsic properties of LLMs, revealing that their internal representations encode rich truthfulness signals (Burns et al., 2023; Li et al., 2023; Chen et al., 2024; Orgad et al., 2025; Niu et al., 2025). These internal truthfulness signals can be exploited to detect an LLM’s own hallucinations by training a linear classifier (i.e., a probe) on its hidden representations. However, while prior work establishes the presence of such cues, the mechanisms by which they arise and operate remain largely unexplored. Recent studies have identified well-defined mechanisms in LLMs that underpin complex capabilities such as in-context learning (Wang et al., 2023), long-context retrieval (Wu et al., 2025), and reasoning (Qian et al., 2025). This observation naturally leads to a key question: how do truthfulness cues arise and function within LLMs?
In this paper, we uncover that truthfulness signals in LLMs arise from two distinct information pathways: (1) a Question-Anchored (Q-Anchored) pathway, which depends on the flow of information from the input question to the generated answer, and (2) an Answer-Anchored (A-Anchored) pathway, which derives self-contained evidence directly from the model’s own outputs. We begin with a preliminary study using saliency analysis to quantify information flow potentially relevant to hallucination detection. Results reveal a bimodal distribution of dependency on question–answer interactions, suggesting heterogeneous truthfulness encoding mechanisms. To validate this hypothesis, we design two experiments across 4 diverse datasets using 12 models that vary in both architecture and scale, including base, instruction-tuned, and reasoning-oriented models. By (i) blocking critical question–answer information flow through attention knockout (Geva et al., 2023; Fierro et al., 2025) and (ii) injecting hallucinatory cues into questions via token patching (Ghandeharioun et al., 2024; Todd et al., 2024), we disentangle these truthfulness pathways. Our analyses confirm that Q-Anchored signals rely heavily on question-derived cues, whereas A-Anchored signals are robust to their removal and primarily originate from the generated answer itself.
Building on this foundation, we further investigate emergent properties of these truthfulness pathways through large-scale experiments. Our findings highlight two characteristics: (1) Association with knowledge boundaries: Q-Anchored encoding predominates for well-established facts that fall within the knowledge boundary, whereas A-Anchored encoding is favored in long-tail cases. (2) Self-awareness: LLM internal states can distinguish which mechanism is being employed, suggesting intrinsic awareness of pathway distinctions.
Finally, these analyses not only deepen our mechanistic understanding of hallucinations but also enable practical applications. Specifically, by leveraging the fundamentally different dependencies of the truthfulness pathways and the model’s intrinsic awareness, we propose two pathway-aware strategies to enhance hallucination detection. (1) Mixture-of-Probes (MoP): Motivated by the specialization of internal pathways, MoP employs a set of expert probing classifiers, each tailored to capture a distinct truthfulness encoding mechanism. (2) Pathway Reweighting (PR): PR selectively emphasizes pathway-relevant internal cues, modulating information intensity to amplify the signals that are most informative for hallucination detection and aligning internal activations with pathway-specific evidence. Experiments demonstrate that our proposed methods consistently outperform competing approaches, achieving up to a 10% AUC gain across various datasets and models.
Overall, our key contributions are summarized as follows:
- (Mechanism) We conduct a systematic investigation into how internal truthfulness signals emerge and operate within LLMs, revealing two distinct information pathways: a Question-Anchored pathway that relies on question–answer information flow, and an Answer-Anchored pathway that derives self-contained evidence from the generated output.
- (Discovery) Through large-scale experiments across multiple datasets and model families, we identify two key properties of these mechanisms: (i) association with knowledge boundaries, and (ii) intrinsic self-awareness of pathway distinctions.
- (Application) Building on these findings, we propose two pathway-aware detection methods that exploit the complementary nature of the two mechanisms to enhance hallucination detection, providing new insights for building more reliable generative systems.
2 Background
2.1 Hallucination Detection
Given an LLM $f$, we denote the dataset as $D=\{(q_{i},\hat{y}^{f}_{i},z^{f}_{i})\}_{i=1}^{N}$, where $q_{i}$ is the question, $\hat{y}^{f}_{i}$ the model’s answer in open-ended generation, and $z^{f}_{i}\in\{0,1\}$ indicates whether the answer is hallucinatory. The task is to predict $z^{f}_{i}$ given the input $x^{f}_{i}=[q_{i},\hat{y}^{f}_{i}]$ for each instance. Cases in which the model refuses to answer are excluded, as they are not genuine hallucinations and can be trivially classified. Methods based on internal signals assume access to the model’s hidden representations but no external resources (e.g., retrieval systems or fact-checking APIs) (Xue et al., 2025a). Within this paradigm, probing trains a lightweight linear classifier on hidden activations to discriminate between hallucinatory and factual outputs, and has been shown to be among the most effective approaches in this class of internal-signal-based methods (Orgad et al., 2025).
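To make the probing setup concrete, the following is a minimal sketch of a linear probe trained with binary cross-entropy. It is not the paper's implementation: the hidden activations are replaced by synthetic 4-dimensional vectors, and the names `train_probe` and `probe_predict`, the learning rate, and the epoch count are all illustrative assumptions.

```python
import math
import random


def sigmoid(t: float) -> float:
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_probe(H, z, lr=0.5, epochs=200):
    """Fit a linear probe by full-batch gradient descent on BCE loss.

    H: list of hidden-state vectors; z: list of 0/1 labels (1 = hallucinatory).
    """
    n, d = len(H), len(H[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for h, y in zip(H, z):
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            err = p - y  # gradient of the BCE loss w.r.t. the logit
            for k in range(d):
                gw[k] += err * h[k] / n
            gb += err / n
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b


def probe_predict(w, b, h) -> float:
    # Probability that the answer behind hidden state h is hallucinatory.
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


# Synthetic stand-in for hidden activations (assumption: real ones come from the LLM).
rng = random.Random(0)
H, z = [], []
for _ in range(100):
    y = rng.randint(0, 1)
    center = 1.0 if y == 1 else -1.0
    H.append([center + rng.gauss(0.0, 0.3) for _ in range(4)])
    z.append(y)

w, b = train_probe(H, z)
acc = sum((probe_predict(w, b, h) > 0.5) == bool(y) for h, y in zip(H, z)) / len(H)
```

In practice the rows of `H` would be activations extracted at a chosen layer and token position, and accuracy would be measured on held-out data rather than on the training set as done here for brevity.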
2.2 Exact Question and Answer Tokens
To analyze the origins and mechanisms of truthfulness signals in LLMs, we primarily focus on exact tokens in question–answer pairs. Not all tokens contribute equally to detecting factual errors: some carry core information essential to the meaning of the question or answer, while others provide peripheral details. We draw on semantic frame theory (Baker et al., 1998; Pagnoni et al., 2021), which represents a situation or event along with its participants and their roles. In the theory, frame elements are categorized as: (1) Core frame elements, which define the situation itself, and (2) Non-core elements, which provide additional, non-essential context.
As shown in Table 1, we define: (1) Exact question tokens: core frame elements in the question, typically including the exact subject and property tokens (e.g., South Carolina and capital). (2) Exact answer tokens: core frame elements in the answer that convey the critical information required to respond correctly (e.g., Columbia). Humans tend to rely more on core elements when detecting errors, as these tokens carry the most precise information. Consistent with this intuition, recent work (Orgad et al., 2025) shows that probing activations on the exact answer tokens offers the strongest signal for hallucination detection, outperforming all other token choices. Motivated by these findings, our analysis mainly centers on exact tokens to probe truthfulness signals in LLMs. Moreover, to validate the robustness of our conclusions, we also conduct comprehensive experiments using alternative, non-exact-token configurations (see Appendix B.2).
| Question: What is the capital of South Carolina? |
| --- |
| Answer: It is Columbia, a hub for government, culture, and education that houses the South Carolina State House and the University of South Carolina. |
Table 1: Example of exact question and answer tokens. Token types: exact property (capital), exact subject (South Carolina), and exact answer (Columbia).
3 Two Internal Truthfulness Pathways
We begin with a preliminary analysis using metrics based on saliency scores (§ 3.1). The quantitative results suggest two distinct information pathways for truthfulness encoding: (1) a Question-Anchored (Q-Anchored) Pathway, which relies heavily on information flowing from the exact question tokens, and (2) an Answer-Anchored (A-Anchored) Pathway, in which the truthfulness signal is largely independent of the question-to-answer information flow. Section 3.2 presents experiments validating this hypothesis. In particular, we show that the Q-Anchored Pathway depends critically on information flowing from the question to the answer, whereas the signals along the A-Anchored Pathway are primarily derived from the LLM-generated answer itself.
3.1 Saliency-Driven Preliminary Study
This section investigates the intrinsic characteristics of LLM attention interactions and their potential role in truthfulness encoding. We employ saliency analysis (Simonyan et al., 2014), a widely used interpretability method, to reveal how attention among tokens influences probe decisions. Following common practice (Michel et al., 2019; Wang et al., 2023), we compute the saliency score as:
$$
S^{l}(i,j)=\left|A^{l}(i,j)\frac{\partial\mathcal{L}(x)}{\partial A^{l}(i,j)}\right|, \tag{1}
$$
where $S^{l}$ denotes the saliency score matrix of the $l$-th layer, $A^{l}$ represents the attention weights of that layer, and $\mathcal{L}$ is the loss function for hallucination detection (i.e., the binary cross-entropy loss). Scores are averaged over all attention heads within each layer. In particular, $S^{l}(i,j)$ quantifies the saliency of attention from query $i$ to key $j$, capturing how strongly the information flow from $j$ to $i$ contributes to the detection. We study two types of information flow: (1) $S_{E_{Q}\to E_{A}}$, the saliency of direct information flow from the exact question tokens to the exact answer tokens, and (2) $S_{E_{Q}\to *}$, the saliency of the total information disseminated by the exact question tokens.
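As an illustration of Eq. (1), the sketch below computes a head-averaged saliency matrix and the two flow quantities for one layer. Several points are assumptions rather than details taken from the paper: the gradients $\partial\mathcal{L}/\partial A^{l}$ are taken as given (they would come from automatic differentiation), the absolute value is applied per head before averaging, entries are aggregated by their mean, and attention is causal. The function names and toy numbers are hypothetical.

```python
def saliency_matrix(attn_heads, grad_heads):
    """Head-averaged |A * dL/dA| for one layer.

    attn_heads, grad_heads: nested lists indexed as [head][query i][key j].
    """
    n_heads = len(attn_heads)
    n = len(attn_heads[0])
    S = [[0.0] * n for _ in range(n)]
    for A, G in zip(attn_heads, grad_heads):
        for i in range(n):
            for j in range(n):
                S[i][j] += abs(A[i][j] * G[i][j]) / n_heads
    return S


def flow_saliency(S, sources, targets):
    """Mean saliency of attention from query positions in `targets`
    to key positions in `sources`, keeping causal pairs (j <= i) only."""
    vals = [S[i][j] for i in targets for j in sources if j <= i]
    return sum(vals) / len(vals) if vals else 0.0


# Toy layer with 2 heads and 3 positions; all numbers are made up.
A = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
attn_heads = [A, A]
grad_heads = [
    [[0.0, 0.0, 0.0], [2.0, -2.0, 0.0], [1.0, -1.0, 4.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [-1.0, 1.0, 0.0]],
]
S = saliency_matrix(attn_heads, grad_heads)

exact_q, exact_a = [0], [2]  # hypothetical positions of exact question/answer tokens
s_q_to_a = flow_saliency(S, exact_q, exact_a)            # S_{E_Q -> E_A}
s_q_to_all = flow_saliency(S, exact_q, list(range(3)))   # S_{E_Q -> *}
```

Here `s_q_to_a` reads a single entry of `S`, while `s_q_to_all` averages over every position that can attend to the exact question token.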
Results
<details>
<summary>x1.png Details</summary>

### Visual Description
## Line Charts: Llama-3-8B and Llama-3-70B Saliency Score Distributions
### Overview
Two side-by-side density plots compare saliency score distributions for two Llama-3 model variants (8B and 70B parameters). Each chart shows four overlapping density curves representing different evaluation scenarios, with distinct color coding for TriviaQA and NQ datasets.
### Components/Axes
- **X-axis (Saliency Score)**:
- Llama-3-8B: 0.0–1.5
- Llama-3-70B: 0.0–0.2
- **Y-axis (Density)**:
- Llama-3-8B: 0.00–0.75
- Llama-3-70B: 0–4
- **Legend** (bottom center):
- Blue: S_Eq → E_A (TriviaQA)
- Green: S_Eq → E_A (NQ)
- Orange: S_Eq → * (TriviaQA)
- Red: S_Eq → * (NQ)
### Detailed Analysis
**Llama-3-8B Chart**:
- **Blue (TriviaQA)**: Peaks at ~0.5 with density ~0.6, tapering to ~0.1 at 1.5
- **Green (NQ)**: Peaks at ~0.4 with density ~0.7, broader spread than blue
- **Orange (TriviaQA*)**: Peaks at ~0.6 with density ~0.5, flatter curve
- **Red (NQ*)**: Peaks at ~0.7 with density ~0.4, most right-shifted distribution
**Llama-3-70B Chart**:
- **Blue (TriviaQA)**: Peaks at ~0.05 with density ~3.5, sharp drop-off
- **Green (NQ)**: Peaks at ~0.03 with density ~4, narrowest distribution
- **Orange (TriviaQA*)**: Peaks at ~0.07 with density ~2.5, wider than blue
- **Red (NQ*)**: Peaks at ~0.09 with density ~2, most right-shifted in 70B model
### Key Observations
1. **Scale Differences**: 70B model's saliency scores are compressed (0–0.2 vs 0–1.5), suggesting different normalization or measurement scales.
2. **Peak Density**: The 8B model shows lower maximum densities (up to ~0.7 vs ~4 in the 70B model), though the two panels use different axis scales.
3. **Distribution Shapes**:
- 8B model shows broader, more varied distributions
- 70B model exhibits sharper, more concentrated peaks
4. **Dataset Performance**:
- TriviaQA (blue/orange) consistently shows higher saliency scores than NQ (green/red)
- NQ* (red) in 8B model has the highest saliency scores (~0.7)
### Interpretation
The charts demonstrate that:
1. **Model Size Impact**: The 70B model shows more focused attention (narrower distributions) compared to 8B, but with lower absolute saliency values.
2. **Dataset Characteristics**: TriviaQA consistently elicits higher saliency scores than NQ across both models, suggesting different cognitive demands.
3. **Performance Tradeoffs**: While 70B has more precise attention (sharper peaks), 8B maintains broader coverage of saliency scores, potentially indicating better generalization.
4. **Normalization Concerns**: The stark difference in x-axis ranges between models suggests potential methodological differences in saliency score calculation or scaling between model sizes.
The data implies that larger models develop more specialized attention patterns, but smaller models may retain broader contextual awareness. The consistent TriviaQA > NQ pattern across models suggests dataset-specific cognitive processing differences.
</details>
Figure 1: Kernel density estimates of saliency‐score distributions for critical question-to-answer information flows. The bimodal pattern suggests two distinct information mechanisms.
We present kernel density estimates of the saliency scores on the TriviaQA (Joshi et al., 2017) and Natural Questions (Kwiatkowski et al., 2019) datasets. As shown in Figure 1, the probability densities reveal a clear bimodal distribution: for all examined information types originating from the question, the probability mass concentrates around two peaks, one near zero saliency and another at a substantially higher value. The near-zero peak suggests that, for a substantial subset of samples, the question-to-answer information flow contributes minimally to hallucination detection, whereas the higher peak reflects strong dependence on such flow.
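The bimodality check described above can be sketched as follows. This is a minimal illustration that assumes a Gaussian kernel with a hand-picked bandwidth; the paper does not state its kernel or bandwidth, and the saliency scores below are made up.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def count_modes(density, lo, hi, num=201):
    """Count local maxima of the density on an evenly spaced grid over [lo, hi]."""
    step = (hi - lo) / (num - 1)
    d = [density(lo + k * step) for k in range(num)]
    return sum(
        1 for k in range(1, num - 1) if d[k] > d[k - 1] and d[k] >= d[k + 1]
    )


# Made-up saliency scores: one cluster near zero, one at a higher value.
scores = [0.0, 0.02, 0.04, 0.05, 0.03, 0.9, 0.95, 1.0, 1.05, 1.1]
kde = gaussian_kde(scores, bandwidth=0.1)
n_modes = count_modes(kde, -0.5, 1.5)
```

The number of detected peaks depends on the bandwidth: a much larger value would merge the two clusters into one mode, so any such check should be read alongside the plotted density.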
<details>
<summary>x2.png Details</summary>

### Visual Description
## Line Chart: ΔP Across Layers for Different Models and Anchoring Methods
### Overview
The image displays three line charts showing ΔP across layers for three language models: Llama-3-8B, Llama-3-70B, and Mistral-7B-v0.3. Each chart includes multiple data series representing the two anchoring modes (Q-Anchored and A-Anchored) on four datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis represents ΔP (ranging from -80 to 0), and the x-axis represents layers (0 to 30 or 0 to 80, depending on the model). The charts show trends in ΔP values as layers increase, with distinct patterns for each mode and dataset.
---
### Components/Axes
- **X-axis (Layer)**:
- Llama-3-8B: 0 to 30 (increments of 10)
- Llama-3-70B: 0 to 80 (increments of 20)
- Mistral-7B-v0.3: 0 to 30 (increments of 10)
- **Y-axis (ΔP)**:
- Range: -80 to 0 (increments of 20)
- Labels: "ΔP" (delta P)
- **Legends**:
- **Llama-3-8B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: A-Anchored (HotpotQA)
- Solid gray: Q-Anchored (NQ)
- Dashed brown: A-Anchored (NQ)
- **Llama-3-70B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: A-Anchored (HotpotQA)
- Solid gray: Q-Anchored (NQ)
- Dashed brown: A-Anchored (NQ)
- **Mistral-7B-v0.3**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: A-Anchored (HotpotQA)
- Solid gray: Q-Anchored (NQ)
- Dashed brown: A-Anchored (NQ)
---
### Detailed Analysis
#### Llama-3-8B Panel
- **Q-Anchored (PopQA)**: Solid blue line starts near 0 and declines sharply to ~-80 by layer 30, with minor fluctuations.
- **A-Anchored (PopQA)**: Dashed orange line remains near 0 throughout, showing minimal change.
- **Q-Anchored (TriviaQA)**: Solid green line declines gradually to ~-60 by layer 30.
- **A-Anchored (TriviaQA)**: Dashed red line shows a slight decline to ~-40 by layer 30.
- **Q-Anchored (HotpotQA)**: Solid purple line declines to ~-70 by layer 30.
- **A-Anchored (HotpotQA)**: Dashed pink line declines to ~-50 by layer 30.
- **Q-Anchored (NQ)**: Solid gray line declines to ~-75 by layer 30.
- **A-Anchored (NQ)**: Dashed brown line declines to ~-60 by layer 30.
#### Llama-3-70B Panel
- **Q-Anchored (PopQA)**: Solid blue line starts near 0 and declines to ~-80 by layer 80, with oscillations.
- **A-Anchored (PopQA)**: Dashed orange line remains near 0, showing no significant change.
- **Q-Anchored (TriviaQA)**: Solid green line declines to ~-60 by layer 80.
- **A-Anchored (TriviaQA)**: Dashed red line declines to ~-40 by layer 80.
- **Q-Anchored (HotpotQA)**: Solid purple line declines to ~-70 by layer 80.
- **A-Anchored (HotpotQA)**: Dashed pink line declines to ~-50 by layer 80.
- **Q-Anchored (NQ)**: Solid gray line declines to ~-75 by layer 80.
- **A-Anchored (NQ)**: Dashed brown line declines to ~-60 by layer 80.
#### Mistral-7B-v0.3 Panel
- **Q-Anchored (PopQA)**: Solid blue line starts near 0 and declines to ~-80 by layer 30.
- **A-Anchored (PopQA)**: Dashed orange line remains near 0.
- **Q-Anchored (TriviaQA)**: Solid green line declines to ~-60 by layer 30.
- **A-Anchored (TriviaQA)**: Dashed red line declines to ~-40 by layer 30.
- **Q-Anchored (HotpotQA)**: Solid purple line declines to ~-70 by layer 30.
- **A-Anchored (HotpotQA)**: Dashed pink line declines to ~-50 by layer 30.
- **Q-Anchored (NQ)**: Solid gray line declines to ~-75 by layer 30.
- **A-Anchored (NQ)**: Dashed brown line declines to ~-60 by layer 30.
---
### Key Observations
1. **Q-Anchored vs. A-Anchored**:
- Q-Anchored methods (solid lines) consistently show steeper declines in ΔP compared to A-Anchored methods (dashed lines) across all models and datasets.
- A-Anchored methods (dashed lines) exhibit minimal or no change in ΔP, remaining close to 0.
2. **Dataset-Specific Trends**:
- **PopQA**: Q-Anchored methods show the most significant ΔP decline, while A-Anchored methods remain stable.
- **TriviaQA**: Q-Anchored methods decline moderately, while A-Anchored methods show slight declines.
- **HotpotQA**: Q-Anchored methods decline sharply, while A-Anchored methods show moderate declines.
- **NQ**: Q-Anchored methods decline steeply, while A-Anchored methods show moderate declines.
3. **Model-Specific Variations**:
- **Llama-3-8B**: All Q-Anchored methods show steep declines, with PopQA and NQ having the most pronounced drops.
- **Llama-3-70B**: Similar trends to Llama-3-8B, but with more oscillations in Q-Anchored lines.
- **Mistral-7B-v0.3**: Q-Anchored methods show steep declines, while A-Anchored methods remain stable.
4. **Fluctuations**:
- Some lines (e.g., Q-Anchored (TriviaQA) in Llama-3-70B) exhibit oscillations, suggesting variability in ΔP across layers.
---
### Interpretation
The data suggests that **Q-Anchored methods** (solid lines) are more sensitive to layer changes, resulting in larger ΔP declines compared to **A-Anchored methods** (dashed lines), which remain relatively stable. This implies that Q-Anchored approaches may be more effective or impactful in certain contexts, depending on the dataset.
- **Dataset Influence**:
- PopQA and NQ datasets show the most significant ΔP declines for Q-Anchored methods, indicating these datasets may be more challenging or require greater adjustments across layers.
- TriviaQA and HotpotQA datasets exhibit moderate declines, suggesting they are less sensitive to anchoring methods.
- **Model Size**:
- Llama-3-70B (larger model) shows more oscillations in Q-Anchored lines, possibly due to increased complexity or parameter interactions.
- Mistral-7B-v0.3 (smaller model) exhibits smoother trends, suggesting simpler layer dynamics.
- **Anomalies**:
- The A-Anchored (PopQA) lines in all panels remain nearly flat, indicating minimal impact of anchoring on ΔP for this dataset.
- Oscillations in Q-Anchored lines (e.g., Llama-3-70B) may reflect model-specific architectural or training characteristics.
This analysis highlights the importance of anchoring methods and dataset selection in shaping ΔP trends, which could inform model optimization or evaluation strategies.
</details>
Figure 2: $\Delta\mathrm{P}$ under attention knockout. The layer axis indicates the Transformer layer on which the probe is trained. Shaded regions indicate 95% confidence intervals. Full results in Appendix C.
Hypothesis
These observations lead to the hypothesis that there are two distinct mechanisms of internal truthfulness encoding for hallucination detection: (1) one characterized by strong reliance on the key question-to-answer information from the exact question tokens, and (2) one in which truthfulness encoding is largely independent of the question. We validate the proposed hypothesis through further experiments in the next section.
3.2 Disentangling Information Mechanisms
We hypothesize that the internal truthfulness encoding operates through two distinct information flow mechanisms, driven by the attention modules within Transformer blocks. To validate the hypothesis, we first block information flows associated with the exact question tokens and analyze the resulting changes in the probe’s predictions. Subsequently, we apply a complementary technique, called token patching, to further substantiate the existence of these two mechanisms. Finally, we demonstrate that the self-contained information from the LLM-generated answer itself drives the truthfulness encoding for the A-Anchored type.
3.2.1 Experimental Setup
Our analysis covers a diverse collection of 12 LLMs that vary in both scale and architectural design. Specifically, we consider three categories: (1) base models, including Llama-3.2-1B (Grattafiori et al., 2024), Llama-3.2-3B, Llama-3-8B, Llama-3-70B, Mistral-7B-v0.1 (Jiang et al., 2023), and Mistral-7B-v0.3; (2) instruction-tuned models, including Llama-3.2-3B-Instruct, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Mistral-7B-Instruct-v0.3; and (3) reasoning-oriented models, namely Qwen3-8B (Yang et al., 2025) and Qwen3-32B. We conduct experiments on 4 widely used question-answering datasets: PopQA (Mallen et al., 2023), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). Additional implementation details are provided in Appendix B.
3.2.2 Identifying Anchored Modes via Attention Knockout
Experiment
To investigate whether internal truthfulness encoding operates via distinct information mechanisms, we perform an attention knockout experiment targeting the exact question tokens. Specifically, for a probe trained on representations from the $k$-th layer, we set $A^{l}(i,E_{Q})=0$ for layers $l\in\{1,\ldots,k\}$ and positions $i>E_{Q}$, i.e., positions after the exact question tokens. This procedure blocks the information flow from the exact question tokens to subsequent positions in the representation. We then examine how the probe’s predictions respond to this intervention. To provide a clearer picture, instances are categorized according to whether their prediction $\hat{z}$ changes after the attention knockout:
$$
\text{Mode}(x)=\begin{cases}\text{Q-Anchored},&\text{if }\hat{z}\neq\tilde{\hat{z}}\\
\text{A-Anchored},&\text{otherwise}\end{cases} \tag{2}
$$
where $\hat{z}$ and $\tilde{\hat{z}}$ denote predictions before and after the attention knockout, respectively.
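The knockout and the categorization in Eq. (2) can be sketched on a single attention matrix as follows. This is an illustration under stated assumptions: it zeroes post-softmax weights without renormalizing, which is a literal reading of $A^{l}(i,E_{Q})=0$ (masking the logits before the softmax is another common choice), and it treats "positions after the exact question tokens" as positions after the last exact question token. The function names and the toy matrix are hypothetical.

```python
def knockout(attn, exact_q):
    """Return a copy of attn ([query i][key j]) in which every position after
    the last exact question token no longer attends to exact question tokens."""
    last_q = max(exact_q)
    out = [row[:] for row in attn]  # copy so the original weights are kept
    for i in range(last_q + 1, len(attn)):
        for j in exact_q:
            out[i][j] = 0.0
    return out


def mode(z_before: int, z_after: int) -> str:
    """Eq. (2): a sample is Q-Anchored if the probe's prediction flips."""
    return "Q-Anchored" if z_before != z_after else "A-Anchored"


# Toy causal attention over 4 positions; position 1 is the exact question token.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
blocked = knockout(attn, exact_q=[1])

# Hypothetical probe predictions before and after the knockout.
labels = [mode(1, 0), mode(1, 1)]
```

In the full procedure this masking would be applied in every layer from 1 to $k$ during the forward pass, and the probe would then be re-evaluated on the resulting layer-$k$ representation.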
<details>
<summary>x3.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison Across Models and Datasets
### Overview
The image presents three grouped bar charts comparing prediction flip rates for three language models (Llama-3-8B, Llama-3-70B, Mistral-7B-v0.3) across four datasets (PopQA, TriviaQA, HotpotQA, NQ). Each dataset group contains four bars representing different anchoring strategies: Q-Anchored (exact_question), Q-Anchored (random), A-Anchored (exact_question), and A-Anchored (random). The y-axis measures prediction flip rate (0-80%), and the x-axis lists datasets.
### Components/Axes
- **X-Axis**: Datasets (PopQA, TriviaQA, HotpotQA, NQ)
- **Y-Axis**: Prediction Flip Rate (%) with scale 0-80
- **Legend**: Located at the bottom, color-coded as:
- Red: Q-Anchored (exact_question)
- Maroon: Q-Anchored (random)
- Gray: A-Anchored (exact_question)
- Dark Gray: A-Anchored (random)
- **Models**: Separate charts for Llama-3-8B (left), Llama-3-70B (center), Mistral-7B-v0.3 (right)
### Detailed Analysis
#### Llama-3-8B
- **PopQA**:
- Q-Anchored (exact): ~75%
- A-Anchored (exact): ~35%
- Q-Anchored (random): ~10%
- A-Anchored (random): ~2%
- **TriviaQA**:
- Q-Anchored (exact): ~70%
- A-Anchored (exact): ~30%
- Q-Anchored (random): ~12%
- A-Anchored (random): ~3%
- **HotpotQA**:
- Q-Anchored (exact): ~65%
- A-Anchored (exact): ~10%
- Q-Anchored (random): ~13%
- A-Anchored (random): ~4%
- **NQ**:
- Q-Anchored (exact): ~60%
- A-Anchored (exact): ~15%
- Q-Anchored (random): ~8%
- A-Anchored (random): ~1%
#### Llama-3-70B
- **PopQA**:
- Q-Anchored (exact): ~78%
- A-Anchored (exact): ~32%
- Q-Anchored (random): ~15%
- A-Anchored (random): ~2%
- **TriviaQA**:
- Q-Anchored (exact): ~75%
- A-Anchored (exact): ~35%
- Q-Anchored (random): ~18%
- A-Anchored (random): ~3%
- **HotpotQA**:
- Q-Anchored (exact): ~70%
- A-Anchored (exact): ~20%
- Q-Anchored (random): ~12%
- A-Anchored (random): ~5%
- **NQ**:
- Q-Anchored (exact): ~65%
- A-Anchored (exact): ~25%
- Q-Anchored (random): ~10%
- A-Anchored (random): ~2%
#### Mistral-7B-v0.3
- **PopQA**:
- Q-Anchored (exact): ~70%
- A-Anchored (exact): ~25%
- Q-Anchored (random): ~10%
- A-Anchored (random): ~1%
- **TriviaQA**:
- Q-Anchored (exact): ~72%
- A-Anchored (exact): ~28%
- Q-Anchored (random): ~12%
- A-Anchored (random): ~2%
- **HotpotQA**:
- Q-Anchored (exact): ~68%
- A-Anchored (exact): ~22%
- Q-Anchored (random): ~9%
- A-Anchored (random): ~3%
- **NQ**:
- Q-Anchored (exact): ~66%
- A-Anchored (exact): ~27%
- Q-Anchored (random): ~11%
- A-Anchored (random): ~2%
### Key Observations
1. **Q-Anchored (exact_question)** consistently shows the highest prediction flip rates across all models and datasets, indicating the greatest sensitivity to patched exact question tokens.
2. **A-Anchored (exact_question)** flips more often than A-Anchored (random) but far less often than Q-Anchored (exact_question).
3. **Random patching** (both Q and A) generally yields the lowest flip rates, indicating that replacing random tokens perturbs the probe far less.
4. **Model size correlation**: Llama-3-70B (larger model) generally achieves higher flip rates than Llama-3-8B and Mistral-7B-v0.3, though the difference is less pronounced than the impact of anchoring strategy.
5. **Dataset variability**: NQ shows the lowest flip rates overall, while PopQA and TriviaQA perform better.
### Interpretation
The data show that patching **exact question tokens** changes the probe's prediction far more often than patching random tokens, and that this effect is much stronger for Q-Anchored samples than for A-Anchored samples. Flip rate here measures sensitivity to the patched tokens, not prediction accuracy. Differences between models are smaller than differences between patching conditions, and the same ordering of conditions holds across models and datasets.
</details>
Figure 3: Prediction flip rate under token patching. Q-Anchored samples demonstrate significantly higher sensitivity than the counterparts when hallucinatory cues are injected into exact questions. Full results in Appendix D.
Results
The results in Figure 2 and Appendix C reveal a clear bifurcation of behaviors: for one subset of instances, probabilities shift substantially, while for another subset, probabilities remain nearly unchanged across all layers. Shaded regions indicate 95% confidence intervals, confirming that this qualitative separation is statistically robust. This sharp divergence supports the hypothesis that internal truthfulness encoding operates via two distinct mechanisms with respect to question–answer information. In Appendix C, we conduct a comprehensive analysis of alternative configurations for token selection, activation extraction, and various instruction- or reasoning-oriented models, and observe consistent patterns across all settings. Moreover, Figure 16 in Appendix C shows that blocking information from randomly selected question tokens yields negligible changes, in contrast to blocking exact question tokens, underscoring the nontrivial nature of the identified mechanisms.
3.2.3 Further Validation via Token Patching
Experiment
To further validate our findings, we employ token patching on the critical (exact) question tokens to investigate how the internal representations of the LLM respond to hallucinatory signals originating from these tokens under the two proposed mechanisms. Given a context sample $d_{c}$, we randomly select a patch sample $d_{p}$ and replace the original exact question tokens $E_{Q}^{c}$ in $d_{c}$ with the exact question tokens $E_{Q}^{p}$ from $d_{p}$. This operation introduces hallucinatory cues into the context sample, allowing us to assess whether the LLM’s internal states appropriately reflect the injected changes. We restrict our analysis to context instances where the original LLM answers are factual, ensuring that any observed changes can be attributed solely to the injected hallucinatory cues.
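A minimal sketch of the replacement step is shown below, assuming the patch is applied at the token level to the input sequence and that the exact question tokens occupy one contiguous span. The helper name, the span convention, and the patch tokens are hypothetical and serve only as an example.

```python
def patch_exact_question(context_tokens, context_span, patch_exact_tokens):
    """Replace the exact question tokens of the context sample.

    context_span: half-open (start, end) index range of E_Q^c in context_tokens.
    patch_exact_tokens: the tokens E_Q^p taken from the patch sample.
    Returns a new list; the context sample itself is left unchanged.
    """
    start, end = context_span
    return context_tokens[:start] + list(patch_exact_tokens) + context_tokens[end:]


# Context sample d_c: the question from Table 1, with the exact subject at [5, 7).
context = ["What", "is", "the", "capital", "of", "South", "Carolina", "?"]
# Exact subject tokens from a hypothetical patch sample d_p.
patch_subject = ["North", "Dakota"]

patched = patch_exact_question(context, (5, 7), patch_subject)
```

The model's original answer is kept fixed after patching, so an answer that was factual for the original question becomes inconsistent with the patched one, which is the hallucinatory cue the probe is expected to pick up.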
Results
We measure the sensitivity of the truthfulness signals using the prediction flip rate, defined as the frequency with which the probe’s prediction changes after hallucinatory cues are introduced. Figure 3 and Appendix D present the results for the best-performing layer of each model on four datasets when patching the exact subject tokens. Across models and datasets, the Q-Anchored mode exhibits significantly higher sensitivity than the A-Anchored mode when exposed to hallucinatory cues from the questions. Furthermore, within each pathway, the flip rates when exact question tokens are patched are substantially higher than those observed when random tokens are patched, ruling out the possibility that the observed effects are mainly due to general semantic disruption from token replacement. These consistent results provide further support for our hypothesis of two distinct information pathways.
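The prediction flip rate defined above reduces to a simple count, sketched here. The prediction lists are made-up placeholders for the probe's binary outputs before and after patching, and `flip_rate` is an illustrative name.

```python
def flip_rate(before, after) -> float:
    """Fraction of samples whose binary probe prediction changes after patching."""
    if len(before) != len(after):
        raise ValueError("prediction lists must have the same length")
    if not before:
        return 0.0
    return sum(b != a for b, a in zip(before, after)) / len(before)


# Hypothetical probe predictions (0 = factual, 1 = hallucinatory).
q_anchored_rate = flip_rate([0, 0, 0, 0], [1, 0, 1, 1])
a_anchored_rate = flip_rate([0, 0, 0, 0], [0, 0, 1, 0])
```

Computing the rate separately for Q-Anchored and A-Anchored samples, and for exact versus random patched tokens, gives the four conditions compared in Figure 3.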
3.2.4 What Drives A-Anchored Encoding?
Experiment
Since the A-Anchored mode operates largely independently of the question-to-answer information flow, it is important to investigate the source of information it uses to identify hallucinations. To this end, we remove the questions entirely from each sample and perform a separate forward pass using only the LLM-generated answers. This procedure yields answer-only hidden states, which are subsequently provided as input to the probe. We then evaluate how the probe’s predictions change under this “answer-only” condition. This setup enables us to assess whether A-Anchored predictions rely primarily on the generated answer itself rather than on the original question.
Results
As shown in Figure 4 and Appendix E, Q-Anchored instances exhibit substantial changes in prediction probability when the question is removed, reflecting their dependence on question-to-answer information. In contrast, A-Anchored instances remain largely invariant, indicating that the probe continues to detect hallucinations using information encoded within the LLM-generated answer itself. These findings suggest that the A-Anchored mechanism primarily leverages self-contained answer information to build signals about truthfulness.
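The comparison can be sketched as a per-group average of the probability drop. This is a minimal sketch that assumes $-\Delta\mathrm{P}$ is the decrease in the probe's predicted probability relative to the unmodified forward pass, reported in percentage points; the function name and the example probabilities are hypothetical.

```python
from statistics import mean
from typing import Sequence


def neg_delta_p(
    p_full: Sequence[float], p_answer_only: Sequence[float]
) -> float:
    """Mean drop in probe probability (percentage points) when the
    question is removed and only the generated answer is encoded."""
    if len(p_full) != len(p_answer_only) or not p_full:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * mean(f - a for f, a in zip(p_full, p_answer_only))


# Hypothetical probe probabilities for the two groups of instances.
q_anchored = neg_delta_p([0.90, 0.80, 0.85], [0.35, 0.40, 0.30])
a_anchored = neg_delta_p([0.90, 0.80, 0.85], [0.88, 0.76, 0.80])
```

A large value for the first group and a value near zero for the second would mirror the qualitative pattern described above.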
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart: -ΔP Comparison Across Models and Datasets
### Overview
The image presents a grouped bar chart comparing -ΔP across three language models (Llama-3-8B, Llama-3-70B, Mistral-7B-v0.3) and four datasets (PopQA, TriviaQA, HotpotQA, NQ). Two groups of instances are compared: Q-Anchored (red bars) and A-Anchored (gray bars). The y-axis measures -ΔP, the shift in probe prediction probability when only the LLM-generated answer is provided (higher values indicate a larger shift), while the x-axis categorizes datasets.
### Components/Axes
- **X-Axis**: Datasets (PopQA, TriviaQA, HotpotQA, NQ) repeated for each model section.
- **Y-Axis**: -ΔP values (0–80 range, increments of 20).
- **Legend**:
- Red = Q-Anchored
- Gray = A-Anchored
- **Model Sections**: Three distinct sub-charts labeled by model name (top-left to top-right).
### Detailed Analysis
#### Llama-3-8B Section
- **PopQA**: Q-Anchored ≈ 50, A-Anchored ≈ 5
- **TriviaQA**: Q-Anchored ≈ 60, A-Anchored ≈ 10
- **HotpotQA**: Q-Anchored ≈ 50, A-Anchored ≈ 20
- **NQ**: Q-Anchored ≈ 30, A-Anchored ≈ 5
#### Llama-3-70B Section
- **PopQA**: Q-Anchored ≈ 50, A-Anchored ≈ 5
- **TriviaQA**: Q-Anchored ≈ 45, A-Anchored ≈ 10
- **HotpotQA**: Q-Anchored ≈ 40, A-Anchored ≈ 20
- **NQ**: Q-Anchored ≈ 40, A-Anchored ≈ 5
#### Mistral-7B-v0.3 Section
- **PopQA**: Q-Anchored ≈ 70, A-Anchored ≈ 15
- **TriviaQA**: Q-Anchored ≈ 50, A-Anchored ≈ 5
- **HotpotQA**: Q-Anchored ≈ 40, A-Anchored ≈ 20
- **NQ**: Q-Anchored ≈ 45, A-Anchored ≈ 3
### Key Observations
1. **Larger Q-Anchored Shifts**: Q-Anchored bars are higher than A-Anchored bars for every model and dataset (e.g., Mistral-7B-v0.3 on PopQA: 70 vs. 15).
2. **Model-Specific Trends**:
- Llama-3-8B shows its largest gap on TriviaQA (60 vs. 10).
- Mistral-7B-v0.3 reaches the highest Q-Anchored value (70) on PopQA.
3. **Smallest Gaps**: The gap is narrowest on HotpotQA for Llama-3-70B and Mistral-7B-v0.3 (40 vs. 20 in both cases) and on NQ for Llama-3-8B (30 vs. 5).
4. **A-Anchored Peaks**: A-Anchored values peak on HotpotQA for all three models (≈ 20).
### Interpretation
The bars show how much the probe's prediction probability shifts when the question is removed. Q-Anchored instances shift substantially in every model and dataset, reflecting their dependence on question-to-answer information, whereas A-Anchored instances shift far less, which is consistent with truthfulness signals that rely on the LLM-generated answer itself. The somewhat larger A-Anchored shifts on HotpotQA suggest that the separation between the two groups is less sharp on that dataset.
</details>
Figure 4: $-\Delta\mathrm{P}$ with only the LLM-generated answer. Q-Anchored instances exhibit substantial shifts, whereas A-Anchored instances remain stable, confirming that A-Anchored truthfulness encoding relies on information in the LLM-generated answer itself. Full results in Appendix E.
4 Properties of Truthfulness Pathways
This section examines notable properties and distinct behaviors of intrinsic truthfulness encoding: (1) Associations with knowledge boundaries: samples within the LLM’s knowledge boundary tend to encode truthfulness via the Q-Anchored pathway, whereas samples beyond the boundary often rely on the A-Anchored signal; (2) Self-awareness: internal representations can be used to predict which mechanism is being employed, suggesting that LLMs possess intrinsic awareness of pathway distinctions.
4.1 Associations with Knowledge Boundaries
We find that distinct patterns of truthfulness encoding are closely associated with the knowledge boundaries of LLMs. To characterize these boundaries, three complementary metrics are employed: (1) Answer accuracy, the most direct indicator of an LLM’s factual competence; (2) I-don’t-know rate (shown in Appendix G), which reflects the model’s ability to recognize and express its own knowledge limitations; (3) Entity popularity, which is widely used to distinguish between common and long-tail factual knowledge (Mallen et al., 2023).
As shown in Figure 5 and Appendix F, Q-Anchored samples achieve significantly higher accuracy than those driven by the A-Anchored pathway. The results for the I-don’t-know rate, reported in Appendix G, exhibit trends consistent with answer accuracy, further indicating stronger knowledge handling in Q-Anchored samples. Moreover, entity popularity, shown in Figure 6, provides a more fine-grained perspective on knowledge boundaries. Specifically, Q-Anchored samples tend to involve more popular entities, whereas A-Anchored samples are more frequently associated with less popular, long-tail factual knowledge. These findings suggest that truthfulness encoding is strongly aligned with the availability of stored knowledge: when LLMs possess the requisite knowledge, they predominantly rely on question–answer information flow (Q-Anchored); when knowledge is unavailable, they instead draw upon internal patterns within their own generated outputs (A-Anchored).
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Graphs: Answer Accuracy Across Layers for Different Models and Datasets
### Overview
The image contains three line graphs comparing answer accuracy across transformer model layers for different architectures (Llama-3-8B, Llama-3-70B, Mistral-7B-v0.3) and datasets (PopQA, TriviaQA, HotpotQA, NQ). Each graph shows two anchoring methods: Q-Anchored (question-focused) and A-Anchored (answer-focused), with shaded regions indicating variability.
### Components/Axes
- **X-axis**: Layer number (0–30 for Llama-3-8B/Mistral-7B, 0–80 for Llama-3-70B)
- **Y-axis**: Answer Accuracy (%) (0–100)
- **Legends**:
- **Q-Anchored**: Solid lines (PopQA: blue, TriviaQA: green, HotpotQA: purple, NQ: pink)
- **A-Anchored**: Dashed lines (PopQA: orange, TriviaQA: gray, HotpotQA: red, NQ: dark gray)
- **Shading**: Represents confidence intervals or variability around each line.
### Detailed Analysis
#### Llama-3-8B
- **Q-Anchored (PopQA)**: Blue line starts at ~20% accuracy, peaks at ~85% by layer 10, then declines to ~60% by layer 30.
- **A-Anchored (PopQA)**: Orange line starts at ~40%, peaks at ~70% by layer 10, then drops to ~30% by layer 30.
- **TriviaQA**: Q-Anchored (green) peaks at ~90% by layer 15, declines to ~70%. A-Anchored (gray) peaks at ~60%, drops to ~40%.
- **HotpotQA**: Q-Anchored (purple) peaks at ~80% by layer 20, declines to ~60%. A-Anchored (red) peaks at ~50%, drops to ~30%.
- **NQ**: Q-Anchored (pink) peaks at ~75% by layer 10, declines to ~50%. A-Anchored (dark gray) peaks at ~45%, drops to ~25%.
#### Llama-3-70B
- **Q-Anchored (PopQA)**: Blue line peaks at ~90% by layer 20, declines to ~70% by layer 80.
- **A-Anchored (PopQA)**: Orange line peaks at ~75% by layer 20, declines to ~50% by layer 80.
- **TriviaQA**: Q-Anchored (green) peaks at ~95% by layer 40, declines to ~80%. A-Anchored (gray) peaks at ~70%, drops to ~50%.
- **HotpotQA**: Q-Anchored (purple) peaks at ~85% by layer 60, declines to ~70%. A-Anchored (red) peaks at ~60%, drops to ~40%.
- **NQ**: Q-Anchored (pink) peaks at ~80% by layer 40, declines to ~60%. A-Anchored (dark gray) peaks at ~55%, drops to ~35%.
#### Mistral-7B-v0.3
- **Q-Anchored (PopQA)**: Blue line peaks at ~90% by layer 10, declines to ~70% by layer 30.
- **A-Anchored (PopQA)**: Orange line peaks at ~70% by layer 10, declines to ~40% by layer 30.
- **TriviaQA**: Q-Anchored (green) peaks at ~85% by layer 15, declines to ~65%. A-Anchored (gray) peaks at ~60%, drops to ~40%.
- **HotpotQA**: Q-Anchored (purple) peaks at ~80% by layer 20, declines to ~60%. A-Anchored (red) peaks at ~50%, drops to ~30%.
- **NQ**: Q-Anchored (pink) peaks at ~75% by layer 10, declines to ~55%. A-Anchored (dark gray) peaks at ~45%, drops to ~25%.
### Key Observations
1. **Q-Anchored Samples**: Reach a higher peak accuracy than the corresponding A-Anchored curve for every model and dataset listed above (e.g., Llama-3-8B on TriviaQA: ~90% vs. ~60%).
2. **A-Anchored Samples**: Peak lower and also end lower in the final layers (e.g., Llama-3-8B on PopQA: ~30% vs. ~60% at layer 30).
3. **Dataset Variability**:
- TriviaQA yields the highest Q-Anchored peaks for both Llama models (~90% and ~95%), while Mistral-7B-v0.3 peaks highest on PopQA (~90%).
- NQ yields the lowest peaks for both pathways in all three models.
4. **Shaded Regions**: Indicate variability around each curve, particularly for A-Anchored curves in deeper layers (e.g., Llama-3-70B layer 60+).
### Interpretation
The curves are consistent with the figure caption: Q-Anchored samples attain higher answer accuracy than A-Anchored samples across models and datasets, which ties the truthfulness encoding pathway to the LLM's knowledge boundaries.
</details>
Figure 5: Comparisons of answer accuracy between pathways. Q-Anchored samples show higher accuracy than A-Anchored ones, highlighting the association between truthfulness encoding and LLM knowledge boundaries. Full results in Appendices F and G.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Bar Chart: Entity Frequency Comparison Across Models
### Overview
The chart compares entity frequency distributions between Q-Anchored and A-Anchored models across four language models: Llama-3-8B, Llama-3-70B, Mistral-7B-v0.3, and Mistral-7B-v0.1. Two bar series are shown: red for Q-Anchored and gray for A-Anchored.
### Components/Axes
- **X-Axis**: Model names (Llama-3-8B, Llama-3-70B, Mistral-7B-v0.3, Mistral-7B-v0.1)
- **Y-Axis**: Entity Frequency (0 to 80,000 in increments of 20,000)
- **Legend**: Located at bottom center, with red = Q-Anchored, gray = A-Anchored
- **Title**: "Entity Frequency" (y-axis label)
### Detailed Analysis
1. **Llama-3-8B**:
- Q-Anchored: ~65,000 (±2,000)
- A-Anchored: ~18,000 (±1,500)
2. **Llama-3-70B**:
- Q-Anchored: ~25,000 (±1,800)
- A-Anchored: ~12,000 (±1,200)
3. **Mistral-7B-v0.3**:
- Q-Anchored: ~55,000 (±2,500)
- A-Anchored: ~14,000 (±1,300)
4. **Mistral-7B-v0.1**:
- Q-Anchored: ~75,000 (±3,000)
- A-Anchored: ~22,000 (±2,000)
### Key Observations
- Q-Anchored bars are roughly **2-4x higher** than A-Anchored bars for every model
- Mistral-7B-v0.1 has the **highest Q-Anchored frequency** (75k) and **largest gap** between the two groups (53k difference)
- Llama-3-70B shows the **smallest Q-Anchored frequency** (25k) and **narrowest gap** (13k difference)
- A-Anchored frequencies vary less across models (≈ 12,000 to 22,000) than Q-Anchored frequencies (≈ 25,000 to 75,000)
### Interpretation
In every model, Q-Anchored samples involve entities with markedly higher frequency than A-Anchored samples, consistent with Q-Anchored samples concentrating on popular entities and A-Anchored samples skewing toward long-tail entities. The gap does not track model size: Llama-3-70B, the largest model shown, has the narrowest gap.
</details>
Figure 6: Entity frequency distributions for both pathways on PopQA. Q-Anchored samples concentrate on more popular entities, whereas A-Anchored samples skew toward long-tail entities.
4.2 Self-Awareness of Pathway Distinctions
Given that LLMs encode truthfulness via two distinct mechanisms, this section investigates whether their internal representations contain discriminative information that can be used to distinguish between these mechanisms. To this end, we train probing classifiers on the models’ original internal states (i.e., without knockout interventions) to predict which mechanism is being utilized.
Table 2 reports the pathway classification results of the best-performing layers in hallucination detection across different models. Our findings demonstrate that different mechanisms can be reliably inferred from internal representations, suggesting that, in addition to encoding truthfulness, LLMs exhibit intrinsic awareness of pathway distinctions. These findings highlight a potential avenue for fine-grained improvements targeting specific truthfulness encoding mechanisms.
| Datasets | Llama-3-8B | Llama-3-70B | Mistral-7B-v0.3 |
| --- | --- | --- | --- |
| PopQA | 87.80 | 92.66 | 87.64 |
| TriviaQA | 75.10 | 83.91 | 85.87 |
| HotpotQA | 86.31 | 87.34 | 92.13 |
| NQ | 78.31 | 84.14 | 84.83 |
Table 2: AUCs for encoding pathway classification. The predictability from internal representations indicates that LLMs possess intrinsic awareness of pathway distinctions.
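For reference, the AUC reported for the pathway classifier can be computed directly from probe scores and pathway labels. The sketch below uses the pairwise-ranking definition of AUC in plain Python; the label convention (1 for Q-Anchored, 0 for A-Anchored) and the example scores are assumptions for illustration only.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive sample is scored
    above a random negative sample (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical self-awareness probe scores; label 1 = Q-Anchored (assumed).
scores = [0.90, 0.80, 0.35, 0.60, 0.20]
labels = [1, 1, 1, 0, 0]
pathway_auc = auc(scores, labels)
```

A value of 0.5 corresponds to chance-level separation, so the AUCs in Table 2 (reported on a 0 to 100 scale) indicate that the two pathways are well separated in the internal representations.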
5 Pathway-Aware Detection
Building on these findings, we explore how the discovered pathway distinctions can be leveraged to improve hallucination detection. Specifically, two simple yet effective pathway-aware strategies are proposed: (1) Mixture-of-Probes (MoP) (§ 5.1), which allows expert probes to specialize in the Q-Anchored and A-Anchored pathways respectively, and (2) Pathway Reweighting (PR) (§ 5.2), a plug-and-play approach that amplifies pathway-relevant cues salient for detection.
5.1 Mixture-of-Probes
Motivated by the fundamentally different dependencies of the two encoding pathways and the LLMs’ intrinsic awareness of them, we propose a Mixture-of-Probes (MoP) framework that explicitly captures this heterogeneity. Rather than training a single probe to handle all inputs, MoP employs two pathway-specialized experts and leverages the self-awareness probe (§ 4.2) as a gating network to combine their predictions. Let $\mathbf{h}^{l^{*}}(x)\in\mathbb{R}^{d}$ be the token hidden state from the best detection layer $l^{*}$ . Two expert probes $p_{Q}(\cdot)$ and $p_{A}(\cdot)$ are trained separately on the samples of the two pathways, and the self-awareness probe provides a gating coefficient $\pi_{Q}=\pi(\mathbf{h}^{l^{*}}(x))\in[0,1]$ . The final prediction is a convex combination, requiring no extra training:
$$
p_{\text{MoP}}(z\!=\!1\mid\mathbf{h}^{l^{*}}(x))=\pi_{Q}\,p_{Q}(z\!=\!1\mid\mathbf{h}^{l^{*}}(x))+(1-\pi_{Q})\,p_{A}(z\!=\!1\mid\mathbf{h}^{l^{*}}(x)). \tag{3}
$$
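The combination in Eq. (3) can be sketched in a few lines. This is a minimal illustration that assumes all three probes are logistic-regression style linear probes on the same hidden state; the helper names, weights, and hidden state are hypothetical.

```python
import math
from typing import Callable, Sequence

Probe = Callable[[Sequence[float]], float]


def make_linear_probe(weights: Sequence[float], bias: float = 0.0) -> Probe:
    """Build a logistic probe h -> sigmoid(w . h + b)."""
    def probe(h: Sequence[float]) -> float:
        logit = sum(w * x for w, x in zip(weights, h)) + bias
        return 1.0 / (1.0 + math.exp(-logit))
    return probe


def mop_predict(
    h: Sequence[float], gate: Probe, expert_q: Probe, expert_a: Probe
) -> float:
    """Eq. (3): convex combination of the two expert predictions,
    weighted by the self-awareness (gating) probability pi_Q."""
    pi_q = gate(h)
    return pi_q * expert_q(h) + (1.0 - pi_q) * expert_a(h)


# Hypothetical hidden state and probe weights, for illustration only.
h = [0.5, -1.0, 2.0]
gate = make_linear_probe([1.0, 0.0, 0.5])
expert_q = make_linear_probe([0.8, -0.2, 0.3])
expert_a = make_linear_probe([-0.4, 0.6, -0.1])
p_mop = mop_predict(h, gate, expert_q, expert_a)
```

Because the gate output lies in $[0,1]$, the combined probability always falls between the two expert predictions, and no parameters beyond the three already-trained probes are needed.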
5.2 Pathway Reweighting
To emphasize pathway-relevant internal cues, we introduce a plug-and-play Pathway Reweighting (PR) method that directly modulates the question–answer information flow. The key idea is to adjust the attention from exact answer tokens to exact question tokens according to the predicted pathway, amplifying the signals most salient for hallucination detection. For each layer $l\leq l^{*}$ , two learnable scalars $\alpha_{Q}^{l},\alpha_{A}^{l}>0$ are introduced. Given the self-awareness probability $\pi_{Q}=\pi(\mathbf{h}^{l^{*}}(x))$ , we rescale attention edges $i\in E_{A}$ , $j\in E_{Q}$ to construct representations tailored for detection:
$$
\tilde{A}^{l}(i,j)=\begin{cases}\bigl[1+s(\mathbf{h}^{l^{*}}(x))\bigr]A^{l}(i,j),&i\!\in\!E_{A},j\!\in\!E_{Q},\\
A^{l}(i,j),&\text{otherwise},\end{cases} \tag{4}
$$
where
$$
s(\mathbf{h}^{l^{*}}(x))=\pi_{Q}\,\alpha_{Q}^{l}-(1-\pi_{Q})\,\alpha_{A}^{l}. \tag{5}
$$
The extra parameters serve as a lightweight adapter, used only during detection to guide salient truthfulness cues and omitted during generation, leaving the generation capacity unaffected.
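Eqs. (4) and (5) can be sketched for a single attention matrix as follows. The sketch applies the scaling exactly as written, without renormalizing rows afterwards (whether the original implementation renormalizes is not stated here); the function names, the toy attention matrix, and the scalar values are hypothetical.

```python
from typing import Iterable, List


def pathway_scale(pi_q: float, alpha_q: float, alpha_a: float) -> float:
    """Eq. (5): s = pi_Q * alpha_Q - (1 - pi_Q) * alpha_A."""
    return pi_q * alpha_q - (1.0 - pi_q) * alpha_a


def reweight_attention(
    attn: List[List[float]],
    answer_idx: Iterable[int],
    question_idx: Iterable[int],
    s: float,
) -> List[List[float]]:
    """Eq. (4): scale edges from answer tokens (rows i in E_A) to
    question tokens (columns j in E_Q) by (1 + s); keep other edges."""
    rows, cols = set(answer_idx), set(question_idx)
    return [
        [(1.0 + s) * w if i in rows and j in cols else w
         for j, w in enumerate(row)]
        for i, row in enumerate(attn)
    ]


# Toy causal attention over 4 tokens: question = {0, 1}, answer = {3}.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.3, 0.2, 0.0],
    [0.4, 0.3, 0.2, 0.1],
]
s = pathway_scale(pi_q=0.75, alpha_q=0.4, alpha_a=0.2)
new_attn = reweight_attention(attn, answer_idx=[3], question_idx=[0, 1], s=s)
```

With these values the scale is positive, so answer-to-question attention is amplified; a sample predicted as A-Anchored (small $\pi_{Q}$) would instead yield a negative scale and attenuate those edges.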
| Method | Llama-3-8B / PopQA | Llama-3-8B / TriviaQA | Llama-3-8B / HotpotQA | Llama-3-8B / NQ | Mistral-7B-v0.3 / PopQA | Mistral-7B-v0.3 / TriviaQA | Mistral-7B-v0.3 / HotpotQA | Mistral-7B-v0.3 / NQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| P(True) | 55.85 | 49.92 | 52.14 | 53.27 | 45.49 | 47.61 | 57.87 | 52.79 |
| Logits-mean | 74.52 | 60.39 | 51.94 | 52.63 | 69.52 | 66.76 | 55.45 | 57.88 |
| Logits-min | 85.36 | 70.89 | 61.28 | 56.50 | 87.05 | 77.33 | 68.08 | 54.40 |
| Probing Baseline | 88.71 | 77.58 | 82.23 | 70.20 | 87.39 | 81.74 | 83.19 | 73.60 |
| MoP-RandomGate | 75.52 | 69.17 | 79.88 | 66.56 | 79.81 | 70.88 | 72.23 | 61.19 |
| MoP-VanillaExperts | 89.11 | 78.73 | 84.57 | 71.21 | 88.53 | 80.93 | 82.93 | 73.77 |
| MoP | 92.11 | 81.18 | 85.45 | 74.64 | 91.66 | 83.57 | 85.82 | 76.87 |
| PR | 94.01 | 83.13 | 87.81 | 79.10 | 93.09 | 84.36 | 89.03 | 79.09 |
Table 3: Comparison of hallucination detection performance (AUC). Full results in Appendix H.
5.3 Experiments
Setup
The experimental setup follows Section 3.2.1. We compare our method against several internal-based baselines, including (1) P(True) (Kadavath et al., 2022), (2) uncertainty-based metrics (Aichberger et al., 2024; Xue et al., 2025a), and (3) probing classifiers (Chen et al., 2024; Orgad et al., 2025). Results are averaged over three random seeds. Additional implementation details are provided in Appendix B.5 and B.6.
Results
As shown in Table 3 and Appendix H, both MoP and PR consistently outperform competing approaches across different datasets and model scales. For MoP, we further examine two ablated variants: (1) MoP-RandomGate, which randomly routes the two pathway experts without leveraging the self-awareness probe; and (2) MoP-VanillaExperts, which replaces the expert probes with two vanilla probes to serve as a simple ensemble strategy. Both ablated variants underperform the full MoP (e.g., 75.52 and 89.11 vs. 92.11 AUC for Llama-3-8B on PopQA): MoP-RandomGate falls below even the probing baseline in every setting of Table 3, while MoP-VanillaExperts stays close to that baseline. This underscores the roles of both self-awareness gating and pathway specialization. PR achieves the highest AUC in every setting of Table 3, indicating that dynamically adjusting the focus on salient truthfulness cues is particularly effective. These results demonstrate that explicitly modeling truthfulness encoding heterogeneity can effectively translate the insights of our analysis into practical gains for hallucination detection.
6 Related Work
Hallucination detection in LLMs has received increasing attention because of its critical role in building reliable and trustworthy generative systems (Tian et al., 2024; Shi et al., 2024; Bai et al., 2024). Existing approaches can be broadly grouped by whether they rely on external resources (e.g., retrieval systems or fact-checking APIs). Externally assisted methods cross-verify output texts against external knowledge bases (Min et al., 2023; Hu et al., 2025; Huang et al., 2025) or specialized LLM judges (Luo et al., 2024; Bouchard and Chauhan, 2025; Zhang et al., 2025). Resource-free methods avoid external data and instead exploit the model’s own intermediate computations. Some leverage the model’s self-awareness of knowledge boundaries (Kadavath et al., 2022; Luo et al., 2025), while others use uncertainty-based measures (Aichberger et al., 2024; Xue et al., 2025a), treating confidence as a proxy for truthfulness. These techniques analyze output distributions (e.g., logits) (Aichberger et al., 2024), variance across multiple samples (e.g., consistency) (Min et al., 2023; Aichberger et al., 2025), or other statistical indicators of prediction uncertainty (Xue et al., 2025b). Another line of work trains linear probing classifiers on hidden representations to capture intrinsic truthfulness signals. Prior work (Burns et al., 2023; Li et al., 2023; Chen et al., 2024; Orgad et al., 2025) shows that LLMs encode rich latent features correlated with factual accuracy, enabling efficient detection with minimal overhead. Yet the mechanisms behind these internal truthfulness encodings remain poorly understood. Our work addresses this gap by dissecting how such intrinsic signals emerge and operate, revealing distinct information pathways that not only yield explanatory insights but also enhance detection performance.
7 Conclusion
We investigate how LLMs encode truthfulness, revealing two complementary pathways: a Question-Anchored pathway relying on question–answer flow, and an Answer-Anchored pathway extracting self-contained evidence from generated outputs. Analyses across datasets and models highlight their ties to knowledge boundaries and intrinsic self-awareness. Building on these insights, we further propose two applications to improve hallucination detection. Overall, our findings not only advance mechanistic understanding of intrinsic truthfulness encoding but also offer practical applications for building more reliable generative systems.
Limitations
While this work provides a systematic analysis of intrinsic truthfulness encoding mechanisms in LLMs and demonstrates their utility for hallucination detection, one limitation is that, similar to prior work on mechanistic interpretability, our analyses and pathway-aware applications assume access to internal model representations. Such access may not always be available in strictly black-box settings. In these scenarios, additional engineering or alternative approximations may be required for practical deployment, which we leave for future work.
Ethics Statement
Our work presents minimal potential for negative societal impact, primarily due to the use of publicly available datasets and models. This accessibility inherently reduces the risk of adverse effects on individuals or society.
References
- Aichberger et al. (2024) Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. 2024. Semantically diverse language generation for uncertainty estimation in language models. arXiv preprint arXiv:2406.04306.
- Aichberger et al. (2025) Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. 2025. Improving uncertainty estimation through semantically diverse language generation. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.
- Bai et al. (2024) Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. 2024. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 7421–7454. Association for Computational Linguistics.
- Baker et al. (1998) Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The berkeley framenet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90.
- Bouchard and Chauhan (2025) Dylan Bouchard and Mohit Singh Chauhan. 2025. Uncertainty quantification for language models: A suite of black-box, white-box, llm judge, and ensemble scorers. arXiv preprint arXiv:2504.19254.
- Burns et al. (2023) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2023. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Chen et al. (2024) Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: llms’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Fierro et al. (2025) Constanza Fierro, Negar Foroutan, Desmond Elliott, and Anders Søgaard. 2025. How do multilingual language models remember facts? In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 16052–16106. Association for Computational Linguistics.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12216–12235. Association for Computational Linguistics.
- Ghandeharioun et al. (2024) Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. 2024. Patchscopes: A unifying framework for inspecting hidden representations of language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.
- Hu et al. (2025) Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, and Qing Li. 2025. Removal of hallucination on hallucination: Debate-augmented RAG. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 15839–15853. Association for Computational Linguistics.
- Huang et al. (2025) Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yuxuan Gu, Yangfan Ye, Liang Zhao, Weihong Zhong, Baoxin Wang, Dayong Wu, Guoping Hu, Lingpeng Kong, Tong Xiao, Ting Liu, and Bing Qin. 2025. Alleviating hallucinations from knowledge misalignment in large language models via selective abstention learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24564–24579. Association for Computational Linguistics.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, T. J. Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zachary Dodds, Nova Dassarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, and 17 others. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
- Li et al. (2023) Kenneth Li, Oam Patel, Fernanda B. Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Luo et al. (2024) Wen Luo, Tianshu Shen, Wei Li, Guangyue Peng, Richeng Xuan, Houfeng Wang, and Xi Yang. 2024. Halludial: A large-scale benchmark for automatic dialogue-level hallucination evaluation. Preprint, arXiv:2406.07070.
- Luo et al. (2025) Wen Luo, Feifan Song, Wei Li, Guangyue Peng, Shaohang Wei, and Houfeng Wang. 2025. Odysseus navigates the sirens’ song: Dynamic focus decoding for factual and diverse open-ended text generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27200–27218, Vienna, Austria. Association for Computational Linguistics.
- Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822, Toronto, Canada. Association for Computational Linguistics.
- Michel et al. (2019) Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? Advances in neural information processing systems, 32.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12076–12100. Association for Computational Linguistics.
- Niu et al. (2025) Mengjia Niu, Hamed Haddadi, and Guansong Pang. 2025. Robust hallucination detection in llms via adaptive token selection. arXiv preprint arXiv:2504.07863.
- Orgad et al. (2025) Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. 2025. Llms know more than they show: On the intrinsic representation of LLM hallucinations. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.
- Pagnoni et al. (2021) Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with frank: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829.
- Qian et al. (2025) Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. 2025. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning. arXiv preprint arXiv:2506.02867.
- Shi et al. (2024) Zhengliang Shi, Shuo Zhang, Weiwei Sun, Shen Gao, Pengjie Ren, Zhumin Chen, and Zhaochun Ren. 2024. Generate-then-ground in retrieval-augmented generation for multi-hop question answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 7339–7353. Association for Computational Linguistics.
- Simonyan et al. (2014) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings.
- Tian et al. (2024) Yuanhe Tian, Ruyi Gan, Yan Song, Jiaxing Zhang, and Yongdong Zhang. 2024. ChiMed-GPT: A Chinese medical large language model with full training regime and better alignment to human preferences. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 7156–7173. Association for Computational Linguistics.
- Todd et al. (2024) Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. 2024. Function vectors in large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Wang et al. (2023) Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855.
- Wu et al. (2025) Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2025. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.
- Xue et al. (2025a) Boyang Xue, Fei Mi, Qi Zhu, Hongru Wang, Rui Wang, Sheng Wang, Erxin Yu, Xuming Hu, and Kam-Fai Wong. 2025a. UAlign: Leveraging uncertainty estimations for factuality alignment on large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6002–6024, Vienna, Austria. Association for Computational Linguistics.
- Xue et al. (2025b) Yihao Xue, Kristjan Greenewald, Youssef Mroueh, and Baharan Mirzasoleiman. 2025b. Verify when uncertain: Beyond self-consistency in black box hallucination detection. arXiv preprint arXiv:2502.15845.
- Yang et al. (2025) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
- Zhang et al. (2025) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, and 1 others. 2025. Siren’s song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics, pages 1–46.
Appendix A LLM Usage
In this work, we employ LLMs solely for language refinement to enhance clarity and explanatory quality. All content has been carefully verified for factual accuracy, and the authors take full responsibility for the entire manuscript. The core ideas, experimental design, and methodological framework are conceived and developed independently by the authors, without the use of LLMs.
Appendix B Implementation Details
B.1 Identifying Exact Question and Answer Tokens
To locate the exact question and answer tokens within a QA pair, we prompt GPT-4o (version gpt-4o_2024-11-20) to identify the precise positions of the core frame elements. The instruction templates are presented in Tables 5 and 6. A token is considered an exact question or exact answer if and only if it constitutes a valid substring of the corresponding question or answer. To mitigate potential biases, each example is prompted at most five times, and only successfully extracted instances are retained for downstream analysis. Prior work (Orgad et al., 2025) has shown that LLMs can accurately identify exact answer tokens, typically achieving over 95% accuracy. In addition, we manually verified GPT-4o’s identification quality in our setting. Specifically, it achieves 99.92%, 95.83%, and 96.62% accuracy on exact subject tokens, exact property tokens, and exact answer tokens, respectively. Furthermore, we also explore alternative configurations without the use of exact tokens to ensure the robustness of our findings (see Section B.2).
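The retry-and-validate step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: `extract_fn` is a hypothetical stand-in for the GPT-4o call, the `NO ANSWER` sentinel is taken from the prompt templates, and retrying after a `NO ANSWER` reply is an assumption.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # each example is prompted at most five times


def locate_exact_span(text: str,
                      extract_fn: Callable[[str], str],
                      max_attempts: int = MAX_ATTEMPTS) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the extracted span, or None.

    `extract_fn` is a hypothetical wrapper around the extraction prompt.
    An extraction is accepted only if it is a verbatim substring of `text`.
    """
    for _ in range(max_attempts):
        candidate = extract_fn(text).strip()
        if not candidate or candidate == "NO ANSWER":
            continue  # assumed behaviour: try again until attempts run out
        start = text.find(candidate)
        if start != -1:
            return start, start + len(candidate)
    return None  # the example is dropped from downstream analysis


# Usage with a fake extractor whose first reply is not a valid substring.
answer = "The telephone was invented by Thomas Edison in the 19th century."
replies = iter(["Edison, Thomas", "Thomas Edison"])
span = locate_exact_span(answer, lambda _: next(replies))
```

Character offsets would still need to be mapped to token positions with the model's tokenizer; that step is omitted here.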
B.2 Probing Implementation Details
We investigate multiple probing configurations. For token selection, we consider three types of tokens: (1) the final token of the answer, which is the most commonly adopted choice in prior work due to its global receptive field under attention (Chen et al., 2024); (2) the token immediately preceding the exact answer span; and (3) the final token within the exact answer span. For activation extraction, we obtain representations from either (1) the output of each attention sublayer or (2) the output of the final multi-layer perceptron (MLP) in each transformer layer. Across all configurations, our experimental results exhibit consistent trends, indicating that the observed findings are robust to these design choices. For the probing classifier, we follow standard practice (Chen et al., 2024; Orgad et al., 2025) and employ a logistic regression model implemented in scikit-learn.
B.3 Models
Our analysis covers a diverse collection of 12 LLMs that vary in both scale and architectural design. Specifically, we consider three categories: (1) base models, including Llama-3.2-1B (Grattafiori et al., 2024), Llama-3.2-3B, Llama-3-8B, Llama-3-70B, Mistral-7B-v0.1 (Jiang et al., 2023), and Mistral-7B-v0.3; (2) instruction-tuned models, including Llama-3.2-3B-Instruct, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Mistral-7B-Instruct-v0.3; and (3) reasoning-oriented models, namely Qwen3-8B (Yang et al., 2025) and Qwen3-32B.
B.4 Datasets
We consider four widely used question–answering datasets: PopQA (Mallen et al., 2023), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019).
PopQA is an open-domain question-answering dataset that emphasizes entity-centric factual knowledge with a long-tail distribution. It is designed to probe LLMs’ ability to memorize less frequent facts, highlighting limitations in parametric knowledge.
TriviaQA is a reading comprehension dataset constructed by pairing trivia questions with evidence documents that were collected independently of the questions. The questions are often complex, requiring multi-sentence reasoning, and exhibit substantial lexical and syntactic variability.
HotpotQA is a challenging multi-hop question-answering dataset that requires reasoning over multiple supporting documents. It includes diverse question types (span extraction, yes/no, and novel comparison questions) along with sentence-level supporting fact annotations, promoting the development of explainable QA systems.
Natural Questions is an open-domain dataset consisting of real, anonymized questions from Google search queries. Each question is annotated with both a long answer (paragraph or section) and a short answer (span or yes/no), or marked as null when no answer is available.
Due to computational constraints, we randomly sample 2,000 training samples and 2,000 test samples for each dataset.
B.5 Implementation Details of Baselines
In the application experiments, we compare our proposed methods against several baselines for hallucination detection that rely on internal signals. These baselines leverage the LLM’s internal signals, such as output probabilities, logits, and hidden representations, without relying on external resources. Below, we detail the implementation of each baseline.
P(True)
P(True) (Kadavath et al., 2022) exploits the LLM’s self-awareness of its knowledge boundaries by prompting the model to assess the correctness of its own generated answer. Specifically, for each question-answer pair $(q_{i},\hat{y}^{f}_{i})$ , we prompt the LLM with a template that asks it to evaluate whether its answer is factually correct. Following Kadavath et al. (2022), the prompt template is shown in Table 4.
| Question: {Here is the question} |
| --- |
| Possible answer: {Here is the answer} |
| Is the possible answer: |
| (A) True |
| (B) False |
| The possible answer is: |
Table 4: Prompt template used for the P(True) baseline.
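As a rough sketch of how this baseline might be scored, the snippet below fills the template and converts the model's next-token logits for the two options into a probability. The helper names are hypothetical, and renormalising over only the (A) and (B) options is an assumption; the source does not state how the score is read out.

```python
import math

# Template mirrors Table 4; placeholder names are illustrative.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Possible answer: {answer}\n"
    "Is the possible answer:\n"
    "(A) True\n"
    "(B) False\n"
    "The possible answer is:"
)


def build_p_true_prompt(question: str, answer: str) -> str:
    """Fill the P(True) template with one question-answer pair."""
    return PROMPT_TEMPLATE.format(question=question, answer=answer)


def p_true(logit_a: float, logit_b: float) -> float:
    """Probability of option (A), renormalised over the two options.

    `logit_a` and `logit_b` are the next-token logits for the option tokens;
    restricting the softmax to these two entries is an assumption.
    """
    m = max(logit_a, logit_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logit_a - m), math.exp(logit_b - m)
    return ea / (ea + eb)


prompt = build_p_true_prompt("Who invented the telephone?", "Thomas Edison")
```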
Logits-based Baselines
The logits-based baselines utilize the raw logits produced by the LLM during the generation of the exact answer tokens. Let $\hat{y}^{f}_{i,E_{A}}=[t_{1},t_{2},\dots,t_{m}]$ represent the sequence of exact answer tokens for a given question-answer pair, where $m$ is the number of exact answer tokens. For each token $t_{j}$ (where $j\in\{1,\dots,m\}$ ), the LLM produces a logit vector $L_{j}\in\mathbb{R}^{V}$ , where $V$ is the vocabulary size, and the logit for the generated token $t_{j}$ is denoted $L_{j}[t_{j}]$ . The logits-based metrics are defined as follows:
- Logits-mean: The average of the logits across all exact answer tokens:
$$
\text{Logits-mean}=\frac{1}{m}\sum_{j=1}^{m}L_{j}[t_{j}] \tag{6}
$$
- Logits-max: The maximum logit value among the exact answer tokens:
$$
\text{Logits-max}=\max_{j\in\{1,\dots,m\}}L_{j}[t_{j}] \tag{7}
$$
- Logits-min: The minimum logit value among the exact answer tokens:
$$
\text{Logits-min}=\min_{j\in\{1,\dots,m\}}L_{j}[t_{j}] \tag{8}
$$
These metrics serve as proxies for the model’s confidence in the generated answer, with lower logit values potentially indicating uncertainty or hallucination.
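Equations (6) to (8) can be computed in a few lines. The sketch below uses plain Python lists for clarity; the function name and the dictionary keys are illustrative and do not come from the paper.

```python
from typing import Dict, Sequence


def logit_metrics(logits: Sequence[Sequence[float]],
                  token_ids: Sequence[int]) -> Dict[str, float]:
    """Logits-mean, Logits-max and Logits-min over the exact answer tokens.

    logits[j] is the length-V logit vector L_j at the step that produced
    token_ids[j] = t_j, so logits[j][token_ids[j]] is L_j[t_j].
    """
    if len(logits) != len(token_ids) or not token_ids:
        raise ValueError("need one logit vector per exact answer token")
    selected = [row[t] for row, t in zip(logits, token_ids)]
    return {
        "logits_mean": sum(selected) / len(selected),  # Eq. (6)
        "logits_max": max(selected),                   # Eq. (7)
        "logits_min": min(selected),                   # Eq. (8)
    }


# Toy example with m = 2 answer tokens and a vocabulary of size V = 3.
metrics = logit_metrics([[1.0, 3.0, 0.5], [2.0, -1.0, 4.0]], [1, 2])
```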
Scores-based Baselines
The scores-based baselines are derived from the softmax probabilities of the exact answer tokens. Using the same notation as above, for each exact answer token $t_{j}$ , the softmax probability is computed as:
$$
p_{j}[t_{j}]=\frac{\exp(L_{j}[t_{j}])}{\sum_{k=1}^{V}\exp(L_{j}[k])} \tag{9}
$$
where $L_{j}[k]$ is the logit for the $k$ -th token in the vocabulary. The scores-based metrics are defined as follows:
- Scores-mean: The average of the softmax probabilities across all exact answer tokens:
$$
\text{Scores-mean}=\frac{1}{m}\sum_{j=1}^{m}p_{j}[t_{j}] \tag{10}
$$
- Scores-max: The maximum softmax probability among the exact answer tokens:
$$
\text{Scores-max}=\max_{j\in\{1,\dots,m\}}p_{j}[t_{j}] \tag{11}
$$
- Scores-min: The minimum softmax probability among the exact answer tokens:
$$
\text{Scores-min}=\min_{j\in\{1,\dots,m\}}p_{j}[t_{j}] \tag{12}
$$
These probabilities provide a normalized measure of the model’s confidence, bounded between 0 and 1, with lower values potentially indicating a higher likelihood of hallucination.
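A corresponding sketch for Equations (9) to (12) follows. It applies a numerically stable softmax to each logit vector before selecting the probability of the generated token; the function names are again illustrative.

```python
import math
from typing import Dict, List, Sequence


def softmax(row: Sequence[float]) -> List[float]:
    """Softmax over one logit vector (Eq. 9), shifted by the max for stability."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def score_metrics(logits: Sequence[Sequence[float]],
                  token_ids: Sequence[int]) -> Dict[str, float]:
    """Scores-mean, Scores-max and Scores-min over the exact answer tokens."""
    if len(logits) != len(token_ids) or not token_ids:
        raise ValueError("need one logit vector per exact answer token")
    probs = [softmax(row)[t] for row, t in zip(logits, token_ids)]
    return {
        "scores_mean": sum(probs) / len(probs),  # Eq. (10)
        "scores_max": max(probs),                # Eq. (11)
        "scores_min": min(probs),                # Eq. (12)
    }


# Toy example: the two selected probabilities are 0.5 and 0.75.
scores = score_metrics([[0.0, 0.0], [math.log(3.0), 0.0]], [0, 0])
```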
Probing Baseline
The probing baseline follows the standard approach described in Chen et al. (2024) and Orgad et al. (2025). A linear classifier is trained on the hidden representations of the last exact answer token from the best-performing layer. The training and evaluation data for the probing classifier are constructed following the procedure described in Appendix B.4. The classifier is implemented using scikit-learn with default hyperparameters, consistent with the probing setup described in Appendix B.2. The probing baseline serves as a direct comparison to our proposed applications, as it relies on the same type of internal signals but does not account for the heterogeneity of truthfulness encoding pathways.
B.6 Implementation Details of MoP and PR
Model Backbone and Hidden Representations
All experiments use the same base LLM as in the main paper. Hidden representations $\mathbf{h}^{l^{*}}(x)$ are extracted from the best-performing layer $l^{*}$ determined on a held-out validation split.
Mixture-of-Probes (MoP)
As in Appendix B.5, the two expert probes $p_{Q}$ and $p_{A}$ are implemented using scikit-learn with default hyperparameters, consistent with the probing setup described in Appendix B.2. The gating network is taken directly from the self-awareness probe described in Section 4.2. The training and evaluation data for the probing classifier are the same as in Appendix B.5. The proposed MoP framework requires no additional training: we directly combine the two expert probes with the pathway-discrimination classifier described in Section 4.2 and perform inference without further parameter updates.
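To make the training-free combination concrete, here is a minimal sketch. It assumes a soft, gate-weighted convex mixture of the two expert probabilities; the exact combination rule is defined in the main text and may differ. The probes are passed in as plain callables, so all names below are illustrative.

```python
from typing import Callable, Sequence

# A probe maps a hidden representation to a probability in [0, 1].
Probe = Callable[[Sequence[float]], float]


def mixture_of_probes(h: Sequence[float],
                      p_q: Probe,
                      p_a: Probe,
                      gate: Probe) -> float:
    """Combine two frozen expert probes using a frozen gating probe.

    Assumption: gate(h) is the probability that the sample follows the
    Question-Anchored pathway, and the experts are mixed convexly.
    """
    g = gate(h)
    return g * p_q(h) + (1.0 - g) * p_a(h)


# Toy usage with constant probes standing in for fitted classifiers.
score = mixture_of_probes(
    [0.1, -0.3],
    p_q=lambda h: 0.8,
    p_a=lambda h: 0.4,
    gate=lambda h: 0.25,
)
```

With fitted scikit-learn classifiers, each callable could wrap `predict_proba` on the corresponding probe; no parameters are updated at any point.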
Pathway Reweighting (PR)
The training and evaluation data used for the probing classifier are identical to those described in Appendix B.5. For each Transformer layer $l\leq l^{*}$ , we introduce two learnable scalars $\alpha_{Q}^{l}$ and $\alpha_{A}^{l}$ for every attention head. These parameters, together with the probe parameters, are optimized using the Adam optimizer with a learning rate of $1\times 10^{-2}$ , $\beta_{1}=0.9$ , and $\beta_{2}=0.999$ . Training is conducted with a batch size of 512 for 10 epochs, while all original LLM parameters remain frozen.
| You are given a factual open-domain question-answer pair. |
| --- |
| Your task is to identify: |
| 1. Core Entity (c) - the known specific entity in the question that the answer is about (a person, place, organization, or other proper noun). |
| 2. Relation (r) - the minimal phrase in the question that expresses what is being asked about the core entity, using only words from the question. |
| Guidelines: |
| The core entity must be a concrete, known entity mentioned in the question, not a general category. |
| If multiple entities appear, choose the one most central to the question—the entity the answer primarily concerns. |
| The relation should be the smallest meaningful span that directly connects the core entity to the answer. |
| Use only words from the question; do not paraphrase or add new words. |
| Exclude extra context, modifiers, or descriptive phrases that are not essential to defining the relationship. |
| For complex questions with long modifiers or embedded clauses, focus on the words that directly express the property, action, or attribute of the core entity relevant to the answer. |
| If you cannot confidently identify the core entity or the relation, output NO ANSWER. |
| Output format: |
| Core Entity: exact text |
| Relation: exact text |
| Example 1 |
| Question: Who was the director of Finale? |
| Answer: Ken Kwapis |
| Core Entity: Finale |
| Relation: director |
| Example 2 |
| Question: What film, in production between 2007 and 2009, is directed by James Cameron ("Titanic")? |
| Answer: Avatar |
| Core Entity: James Cameron |
| Relation: film directed by |
| Example 3 |
| Question: Which novel, written in 1925 and often cited as a classic of American literature, was authored by F. Scott Fitzgerald? |
| Answer: The Great Gatsby |
| Core Entity: F. Scott Fitzgerald |
| Relation: novel authored by |
| Question: {Here is the question} |
| Answer: {Here is the answer} |
Table 5: Prompt template used to locate the exact question tokens.
| You are given a factual open-domain question-answer pair. |
| --- |
| Your task is to extract the short exact answer from the long answer. |
| Guidelines: |
| Retain only the relevant tokens from the answer. |
| Use only words from the answer; do not paraphrase or add new words. |
| Extract the short answer regardless of whether the long answer is correct or incorrect. |
| If the short answer cannot be confidently determined, or if the long answer does not address the question, output NO ANSWER. |
| Example 1 |
| Q: Which musical featured the song The Street Where You Live? |
| A: The song "The Street Where You Live" is from the Lerner and Loewe musical "My Fair Lady." It is one of the most famous songs from the show, and it is sung by Professor Henry Higgins as he reflects on the transformation of Eliza Doolittle and the memories they have shared together. |
| Exact answer: My Fair Lady |
| Example 2 |
| Q: Which Swedish actress won the Best Supporting Actress Oscar for Murder on the Orient Express? |
| A: I’m glad you asked about a Swedish actress who won an Oscar for "Murder on the Orient Express," but I must clarify that there seems to be a misunderstanding here. No Swedish actress has won an Oscar for Best Supporting Actress for that film. The 1974 "Murder on the Orient Express" was an American production, and the cast was predominantly British and American. If you have any other questions or if there’s another |
| Exact answer: NO ANSWER |
| Example 3 |
| Q: Who invented the telephone? |
| A: The telephone was invented by Thomas Edison in the 19th century. |
| Exact answer: Thomas Edison |
| Q: {Here is the question} |
| A: {Here is the answer} |
| Exact answer: |
Table 6: Prompt template used to locate the exact answer tokens.
| You are given one factual question. Interpret it literally and think carefully. |
| --- |
| Your task is to decide whether you can answer it correctly with high confidence based only on your internal knowledge (no tools or web). If yes, output exactly: YES. If not or uncertain, output exactly: NO. You should output one word only. |
| Question: {Here is the question} |
| Your Output: |
Table 7: Prompt template used to obtain the i-don’t-know rate.
Appendix C Attention Knockout
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Graph: ΔP Trends Across Layers for Llama-3.2-1B and Llama-3.2-3B Models
### Overview
The image contains two side-by-side line graphs comparing ΔP (change in performance) across neural network layers for two versions of the Llama model (3.2-1B and 3.2-3B). Each graph tracks six distinct data series representing different anchoring methods (Q-Anchored vs. A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The graphs show performance variability across layers, with shaded regions indicating uncertainty ranges.
### Components/Axes
- **Left Chart**: Llama-3.2-1B (15 layers)
- **Right Chart**: Llama-3.2-3B (25 layers)
- **Y-Axis**: ΔP (Performance Change) ranging from -60 to 0
- **X-Axis**: Layer number (0–15 for 1B, 0–25 for 3B)
- **Legend**: Located at the bottom, with six entries:
1. **Q-Anchored (PopQA)**: Solid blue line
2. **A-Anchored (PopQA)**: Dashed orange line
3. **Q-Anchored (TriviaQA)**: Dotted green line
4. **A-Anchored (TriviaQA)**: Dash-dot purple line
5. **Q-Anchored (HotpotQA)**: Solid purple line
6. **A-Anchored (NQ)**: Dashed gray line
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **Q-Anchored (PopQA)**: Starts at ~-10ΔP, dips to ~-50ΔP at layer 10, then rises to ~-30ΔP at layer 15.
- **A-Anchored (PopQA)**: Starts at ~-5ΔP, fluctuates between -10ΔP and -20ΔP, ending at ~-15ΔP.
- **Q-Anchored (TriviaQA)**: Begins at ~-20ΔP, peaks at ~-10ΔP at layer 5, then drops to ~-40ΔP.
- **A-Anchored (TriviaQA)**: Starts at ~-15ΔP, dips to ~-35ΔP at layer 10, then recovers to ~-25ΔP.
- **Q-Anchored (HotpotQA)**: Starts at ~-10ΔP, drops sharply to ~-50ΔP at layer 10, then rises to ~-30ΔP.
- **A-Anchored (NQ)**: Starts at ~-5ΔP, fluctuates between -10ΔP and -20ΔP, ending at ~-15ΔP.
#### Llama-3.2-3B (Right Chart)
- **Q-Anchored (PopQA)**: Starts at ~-10ΔP, dips to ~-50ΔP at layer 15, then rises to ~-30ΔP at layer 25.
- **A-Anchored (PopQA)**: Starts at ~-5ΔP, fluctuates between -10ΔP and -20ΔP, ending at ~-15ΔP.
- **Q-Anchored (TriviaQA)**: Begins at ~-20ΔP, peaks at ~-10ΔP at layer 10, then drops to ~-40ΔP.
- **A-Anchored (TriviaQA)**: Starts at ~-15ΔP, dips to ~-35ΔP at layer 20, then recovers to ~-25ΔP.
- **Q-Anchored (HotpotQA)**: Starts at ~-10ΔP, drops sharply to ~-60ΔP at layer 20, then rises to ~-40ΔP.
- **A-Anchored (NQ)**: Starts at ~-5ΔP, fluctuates between -10ΔP and -20ΔP, ending at ~-15ΔP.
### Key Observations
1. **Layer-Specific Trends**:
- ΔP generally decreases (worsens) as layers increase, with sharper declines in middle layers (e.g., layer 10–15 for 1B, layer 20 for 3B).
- Q-Anchored methods show more pronounced dips than A-Anchored methods in most cases.
2. **Dataset Variability**:
- HotpotQA (Q-Anchored) exhibits the most extreme ΔP drops (~-60ΔP in 3B model).
- NQ (A-Anchored) shows the least variability, maintaining ΔP between -10ΔP and -20ΔP.
3. **Uncertainty Patterns**:
- Shaded regions (likely confidence intervals) widen in middle layers, indicating higher variability in performance changes.
4. **Model Size Differences**:
- The 3B model (right chart) shows extended trends but similar patterns to the 1B model, with more pronounced fluctuations in later layers.
### Interpretation
The data suggests that anchoring methods (Q vs. A) and datasets significantly influence ΔP across layers. Q-Anchored methods generally underperform (lower ΔP) compared to A-Anchored methods, particularly in middle layers. The HotpotQA dataset amplifies performance drops, while NQ maintains stability. The 3B model’s extended layers reveal sustained trends but increased variability, implying that larger models may require more robust anchoring strategies to mitigate layer-specific performance degradation. The shaded regions highlight the need for further investigation into the sources of uncertainty, potentially linked to dataset complexity or model architecture differences.
</details>
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Graph: Performance Comparison of Q-Anchored and A-Anchored Models Across Layers for Llama-3-8B and Llama-3-70B
### Overview
The image contains two side-by-side line graphs comparing the performance (ΔP) of Q-Anchored and A-Anchored models across layers for two Llama versions: 3-8B (left) and 3-70B (right). The graphs show eight distinct data series, each representing a combination of anchoring method (Q or A) and dataset (PopQA, TriviaQA, HotpotQA, NQ). Performance is measured as ΔP (change in performance) across layers, with values ranging from -80 to +20.
### Components/Axes
- **X-Axis (Horizontal)**: Layer (0 to 30 for Llama-3-8B, 0 to 80 for Llama-3-70B)
- **Y-Axis (Vertical)**: ΔP (Performance Change, range: -80 to +20)
- **Legend**: Located at the bottom, with eight entries:
1. **Q-Anchored (PopQA)**: Solid blue line
2. **Q-Anchored (TriviaQA)**: Dotted green line
3. **Q-Anchored (HotpotQA)**: Dashed purple line
4. **Q-Anchored (NQ)**: Dotted pink line
5. **A-Anchored (PopQA)**: Solid orange line
6. **A-Anchored (TriviaQA)**: Dotted orange line
7. **A-Anchored (HotpotQA)**: Dashed orange line
8. **A-Anchored (NQ)**: Dotted gray line
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **Q-Anchored (PopQA)**: Starts near 0, decreases sharply to ~-60 by layer 30.
- **Q-Anchored (TriviaQA)**: Begins at ~-10, fluctuates between -20 and 0, ending near -40.
- **Q-Anchored (HotpotQA)**: Starts at ~-5, peaks at ~+5 around layer 15, then drops to ~-30.
- **Q-Anchored (NQ)**: Starts at ~-15, stabilizes near -10 by layer 30.
- **A-Anchored (PopQA)**: Starts at ~-5, fluctuates between -10 and 0, ending near -15.
- **A-Anchored (TriviaQA)**: Begins at ~-20, rises to ~-5 by layer 10, then drops to ~-35.
- **A-Anchored (HotpotQA)**: Starts at ~-10, peaks at ~+10 around layer 15, then declines to ~-25.
- **A-Anchored (NQ)**: Starts at ~-25, stabilizes near -20 by layer 30.
#### Llama-3-70B (Right Chart)
- **Q-Anchored (PopQA)**: Starts near 0, decreases to ~-70 by layer 80.
- **Q-Anchored (TriviaQA)**: Begins at ~-10, fluctuates between -30 and -10, ending near -50.
- **Q-Anchored (HotpotQA)**: Starts at ~-5, peaks at ~+10 around layer 40, then drops to ~-60.
- **Q-Anchored (NQ)**: Starts at ~-20, stabilizes near -30 by layer 80.
- **A-Anchored (PopQA)**: Starts at ~-5, fluctuates between -15 and 0, ending near -20.
- **A-Anchored (TriviaQA)**: Begins at ~-20, rises to ~-5 by layer 20, then drops to ~-40.
- **A-Anchored (HotpotQA)**: Starts at ~-10, peaks at ~+15 around layer 40, then declines to ~-35.
- **A-Anchored (NQ)**: Starts at ~-25, stabilizes near -35 by layer 80.
### Key Observations
1. **General Trend**: Most lines show a downward trend in ΔP as layers increase, indicating performance degradation.
2. **Model Size Impact**: Llama-3-70B exhibits more pronounced fluctuations and steeper declines compared to Llama-3-8B.
3. **Anchoring Method**: Q-Anchored models generally perform worse (lower ΔP) than A-Anchored models across most datasets.
4. **Dataset Variability**: HotpotQA and NQ datasets show higher volatility in performance compared to PopQA and TriviaQA.
5. **Layer-Specific Peaks**: Some lines (e.g., Q-Anchored HotpotQA in Llama-3-70B) show mid-layer performance peaks before declining.
### Interpretation
The data suggests that anchoring method (Q vs. A) and dataset choice significantly influence model performance. A-Anchored models consistently outperform Q-Anchored counterparts, particularly in larger models (70B). The HotpotQA and NQ datasets appear more challenging, causing sharper performance drops. The mid-layer peaks observed in some lines (e.g., HotpotQA) may indicate temporary stabilization or optimization points. The larger model (70B) shows greater sensitivity to anchoring choices, with more extreme performance variations. These trends highlight the importance of anchoring strategy and dataset selection in fine-tuning large language models.
</details>
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Graph: ΔP vs. Layer in Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two side-by-side line graphs comparing the performance of Q-Anchored and A-Anchored methods across different datasets (PopQA, TriviaQA, HotpotQA, NQ) in Mistral-7B models (v0.1 and v0.3). The y-axis represents ΔP (change in performance), and the x-axis represents model layers (0–30). Each line corresponds to a specific anchoring method and dataset, with distinct colors and styles.
---
### Components/Axes
- **Y-Axis**: ΔP (Performance Change), ranging from -60 to 0.
- **X-Axis**: Layer (0–30), representing model depth.
- **Legends**:
- **Left Graph (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed gray: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: A-Anchored (HotpotQA)
- Solid red: Q-Anchored (NQ)
- Dashed brown: A-Anchored (NQ)
- **Right Graph (v0.3)**:
- Same legend as v0.1 but applied to updated model version.
---
### Detailed Analysis
#### Mistral-7B-v0.1
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at ΔP ≈ 0 (layer 0).
- Sharp decline to ΔP ≈ -50 (layer 10).
- Fluctuates between -30 and -50 until layer 30.
2. **A-Anchored (PopQA)** (dashed orange):
- Starts at ΔP ≈ 0.
- Gradual decline to ΔP ≈ -30 (layer 10).
- Stabilizes around -25–-30.
3. **Q-Anchored (TriviaQA)** (solid green):
- Sharp drop to ΔP ≈ -40 (layer 5).
- Oscillates between -20 and -40.
4. **A-Anchored (TriviaQA)** (dashed gray):
- Smoother decline to ΔP ≈ -25 (layer 10).
- Stabilizes around -20–-25.
5. **Q-Anchored (HotpotQA)** (solid purple):
- Moderate decline to ΔP ≈ -35 (layer 15).
- Fluctuates between -25 and -35.
6. **A-Anchored (HotpotQA)** (dashed pink):
- Gradual decline to ΔP ≈ -20 (layer 20).
- Stabilizes around -15–-20.
7. **Q-Anchored (NQ)** (solid red):
- Sharp drop to ΔP ≈ -55 (layer 10).
- Recovers to ΔP ≈ -40 (layer 30).
8. **A-Anchored (NQ)** (dashed brown):
- Steady decline to ΔP ≈ -30 (layer 20).
- Stabilizes around -25–-30.
#### Mistral-7B-v0.3
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at ΔP ≈ 0.
- Gradual decline to ΔP ≈ -25 (layer 20).
- Stabilizes around -20–-25.
2. **A-Anchored (PopQA)** (dashed orange):
- Smooth decline to ΔP ≈ -20 (layer 20).
- Stabilizes around -15–-20.
3. **Q-Anchored (TriviaQA)** (solid green):
- Moderate decline to ΔP ≈ -30 (layer 15).
- Fluctuates between -20 and -30.
4. **A-Anchored (TriviaQA)** (dashed gray):
- Gradual decline to ΔP ≈ -22 (layer 25).
- Stabilizes around -18–-22.
5. **Q-Anchored (HotpotQA)** (solid purple):
- Slight decline to ΔP ≈ -15 (layer 10).
- Stabilizes around -10–-15.
6. **A-Anchored (HotpotQA)** (dashed pink):
- Minimal decline to ΔP ≈ -10 (layer 20).
- Stabilizes around -5–-10.
7. **Q-Anchored (NQ)** (solid red):
- Sharp drop to ΔP ≈ -45 (layer 10).
- Recovers to ΔP ≈ -30 (layer 30).
8. **A-Anchored (NQ)** (dashed brown):
- Steady decline to ΔP ≈ -25 (layer 25).
- Stabilizes around -20–-25.
---
### Key Observations
1. **General Trend**: Both models show a decline in ΔP across layers, but v0.3 exhibits smoother and more stable trends.
2. **Q-Anchored vs. A-Anchored**:
- Q-Anchored methods (solid lines) exhibit sharper initial declines and greater volatility, especially in v0.1.
- A-Anchored methods (dashed lines) show more gradual and stable performance.
3. **Dataset Impact**:
- **PopQA/TriviaQA**: Higher volatility in Q-Anchored methods.
- **HotpotQA/NQ**: Smoother trends, with NQ showing the most extreme initial drops.
4. **Version Comparison**:
- v0.3 demonstrates improved stability across all methods, with reduced fluctuations compared to v0.1.
---
### Interpretation
The data suggests that anchoring methods significantly influence model performance stability. Q-Anchored methods are more sensitive to layer changes, leading to larger ΔP variations, while A-Anchored methods maintain steadier performance. The datasets' complexity correlates with volatility: simpler datasets (e.g., PopQA) show sharper declines, while complex ones (e.g., HotpotQA) exhibit smoother trends. The transition from v0.1 to v0.3 indicates architectural improvements, reducing performance instability. Notably, Q-Anchored (NQ) in v0.1 experiences the most drastic drop (-55), suggesting potential overfitting or dataset-specific challenges. These findings highlight the importance of anchoring strategy selection based on dataset characteristics and model version.
</details>
Figure 7: $\Delta\mathrm{P}$ under attention knockout, probing attention activations of the final token.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: Llama-3.2-1B and Llama-3.2-3B Layer Performance Comparison
### Overview
The image contains two side-by-side line charts comparing performance metrics (ΔP) across neural network layers for two versions of the Llama-3.2 model (1B and 3B parameters). Each chart tracks performance across 15 and 25 layers respectively, with multiple data series representing different anchoring strategies and datasets.
### Components/Axes
- **X-axis (Layer)**:
- Llama-3.2-1B: 0–15 (integer increments)
- Llama-3.2-3B: 0–25 (integer increments)
- **Y-axis (ΔP)**:
- Range: -80 to +20 (integer increments)
- Units: Unspecified performance metric (likely perplexity or task-specific score)
- **Legends**:
- **Llama-3.2-1B Panel**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted orange: A-Anchored (PopQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: Q-Anchored (NQ)
- **Llama-3.2-3B Panel**:
- Same color coding as above, with additional lines for 3B-specific data
### Detailed Analysis
**Llama-3.2-1B Panel**:
- **Q-Anchored (PopQA)**: Starts at ~0ΔP, declines to -50ΔP by layer 15 (blue line)
- **Q-Anchored (TriviaQA)**: Peaks at -10ΔP (layer 5), ends at -45ΔP (green dashed line)
- **A-Anchored (PopQA)**: Starts at ~0ΔP, fluctuates between -10ΔP and +5ΔP (orange dotted line)
- **A-Anchored (TriviaQA)**: Declines from 0ΔP to -30ΔP (red dashed line)
- **Q-Anchored (HotpotQA)**: Sharp drop to -60ΔP by layer 10, recovers slightly (purple solid line)
- **Q-Anchored (NQ)**: Most volatile, reaches -70ΔP at layer 12 (pink dashed line)
**Llama-3.2-3B Panel**:
- **Q-Anchored (PopQA)**: Starts at ~0ΔP, ends at -40ΔP (blue solid line)
- **Q-Anchored (TriviaQA)**: Peaks at -15ΔP (layer 5), ends at -55ΔP (green dashed line)
- **A-Anchored (PopQA)**: Starts at ~0ΔP, ends at -20ΔP (orange dotted line)
- **A-Anchored (TriviaQA)**: Declines from 0ΔP to -45ΔP (red dashed line)
- **Q-Anchored (HotpotQA)**: Sharp drop to -75ΔP at layer 10, recovers to -50ΔP (purple solid line)
- **Q-Anchored (NQ)**: Most volatile, reaches -85ΔP at layer 15 (pink dashed line)
### Key Observations
1. **Model Size Impact**:
- 3B model shows more pronounced performance drops (ΔP) in later layers (15–25) compared to 1B model
- Q-Anchored models in 3B panel exhibit 20–30% greater ΔP magnitude than 1B counterparts
2. **Dataset Sensitivity**:
- HotpotQA and NQ datasets show the most extreme performance drops (up to -85ΔP)
- TriviaQA consistently shows mid-range performance across both models
3. **Anchoring Strategy**:
- Q-Anchored series generally show larger negative ΔP than A-Anchored series in later layers
- A-Anchored series are more stable, with smaller deviations from zero
4. **Layer-Specific Trends**:
- Layer 5–10 shows steepest performance declines across all models
- 3B model exhibits increased volatility in layers 15–20 (e.g., NQ line has 3 peaks > -70ΔP)
### Interpretation
The data suggests that:
1. **Q-Anchored series** show the larger ΔP drops under attention knockout, and the gap relative to the A-Anchored series widens in deeper layers
2. **Dataset differences**: HotpotQA and NQ show the largest drops, PopQA and TriviaQA smaller ones
3. **Model size**: the 3B panel generally shows larger drops and more layer-to-layer volatility than the 1B panel
4. **Anchoring type**: the A-Anchored series stay closer to zero, indicating that they are less affected by the knockout
The charts indicate that the effect of attention knockout depends on the anchoring type, the dataset, and the layer.
</details>
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Chart: Llama-3-8B and Llama-3-70B Model Performance Comparison
### Overview
The image contains two side-by-side line charts comparing the performance of Q-Anchored and A-Anchored models across different datasets (PopQA, TriviaQA, HotpotQA, NQ) for two sizes of the Llama-3 model (8B and 70B). The y-axis represents ΔP (change in performance), and the x-axis represents model layers. Each chart shows distinct trends for Q-Anchored (solid lines) and A-Anchored (dashed lines) configurations.
---
### Components/Axes
- **X-Axis (Layer)**:
- Llama-3-8B: 0 to 30 (integer increments).
- Llama-3-70B: 0 to 80 (integer increments).
- **Y-Axis (ΔP)**:
- Range: -80 to 20 (integer increments).
- **Legends**:
- Positioned at the bottom of each chart.
- Colors and styles correspond to:
- **Q-Anchored**: Solid lines (blue, green, purple, pink).
- **A-Anchored**: Dashed lines (orange, gray, brown, black).
- Datasets: PopQA, TriviaQA, HotpotQA, NQ.
---
### Detailed Analysis
#### Llama-3-8B Chart
- **Q-Anchored (PopQA)**: Blue solid line. Starts at 0, dips sharply to -60 by layer 10, then fluctuates between -40 and -20.
- **Q-Anchored (TriviaQA)**: Green dashed line. Starts at 0, drops to -50 by layer 15, then stabilizes near -30.
- **Q-Anchored (HotpotQA)**: Purple solid line. Starts at 0, declines to -70 by layer 20, then oscillates between -50 and -30.
- **Q-Anchored (NQ)**: Pink dashed line. Starts at 0, dips to -40 by layer 10, then stabilizes near -20.
- **A-Anchored (PopQA)**: Orange solid line. Remains near 0 with minor fluctuations.
- **A-Anchored (TriviaQA)**: Gray dashed line. Starts at 0, dips to -10 by layer 10, then stabilizes.
- **A-Anchored (HotpotQA)**: Brown solid line. Starts at 0, fluctuates between -5 and 5.
- **A-Anchored (NQ)**: Black dashed line. Starts at 0, dips to -5 by layer 10, then stabilizes.
#### Llama-3-70B Chart
- **Q-Anchored (PopQA)**: Blue solid line. Starts at 0, drops to -80 by layer 40, then fluctuates between -60 and -40.
- **Q-Anchored (TriviaQA)**: Green dashed line. Starts at 0, declines to -70 by layer 50, then stabilizes near -50.
- **Q-Anchored (HotpotQA)**: Purple solid line. Starts at 0, drops to -90 by layer 60, then oscillates between -70 and -50.
- **Q-Anchored (NQ)**: Pink dashed line. Starts at 0, dips to -60 by layer 30, then stabilizes near -40.
- **A-Anchored (PopQA)**: Orange solid line. Remains near 0 with minor fluctuations.
- **A-Anchored (TriviaQA)**: Gray dashed line. Starts at 0, dips to -15 by layer 20, then stabilizes.
- **A-Anchored (HotpotQA)**: Brown solid line. Starts at 0, fluctuates between -10 and 10.
- **A-Anchored (NQ)**: Black dashed line. Starts at 0, dips to -10 by layer 10, then stabilizes.
---
### Key Observations
1. **Q-Anchored vs. A-Anchored**:
- Q-Anchored models show larger ΔP deviations (negative trends) across all datasets, especially in deeper layers.
- A-Anchored models exhibit smaller, more stable ΔP values, often remaining near 0.
2. **Model Size Impact**:
- Llama-3-70B shows more pronounced ΔP declines for Q-Anchored models compared to Llama-3-8B, suggesting scalability challenges.
3. **Dataset Sensitivity**:
- HotpotQA (Q-Anchored) demonstrates the steepest ΔP decline in both models, indicating higher sensitivity to anchoring methods.
4. **Layer Depth Correlation**:
- ΔP trends generally worsen as layer depth increases, particularly for Q-Anchored configurations.
---
### Interpretation
The data suggests that **Q-Anchored models** are more sensitive to layer depth and dataset complexity, leading to larger performance deviations (ΔP). This could imply that Q-Anchored configurations struggle with maintaining consistency in deeper layers or with complex datasets like HotpotQA. In contrast, **A-Anchored models** maintain stability, indicating robustness to layer depth and dataset variations. The Llama-3-70B model’s amplified ΔP trends for Q-Anchored configurations highlight potential scalability issues, suggesting that anchoring strategies may need adjustment for larger models. The divergence between Q and A anchoring methods underscores the importance of anchoring choice in model performance optimization.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Graph: ΔP Values Across Layers in Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two side-by-side line graphs comparing ΔP values across 30 layers of the Mistral-7B model in versions v0.1 (left) and v0.3 (right). Each graph includes six data series representing different anchoring methods (Q-Anchored/A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis ranges from -80 to 20, while the x-axis spans layers 0–30.
---
### Components/Axes
- **Left Graph**: Mistral-7B-v0.1
- **Right Graph**: Mistral-7B-v0.3
- **Y-Axis**: ΔP (values from -80 to 20)
- **X-Axis**: Layer (0–30)
- **Legend**: Located at the bottom, with six entries:
1. **Q-Anchored (PopQA)**: Solid blue line
2. **A-Anchored (PopQA)**: Dashed orange line
3. **Q-Anchored (TriviaQA)**: Dotted green line
4. **A-Anchored (TriviaQA)**: Dash-dot purple line
5. **Q-Anchored (HotpotQA)**: Solid purple line
6. **A-Anchored (NQ)**: Dashed orange line (note: overlaps with A-Anchored PopQA style)
---
### Detailed Analysis
#### Mistral-7B-v0.1 (Left Graph)
- **Q-Anchored (PopQA)**: Starts at 0, dips to ~-45 at layer 10, recovers to ~-10 by layer 30.
- **A-Anchored (PopQA)**: Starts at ~-5, fluctuates between -10 and 0, ending at ~-5.
- **Q-Anchored (TriviaQA)**: Starts at ~-5, dips to ~-30 at layer 15, recovers to ~-15.
- **A-Anchored (TriviaQA)**: Starts at ~-10, peaks at ~-5 at layer 5, ends at ~-20.
- **Q-Anchored (HotpotQA)**: Starts at ~-5, dips to ~-40 at layer 20, recovers to ~-10.
- **A-Anchored (NQ)**: Starts at ~-5, fluctuates between -10 and 0, ending at ~-5.
#### Mistral-7B-v0.3 (Right Graph)
- **Q-Anchored (PopQA)**: Starts at 0, plunges to ~-60 at layer 15, recovers to ~-20 by layer 30.
- **A-Anchored (PopQA)**: Starts at ~-5, dips to ~-40 at layer 10, fluctuates to ~-10.
- **Q-Anchored (TriviaQA)**: Starts at ~-5, dips to ~-50 at layer 12, recovers to ~-25.
- **A-Anchored (TriviaQA)**: Starts at ~-10, peaks at ~-5 at layer 5, ends at ~-30.
- **Q-Anchored (HotpotQA)**: Starts at ~-5, dips to ~-60 at layer 18, recovers to ~-30.
- **A-Anchored (NQ)**: Starts at ~-5, fluctuates between -10 and 0, ending at ~-5.
---
### Key Observations
1. **Model Version Differences**:
- v0.3 shows more extreme ΔP fluctuations (e.g., Q-Anchored PopQA drops to -60 vs. -45 in v0.1).
- v0.1 trends are smoother, while v0.3 exhibits sharper dips and recoveries.
2. **Anchoring Method Trends**:
- **Q-Anchored** methods generally show deeper ΔP dips (e.g., Q-Anchored PopQA in v0.3 reaches -60).
- **A-Anchored** methods exhibit more stability but smaller magnitude changes.
3. **Dataset-Specific Behavior**:
- **PopQA**: Largest ΔP swings in both versions (e.g., -60 in v0.3).
- **NQ**: Minimal ΔP variation across layers (consistent ~-5 to 0).
4. **Layer-Specific Anomalies**:
- Sharpest dips occur in middle layers (10–20) for most methods.
- v0.3’s Q-Anchored HotpotQA shows a unique U-shaped recovery after layer 20.
---
### Interpretation
- **Reading ΔP**: Per the figure caption, ΔP is measured under attention knockout, so more negative values indicate a larger effect of the knockout. The Q-Anchored series are affected more strongly than the A-Anchored series, particularly in the middle layers.
- **Model Version Impact**: v0.3’s increased volatility could reflect architectural changes or training adjustments affecting layer-specific behavior.
- **Dataset Sensitivity**: PopQA and TriviaQA show greater sensitivity to anchoring methods, while NQ remains stable, possibly due to dataset complexity or question type.
- **Outliers**: The extreme -60 ΔP in v0.3’s Q-Anchored PopQA at layer 15 may indicate a critical layer adjustment or dataset-specific failure mode.
---
### Spatial Grounding & Legend Verification
- **Legend Placement**: Bottom-center, aligned with x-axis.
- **Color/Style Consistency**: All lines match legend entries (e.g., Q-Anchored PopQA = solid blue).
- **Axis Labels**: Clear and unambiguous (ΔP, Layer).
---
### Content Details
- **Numerical Approximations**:
- v0.1 Q-Anchored PopQA: ~-45 (layer 10), ~-10 (layer 30).
- v0.3 Q-Anchored PopQA: ~-60 (layer 15), ~-20 (layer 30).
- A-Anchored NQ: ~-5 (layers 0/30), ~-10 (layer 15).
- **Trend Verification**:
- Q-Anchored lines generally slope downward then recover.
- A-Anchored lines show smaller amplitude oscillations.
---
### Final Notes
The graphs highlight how anchoring type and model version interact to shape layer-specific ΔP values. The image itself does not define ΔP; the figure caption states that it is measured under attention knockout.
</details>
Figure 8: $\Delta\mathrm{P}$ under attention knockout, probing attention activations of the token immediately preceding the exact answer tokens.
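As a rough illustration of what attention knockout involves, the sketch below blocks the attention edges from one query position (for example, an answer-side token) to a set of key positions (for example, the question tokens) by setting their pre-softmax scores to negative infinity. It is a minimal pure-Python sketch, not the paper's implementation: the function names, the `blocked_keys` parameter, the toy score matrix, and the choice of which positions to block are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Softmax over a list in which blocked entries are -inf."""
    m = max(s for s in scores if s != -math.inf)
    exps = [0.0 if s == -math.inf else math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_row(scores, query_pos, blocked_keys=frozenset()):
    """Attention weights for one query position under a causal mask.

    Keys listed in `blocked_keys` (hypothetical parameter) are knocked out
    by setting their pre-softmax score to -inf.
    """
    masked = []
    for key_pos, score in enumerate(scores[query_pos]):
        if key_pos > query_pos or key_pos in blocked_keys:
            masked.append(-math.inf)
        else:
            masked.append(score)
    return softmax(masked)

# Toy 4-token sequence: positions 0-1 stand in for the question,
# positions 2-3 for the answer (illustrative values only).
toy_scores = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [2.0, 1.0, 0.5, 0.0],
    [1.5, 2.0, 0.3, 0.1],
]
question_positions = frozenset({0, 1})

clean = attention_row(toy_scores, query_pos=3)
knocked = attention_row(toy_scores, query_pos=3, blocked_keys=question_positions)
print("clean:  ", [round(w, 3) for w in clean])
print("knocked:", [round(w, 3) for w in knocked])
```

After the knockout, the query position places zero weight on the blocked question tokens and redistributes its attention over the remaining visible positions, which is what cuts the question-to-answer information flow at that layer.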
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Graphs: ΔP vs Layer for QA Models in LLaMA-3.2-1B and 3B
### Overview
The image contains two line graphs comparing ΔP across transformer layers for LLaMA-3.2-1B (left) and LLaMA-3.2-3B (right). The graphs show six data series with distinct line styles and colors, representing combinations of anchoring type (Q-Anchored vs A-Anchored) and dataset (PopQA, TriviaQA, HotpotQA, NQ). Shaded regions indicate confidence intervals.
### Components/Axes
- **X-axis (Layer)**: Integer values from 0 to 15 (1B) and 0 to 25 (3B), representing transformer layers.
- **Y-axis (ΔP)**: Ranges from -80 to 0.
- **Legends**:
- **Q-Anchored (PopQA)**: Solid blue
- **A-Anchored (PopQA)**: Dashed orange
- **Q-Anchored (TriviaQA)**: Dotted green
- **A-Anchored (TriviaQA)**: Dash-dot pink
- **Q-Anchored (HotpotQA)**: Solid purple
- **A-Anchored (NQ)**: Dotted gray
### Detailed Analysis
#### LLaMA-3.2-1B (Left Graph)
1. **Q-Anchored (PopQA)**: Starts at 0, sharply drops to ~-60 at layer 10, then recovers to ~-20 by layer 15.
2. **A-Anchored (PopQA)**: Remains relatively stable, fluctuating between 0 and -10.
3. **Q-Anchored (TriviaQA)**: Drops to ~-50 at layer 5, recovers to ~-10 by layer 15.
4. **A-Anchored (TriviaQA)**: Stable between 0 and -5.
5. **Q-Anchored (HotpotQA)**: Dips to ~-40 at layer 10, recovers to ~-15 by layer 15.
6. **A-Anchored (NQ)**: Stable between 0 and -5.
#### LLaMA-3.2-3B (Right Graph)
1. **Q-Anchored (PopQA)**: Starts at 0, drops to ~-70 at layer 10, recovers to ~-30 by layer 25.
2. **A-Anchored (PopQA)**: Stable between 0 and -5.
3. **Q-Anchored (TriviaQA)**: Drops to ~-60 at layer 5, recovers to ~-20 by layer 25.
4. **A-Anchored (TriviaQA)**: Stable between 0 and -5.
5. **Q-Anchored (HotpotQA)**: Sharp dip to ~-80 at layer 20, recovers to ~-40 by layer 25.
6. **A-Anchored (NQ)**: Stable between 0 and -5.
### Key Observations
1. **Q-Anchored models** consistently show larger ΔP dips than A-Anchored counterparts, especially in deeper layers.
2. **HotpotQA** datasets exhibit the most extreme ΔP fluctuations, particularly in the 3B model.
3. **A-Anchored (NQ)** remains the most stable across all layers and model sizes.
4. **Confidence intervals** (shaded regions) are widest for Q-Anchored models, indicating higher variability.
5. **Layer-specific trends**: ΔP dips correlate with mid-to-late layers (5–20), suggesting architectural or training dynamics in these regions.
### Interpretation
The data suggests that the **Q-Anchored series** are more sensitive to attention knockout at specific layers, particularly on complex datasets like HotpotQA. The pronounced dips in ΔP for the Q-Anchored series may reflect:
- **Architectural bottlenecks**: Certain layers struggle more with QA tasks when anchored to specific datasets.
- **Training dynamics**: Q-Anchored models might overfit to dataset-specific patterns in later layers.
- **Scalability differences**: The 3B model’s extended layers show similar trends, implying that larger models amplify these effects.
The stability of A-Anchored models (especially NQ) suggests they are less prone to layer-specific degradation, making them more robust for general QA tasks. The extreme fluctuations in HotpotQA datasets highlight challenges in handling multi-hop reasoning or domain-specific knowledge in later transformer layers.
</details>
<details>
<summary>x14.png Details</summary>

### Visual Description
## Line Graphs: Llama-3-8B and Llama-3-70B Performance Comparison
### Overview
The image contains two side-by-side line graphs comparing performance metrics (ΔP) across layers for different model configurations of Llama-3-8B and Llama-3-70B. Each graph tracks eight distinct data series representing variations in anchoring strategies (Q-Anchored vs. A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The graphs show significant fluctuations in ΔP values across layers, with notable differences between model sizes.
### Components/Axes
- **X-axis (Layer)**:
- Llama-3-8B: 0–30 (discrete increments)
- Llama-3-70B: 0–80 (discrete increments)
- **Y-axis (ΔP)**:
- Range: -80 to 20 (continuous scale)
- Gridlines at 20-unit intervals
- **Legend**:
- Position: Bottom center
- Entries (color-coded):
- Q-Anchored (PopQA): Solid blue
- A-Anchored (PopQA): Dashed orange
- Q-Anchored (TriviaQA): Solid green
- A-Anchored (TriviaQA): Dashed red
- Q-Anchored (HotpotQA): Solid purple
- A-Anchored (HotpotQA): Dashed gray
- Q-Anchored (NQ): Solid pink
- A-Anchored (NQ): Dashed black
### Detailed Analysis
#### Llama-3-8B Graph
- **Q-Anchored (PopQA)**:
- Starts at ~0ΔP, trends downward to ~-60ΔP by layer 30
- Sharpest drop between layers 5–15
- **A-Anchored (PopQA)**:
- Starts at ~0ΔP, fluctuates between -10ΔP and 10ΔP
- Minimal net change
- **Q-Anchored (TriviaQA)**:
- Starts at ~0ΔP, drops to ~-50ΔP by layer 20
- Gradual decline with minor oscillations
- **A-Anchored (TriviaQA)**:
- Starts at ~0ΔP, peaks at ~15ΔP (layer 10), then declines to ~-30ΔP
- **Q-Anchored (HotpotQA)**:
- Starts at ~0ΔP, drops to ~-70ΔP by layer 30
- Steepest decline among all series
- **A-Anchored (HotpotQA)**:
- Starts at ~0ΔP, fluctuates between -20ΔP and 10ΔP
- **Q-Anchored (NQ)**:
- Starts at ~0ΔP, drops to ~-40ΔP by layer 30
- Moderate decline with mid-layer peaks
- **A-Anchored (NQ)**:
- Starts at ~0ΔP, fluctuates between -15ΔP and 5ΔP
#### Llama-3-70B Graph
- **Q-Anchored (PopQA)**:
- Starts at ~0ΔP, drops to ~-70ΔP by layer 40
- Sharp decline followed by stabilization
- **A-Anchored (PopQA)**:
- Starts at ~0ΔP, fluctuates between -5ΔP and 5ΔP
- Minimal net change
- **Q-Anchored (TriviaQA)**:
- Starts at ~0ΔP, drops to ~-60ΔP by layer 60
- Gradual decline with mid-layer oscillations
- **A-Anchored (TriviaQA)**:
- Starts at ~0ΔP, peaks at ~20ΔP (layer 20), then declines to ~-40ΔP
- **Q-Anchored (HotpotQA)**:
- Starts at ~0ΔP, drops to ~-80ΔP by layer 80
- Steepest and most sustained decline
- **A-Anchored (HotpotQA)**:
- Starts at ~0ΔP, fluctuates between -30ΔP and 10ΔP
- **Q-Anchored (NQ)**:
- Starts at ~0ΔP, drops to ~-50ΔP by layer 80
- Moderate decline with mid-layer peaks
- **A-Anchored (NQ)**:
- Starts at ~0ΔP, fluctuates between -20ΔP and 10ΔP
### Key Observations
1. **Model Size Impact**: Llama-3-70B shows more pronounced fluctuations and steeper declines in ΔP compared to Llama-3-8B.
2. **Anchoring Strategy**:
- Q-Anchored models consistently show larger negative ΔP values across datasets.
- A-Anchored models exhibit smaller magnitude changes but greater variability.
3. **Dataset Sensitivity**:
- HotpotQA triggers the largest ΔP drops in both models.
- NQ shows the least severe declines.
4. **Layer Dynamics**:
- Early layers (0–20) exhibit the most significant ΔP changes.
- Later layers (40–80 for 70B) show stabilization or minor fluctuations.
### Interpretation
The data suggests that the anchoring type strongly influences ΔP across layers. The Q-Anchored (question-anchored) series are more sensitive to the dataset, particularly HotpotQA, and show larger ΔP drops. The A-Anchored (answer-anchored) series stay closer to zero and differ less between datasets. The larger Llama-3-70B model amplifies these trends. Among the four datasets, NQ shows the mildest declines.
</details>
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Chart: ΔP vs Layer for Mistral-7B Model Versions
### Overview
The image contains two side-by-side line charts comparing the performance of different anchoring methods (Q-Anchored and A-Anchored) across model versions (Mistral-7B-v0.1 and Mistral-7B-v0.3). The y-axis represents ΔP (change in performance), and the x-axis represents model layers (0-30). Multiple data series are plotted with distinct line styles and colors.
### Components/Axes
- **X-axis (Layer)**: Labeled "Layer" with ticks at 0, 10, 20, 30. Represents model layers.
- **Y-axis (ΔP)**: Labeled "ΔP" with values ranging from -80 to 20. Represents performance change.
- **Legends**:
- **Left Chart (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed red: A-Anchored (PopQA)
- Dotted green: Q-Anchored (TriviaQA)
- Dash-dot pink: A-Anchored (TriviaQA)
- **Right Chart (v0.3)**:
- Solid blue: Q-Anchored (HotpotQA)
- Dashed red: A-Anchored (HotpotQA)
- Dotted green: Q-Anchored (NQ)
- Dash-dot pink: A-Anchored (NQ)
### Detailed Analysis
#### Left Chart (Mistral-7B-v0.1)
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at ΔP ≈ 0 at layer 0.
- Sharp decline to ΔP ≈ -60 at layer 10.
- Fluctuates between -40 and -20 until layer 30.
2. **A-Anchored (PopQA)** (dashed red):
- Starts at ΔP ≈ 0.
- Gradual decline to ΔP ≈ -20 at layer 10.
- Stabilizes near ΔP ≈ -10 by layer 30.
3. **Q-Anchored (TriviaQA)** (dotted green):
- Starts at ΔP ≈ 0.
- Sharp drop to ΔP ≈ -50 at layer 10.
- Recovers slightly to ΔP ≈ -30 by layer 30.
4. **A-Anchored (TriviaQA)** (dash-dot pink):
- Starts at ΔP ≈ 0.
- Gradual decline to ΔP ≈ -15 at layer 10.
- Stabilizes near ΔP ≈ -5 by layer 30.
#### Right Chart (Mistral-7B-v0.3)
1. **Q-Anchored (HotpotQA)** (solid blue):
- Starts at ΔP ≈ 0.
- Sharp drop to ΔP ≈ -50 at layer 10.
- Recovers to ΔP ≈ -20 by layer 30.
2. **A-Anchored (HotpotQA)** (dashed red):
- Starts at ΔP ≈ 0.
- Gradual decline to ΔP ≈ -10 at layer 10.
- Stabilizes near ΔP ≈ -5 by layer 30.
3. **Q-Anchored (NQ)** (dotted green):
- Starts at ΔP ≈ 0.
- Sharp drop to ΔP ≈ -70 at layer 10.
- Recovers to ΔP ≈ -40 by layer 30.
4. **A-Anchored (NQ)** (dash-dot pink):
- Starts at ΔP ≈ 0.
- Gradual decline to ΔP ≈ -15 at layer 10.
- Stabilizes near ΔP ≈ -5 by layer 30.
### Key Observations
1. **Version Differences**:
- v0.1 shows more pronounced fluctuations in ΔP compared to v0.3.
- v0.3 demonstrates greater stability in performance across layers.
2. **Anchoring Method Trends**:
- Q-Anchored methods consistently show sharper initial drops in ΔP.
- A-Anchored methods exhibit smoother, more gradual declines.
3. **Dataset-Specific Behavior**:
- NQ dataset in v0.3 shows the most extreme ΔP drop (-70 at layer 10).
- Q-Anchored (PopQA) in v0.1 drops to -60 at layer 10, the second-largest initial drop after NQ.
### Interpretation
The data suggests that anchoring methods significantly impact model performance across layers, with Q-Anchored approaches causing more abrupt performance changes. Version v0.3 shows improved stability compared to v0.1, particularly for the NQ dataset. The A-Anchored methods appear more robust to layer-specific variations, maintaining closer-to-zero ΔP values throughout the model. The extreme drop in Q-Anchored (NQ) for v0.3 highlights potential dataset-specific vulnerabilities in the anchoring strategy.
</details>
Figure 9: $\Delta\mathrm{P}$ under attention knockout, probing attention activations of the last exact answer token.
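This excerpt does not restate how ΔP is computed. Assuming it denotes the change, in percentage points, in a linear probe's predicted truthfulness probability when the probe reads activations from a knocked-out forward pass instead of a clean one, the calculation can be sketched as follows. The probe weights, the toy activations, and the function names are hypothetical and stand in for quantities that would come from a trained probe and a real model.

```python
import math

def probe_probability(hidden, weights, bias):
    """Logistic (linear) probe: probability assigned to one hidden state."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def delta_p(clean_states, knocked_states, weights, bias):
    """Mean change in probe probability, in percentage points.

    Assumed definition: P(knocked-out activations) - P(clean activations),
    averaged over samples. Negative values mean the knockout weakens the
    signal the probe relies on.
    """
    diffs = [
        probe_probability(k, weights, bias) - probe_probability(c, weights, bias)
        for c, k in zip(clean_states, knocked_states)
    ]
    return 100.0 * sum(diffs) / len(diffs)

# Toy data (hypothetical): two samples with 2-dimensional activations.
weights, bias = [2.0, -1.0], 0.0
clean_states = [[1.0, 0.0], [0.5, 0.5]]
# Sample 1 loses its signal under knockout; sample 2 is unchanged.
knocked_states = [[0.0, 0.0], [0.5, 0.5]]

result = delta_p(clean_states, knocked_states, weights, bias)
print(round(result, 2))
```

Under this reading, a strongly negative value at a given layer means the probe's signal there depends on the blocked attention edges, while a value near zero means the signal survives the knockout.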
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Chart: Llama-3.2-1B and Llama-3.2-3B Performance Across Layers
### Overview
The image displays two line charts comparing the performance of different anchoring strategies (Q-Anchored vs. A-Anchored) across layers for two Llama-3.2 models (1B and 3B). The y-axis represents ΔP (change in performance), and the x-axis represents the layer number. Each line corresponds to a specific anchoring strategy and dataset (e.g., PopQA, TriviaQA, HotpotQA, NQ).
---
### Components/Axes
- **Panels**:
- **Left Panel**: Llama-3.2-1B (1 billion parameters).
- **Right Panel**: Llama-3.2-3B (3 billion parameters).
- **Axes**:
- **Y-axis (ΔP)**: Ranges from -80 to 0, labeled "ΔP".
- **X-axis (Layer)**: Ranges from 0 to 15 (left panel) and 0 to 25 (right panel), labeled "Layer".
- **Legend**: Located at the bottom, with seven entries:
1. **Q-Anchored (PopQA)**: Blue solid line.
2. **A-Anchored (PopQA)**: Orange dashed line.
3. **Q-Anchored (TriviaQA)**: Green dotted line.
4. **A-Anchored (TriviaQA)**: Red dotted line.
5. **Q-Anchored (HotpotQA)**: Purple dash-dot line.
6. **A-Anchored (HotpotQA)**: Gray dashed line.
7. **Q-Anchored (NQ)**: Pink dotted line.
---
### Detailed Analysis
#### Llama-3.2-1B (Left Panel)
- **Q-Anchored (PopQA)**: Starts at 0, dips to ~-40 at layer 5, rises to ~-20 at layer 10, and ends at ~-30.
- **A-Anchored (PopQA)**: Starts at 0, dips to ~-20 at layer 5, rises to ~-10 at layer 10, and ends at ~-15.
- **Q-Anchored (TriviaQA)**: Starts at 0, dips to ~-30 at layer 5, rises to ~-10 at layer 10, and ends at ~-20.
- **A-Anchored (TriviaQA)**: Starts at 0, dips to ~-25 at layer 5, rises to ~-15 at layer 10, and ends at ~-20.
- **Q-Anchored (HotpotQA)**: Starts at 0, dips to ~-35 at layer 5, rises to ~-15 at layer 10, and ends at ~-25.
- **A-Anchored (HotpotQA)**: Starts at 0, dips to ~-25 at layer 5, rises to ~-10 at layer 10, and ends at ~-15.
#### Llama-3.2-3B (Right Panel)
- **Q-Anchored (PopQA)**: Starts at 0, dips to ~-40 at layer 5, rises to ~-20 at layer 10, and ends at ~-30.
- **A-Anchored (PopQA)**: Starts at 0, dips to ~-20 at layer 5, rises to ~-10 at layer 10, and ends at ~-15.
- **Q-Anchored (TriviaQA)**: Starts at 0, dips to ~-30 at layer 5, rises to ~-10 at layer 10, and ends at ~-20.
- **A-Anchored (TriviaQA)**: Starts at 0, dips to ~-25 at layer 5, rises to ~-15 at layer 10, and ends at ~-20.
- **Q-Anchored (HotpotQA)**: Starts at 0, dips to ~-35 at layer 5, rises to ~-15 at layer 10, and ends at ~-25.
- **A-Anchored (HotpotQA)**: Starts at 0, dips to ~-25 at layer 5, rises to ~-10 at layer 10, and ends at ~-15.
- **Q-Anchored (NQ)**: Starts at 0, dips to ~-40 at layer 5, rises to ~-20 at layer 10, and ends at ~-30.
---
### Key Observations
1. **Q-Anchored vs. A-Anchored**:
- Q-Anchored models (e.g., PopQA, TriviaQA, HotpotQA) consistently show larger ΔP decreases compared to A-Anchored models.
- Example: In Llama-3.2-1B, Q-Anchored PopQA drops to ~-40, while A-Anchored PopQA only reaches ~-20.
2. **Model Size Impact**:
- The 3B model (right panel) exhibits more pronounced ΔP decreases, especially for Q-Anchored strategies.
- Example: Q-Anchored NQ in the 3B model drops to ~-40, matching the lowest value reached by any line.
3. **Layer Trends**:
- ΔP decreases sharply in early layers (e.g., layer 5) and stabilizes or slightly recovers in later layers (e.g., layer 10–15/25).
- The 3B model shows more variability in recovery (e.g., Q-Anchored NQ recovers to ~-20 at layer 10 but drops again to ~-30 at layer 25).
---
### Interpretation
- **Anchoring Strategy**: Q-Anchored models (question-based anchoring) experience greater performance degradation (ΔP) compared to A-Anchored models (answer-based anchoring). This suggests that answer anchoring may be more effective for maintaining performance across layers.
- **Model Complexity**: The 3B model (larger) shows more severe ΔP drops, indicating that increased model size amplifies the impact of anchoring strategies.
- **NQ dataset**: The Q-Anchored (NQ) line in the 3B model shows one of the largest ΔP decreases (~-40 at layer 5), comparable to Q-Anchored (PopQA).
- **Layer-Specific Behavior**: Early layers (e.g., layer 5) are more sensitive to anchoring strategies, while later layers show partial recovery, possibly due to model adaptation or optimization.
The data underscores the importance of anchoring strategies in balancing performance across layers, with Q-Anchored models being more vulnerable to degradation, particularly in larger models.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line Graph: Performance Comparison of Llama-3 Models Across Layers
### Overview
The image contains two side-by-side line graphs comparing the performance of different Llama-3 model configurations (8B and 70B) across layers. Each graph tracks the change in ΔP (likely a performance metric) across layers, with multiple data series representing different anchoring strategies and datasets.
### Components/Axes
- **X-axis**: Layer (0 to 30 for 8B, 0 to 80 for 70B)
- **Y-axis**: ΔP (ranging from -80 to 0)
- **Legends**:
- **Left Graph (Llama-3-8B)**:
- Blue: Q-Anchored (PopQA)
- Green: Q-Anchored (TriviaQA)
- Red: Q-Anchored (HotpotQA)
- Pink: Q-Anchored (NQ)
- Orange: A-Anchored (PopQA)
- Purple: A-Anchored (TriviaQA)
- Gray: A-Anchored (HotpotQA)
- Pink Dashed: A-Anchored (NQ)
- **Right Graph (Llama-3-70B)**:
- Same legend as above, with lines extending to 80 layers.
### Detailed Analysis
#### Llama-3-8B (Left Graph)
- **Q-Anchored (PopQA)**: Starts at 0, drops sharply to ~-60 by layer 20, then stabilizes with minor fluctuations.
- **Q-Anchored (TriviaQA)**: Begins at 0, declines to ~-40 by layer 20, then fluctuates between -30 and -50.
- **Q-Anchored (HotpotQA)**: Similar to TriviaQA but with more pronounced oscillations.
- **Q-Anchored (NQ)**: Remains near 0 with slight oscillations.
- **A-Anchored (PopQA)**: Starts at 0, drops to ~-40 by layer 20, then stabilizes.
- **A-Anchored (TriviaQA)**: Declines to ~-30 by layer 20, then fluctuates between -20 and -40.
- **A-Anchored (HotpotQA)**: Similar to TriviaQA but with more variability.
- **A-Anchored (NQ)**: Stays near 0 with minimal changes.
#### Llama-3-70B (Right Graph)
- **Q-Anchored (PopQA)**: Starts at 0, drops to ~-40 by layer 40, then stabilizes.
- **Q-Anchored (TriviaQA)**: Declines to ~-30 by layer 40, then fluctuates between -20 and -40.
- **Q-Anchored (HotpotQA)**: Similar to TriviaQA but with more pronounced oscillations.
- **Q-Anchored (NQ)**: Remains near 0 with slight oscillations.
- **A-Anchored (PopQA)**: Starts at 0, drops to ~-30 by layer 40, then stabilizes.
- **A-Anchored (TriviaQA)**: Declines to ~-20 by layer 40, then fluctuates between -10 and -30.
- **A-Anchored (HotpotQA)**: Similar to TriviaQA but with more variability.
- **A-Anchored (NQ)**: Stays near 0 with minimal changes.
### Key Observations
1. **Q-Anchored vs. A-Anchored**: Q-Anchored models generally show steeper declines in ΔP compared to A-Anchored models, especially in the 8B version.
2. **Dataset Impact**: The PopQA and TriviaQA series exhibit more variability than the NQ series, which remain near 0.
3. **Model Size**: The 70B model shows more stability across layers compared to the 8B model, with less extreme ΔP values.
4. **Layer-Specific Trends**: In the 8B model, the sharpest drops occur in the first 20 layers, while the 70B model shows gradual changes.
### Interpretation
The data suggests that the anchoring type (Q vs. A) significantly influences ΔP, with the Q-Anchored series showing more pronounced declines. In the 70B model these declines are smaller, giving more stable curves across layers. The NQ series maintain near-zero ΔP for both anchoring types. The dataset-specific trends (e.g., PopQA vs. TriviaQA) indicate that the effect also depends on the dataset.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Chart: ΔP Across Layers for Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two side-by-side line charts comparing the ΔP metric across 30 layers for two versions of the Mistral-7B model (v0.1 and v0.3). Each chart includes eight data series represented by distinct colored lines, with a legend at the bottom-left corner. The y-axis measures ΔP (ranging from -80 to 20), and the x-axis represents layer numbers (0 to 30).
---
### Components/Axes
- **Y-Axis**: ΔP (values from -80 to 20, increments of 20)
- **X-Axis**: Layer (0 to 30, increments of 10)
- **Legend**: Located at bottom-left, with eight entries:
1. **Q-Anchored (PopQA)**: Solid blue line
2. **A-Anchored (PopQA)**: Dashed orange line
3. **Q-Anchored (TriviaQA)**: Dotted green line
4. **A-Anchored (TriviaQA)**: Dash-dot red line
5. **Q-Anchored (HotpotQA)**: Solid purple line
6. **A-Anchored (HotpotQA)**: Dotted gray line
7. **Q-Anchored (NQ)**: Dash-dot pink line
8. **A-Anchored (NQ)**: Solid gray line
---
### Detailed Analysis
#### Left Panel (Mistral-7B-v0.1)
- **Q-Anchored (PopQA)**: Starts at 0, declines sharply to ~-60 by layer 30 (blue line).
- **A-Anchored (PopQA)**: Starts at 0, fluctuates minimally, ending near 0 (orange dashed line).
- **Q-Anchored (TriviaQA)**: Drops from 0 to ~-50 by layer 30 (dotted green line).
- **A-Anchored (TriviaQA)**: Peaks at ~+10 around layer 10, then declines to ~-10 (red dash-dot line).
- **Q-Anchored (HotpotQA)**: Declines from 0 to ~-50 by layer 30 (solid purple line).
- **A-Anchored (HotpotQA)**: Starts at 0, fluctuates between +5 and -5 (dotted gray line).
- **Q-Anchored (NQ)**: Drops from 0 to ~-55 by layer 30 (dash-dot pink line).
- **A-Anchored (NQ)**: Starts at 0, fluctuates between +5 and -5 (solid gray line).
#### Right Panel (Mistral-7B-v0.3)
- **Q-Anchored (PopQA)**: Declines from 0 to ~-50 by layer 30 (blue line).
- **A-Anchored (PopQA)**: Starts at 0, fluctuates minimally, ending near 0 (orange dashed line).
- **Q-Anchored (TriviaQA)**: Drops from 0 to ~-45 by layer 30 (dotted green line).
- **A-Anchored (TriviaQA)**: Peaks at ~+15 around layer 10, then declines to ~-5 (red dash-dot line).
- **Q-Anchored (HotpotQA)**: Declines from 0 to ~-40 by layer 30 (solid purple line).
- **A-Anchored (HotpotQA)**: Starts at 0, fluctuates between +5 and -5 (dotted gray line).
- **Q-Anchored (NQ)**: Drops from 0 to ~-50 by layer 30 (dash-dot pink line).
- **A-Anchored (NQ)**: Starts at 0, fluctuates between +5 and -5 (solid gray line).
---
### Key Observations
1. **General Trend**: Most Q-Anchored lines show a consistent downward trend in ΔP across layers, while A-Anchored lines remain relatively stable or exhibit minor fluctuations.
2. **Version Differences**:
- v0.3 shows smaller ΔP magnitudes than v0.1 for most Q-Anchored lines (e.g., Q-Anchored (PopQA) ends near -60 in v0.1 versus -50 in v0.3).
- A-Anchored (TriviaQA) in v0.3 has a higher peak (~+15 vs. +10 in v0.1).
3. **Anomalies**:
- A-Anchored (TriviaQA) in v0.1 has a pronounced peak at layer 10.
- Q-Anchored (HotpotQA) in v0.1 shows a sharp dip at layer 15.
---
### Interpretation
The charts indicate that ΔP under attention knockout depends mainly on the pathway (Q-Anchored vs. A-Anchored) and, to a lesser extent, on the dataset (PopQA, TriviaQA, HotpotQA, NQ). Q-Anchored curves fall by roughly 40 to 60 points by layer 30, consistent with truthfulness signals that rely on question-to-answer information flow, whereas A-Anchored curves stay within about 10 to 15 points of zero, consistent with evidence carried by the answer itself. The declines are somewhat smaller in v0.3 than in v0.1; the charts alone do not show whether this reflects architectural or training differences between the two versions.
</details>
Figure 10: $\Delta\mathrm{P}$ under attention knockout, probing mlp activations of the final token.
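As a reading aid for the $\Delta\mathrm{P}$ curves, the following minimal sketch shows one way such a quantity could be computed. It assumes, since this appendix does not restate the definition, that $\Delta\mathrm{P}$ is the change in the probe's mean predicted probability, in percentage points, between a clean forward pass and a pass with attention knockout, with one value per layer on the x-axis. The function name `delta_p` and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import fmean

def delta_p(clean_probs, knockout_probs_per_layer):
    """Per-layer change in mean probe probability, in percentage points.

    clean_probs: probe outputs on the unmodified run, one per sample.
    knockout_probs_per_layer: one list of probe outputs per layer,
        obtained with attention knockout applied.
    Assumption: this definition of delta-P is illustrative only.
    """
    clean_mean = fmean(clean_probs)
    return [100.0 * (fmean(layer_probs) - clean_mean)
            for layer_probs in knockout_probs_per_layer]

# Toy data (made up): 4 samples, 3 layers.
clean = [0.9, 0.8, 0.9, 0.8]
knocked = [
    [0.9, 0.8, 0.9, 0.8],  # knockout has no effect
    [0.5, 0.4, 0.5, 0.4],  # strong drop
    [0.8, 0.7, 0.8, 0.7],  # mild drop
]
curve = delta_p(clean, knocked)
print([round(v, 1) for v in curve])
```

Under this assumed definition, a curve that stays near zero means the probe's predictions barely change under knockout, while a strongly negative curve means they collapse.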
<details>
<summary>x19.png Details</summary>

### Visual Description
## Line Graph: ΔP vs. Layer for Llama-3.2-1B and Llama-3.2-3B Models
### Overview
The image contains two side-by-side line graphs comparing the performance of Q-Anchored and A-Anchored models across layers for two versions of the Llama-3.2 architecture (1B and 3B parameters). The y-axis represents ΔP (change in performance), and the x-axis represents the layer number. Each graph includes multiple data series differentiated by color, line style, and legend labels.
---
### Components/Axes
- **X-axis (Layer)**:
- Llama-3.2-1B: 0 to 15 (increments of 5)
- Llama-3.2-3B: 0 to 25 (increments of 5)
- **Y-axis (ΔP)**:
- Range: -80 to 0 (increments of 20)
- **Legend**:
- Positioned at the bottom of both graphs.
- Labels include:
- **Q-Anchored (PopQA)**: Blue solid line
- **A-Anchored (PopQA)**: Orange dashed line
- **Q-Anchored (TriviaQA)**: Green solid line
- **A-Anchored (TriviaQA)**: Red dashed line
- **Q-Anchored (HotpotQA)**: Purple solid line
- **A-Anchored (HotpotQA)**: Gray dashed line
- **Q-Anchored (NQ)**: Pink solid line
- **A-Anchored (NQ)**: Black dashed line
---
### Detailed Analysis
#### Llama-3.2-1B Graph
- **Q-Anchored (PopQA)**: Starts at ~-20 (Layer 0), decreases to ~-60 (Layer 15). Trend: Steady decline.
- **A-Anchored (PopQA)**: Starts at ~0 (Layer 0), decreases to ~-20 (Layer 15). Trend: Gradual decline.
- **Q-Anchored (TriviaQA)**: Starts at ~-40 (Layer 0), decreases to ~-80 (Layer 15). Trend: Sharp decline.
- **A-Anchored (TriviaQA)**: Starts at ~-20 (Layer 0), decreases to ~-60 (Layer 15). Trend: Moderate decline.
- **Q-Anchored (HotpotQA)**: Starts at ~-60 (Layer 0), decreases to ~-80 (Layer 15). Trend: Slight decline.
- **A-Anchored (HotpotQA)**: Starts at ~-40 (Layer 0), decreases to ~-60 (Layer 15). Trend: Steady decline.
- **Q-Anchored (NQ)**: Starts at ~-80 (Layer 0), decreases to ~-100 (Layer 15). Trend: Sharp decline.
- **A-Anchored (NQ)**: Starts at ~-60 (Layer 0), decreases to ~-80 (Layer 15). Trend: Moderate decline.
#### Llama-3.2-3B Graph
- **Q-Anchored (PopQA)**: Starts at ~-10 (Layer 0), decreases to ~-50 (Layer 25). Trend: Steady decline.
- **A-Anchored (PopQA)**: Starts at ~0 (Layer 0), decreases to ~-20 (Layer 25). Trend: Gradual decline.
- **Q-Anchored (TriviaQA)**: Starts at ~-30 (Layer 0), decreases to ~-70 (Layer 25). Trend: Sharp decline.
- **A-Anchored (TriviaQA)**: Starts at ~-10 (Layer 0), decreases to ~-50 (Layer 25). Trend: Moderate decline.
- **Q-Anchored (HotpotQA)**: Starts at ~-50 (Layer 0), decreases to ~-70 (Layer 25). Trend: Slight decline.
- **A-Anchored (HotpotQA)**: Starts at ~-30 (Layer 0), decreases to ~-50 (Layer 25). Trend: Steady decline.
- **Q-Anchored (NQ)**: Starts at ~-70 (Layer 0), decreases to ~-90 (Layer 25). Trend: Sharp decline.
- **A-Anchored (NQ)**: Starts at ~-50 (Layer 0), decreases to ~-70 (Layer 25). Trend: Moderate decline.
---
### Key Observations
1. **General Trend**: All data series show a downward trend in ΔP as layer number increases, indicating performance degradation with deeper layers.
2. **Model Size Impact**:
- Llama-3.2-3B (larger model) exhibits less severe ΔP declines compared to Llama-3.2-1B, suggesting better layer-wise performance in larger models.
3. **Dataset-Specific Patterns**:
- **NQ (Natural Questions)**: Shows the steepest declines, indicating higher sensitivity to layer depth.
- **PopQA**: Exhibits the least severe declines, suggesting it is less affected by layer depth.
4. **Q-Anchored vs. A-Anchored**:
- Q-Anchored curves sit consistently below the A-Anchored curves (by about 20 points or more) on every dataset, meaning the Q-Anchored series show the larger drop in ΔP.
---
### Interpretation
The data suggest that, on every dataset (PopQA, TriviaQA, HotpotQA, NQ), the Q-Anchored series are more strongly affected than the A-Anchored series: their ΔP values are more negative at every layer. The larger Llama-3.2-3B model shows slightly smaller drops than the 1B version. NQ shows the most negative values and PopQA the least, which may reflect differences in dataset characteristics.
</details>
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Chart: ΔP Performance Across Layers for LLaMA-3-8B and LLaMA-3-70B Models
### Overview
The image contains two side-by-side line charts comparing the ΔP (performance change) of different model configurations (Q-Anchored and A-Anchored) across layers for two LLaMA-3 model sizes: 8B and 70B. Each chart includes multiple data series representing different question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). The charts show how ΔP values evolve across model layers, with distinct trends for each configuration and dataset.
### Components/Axes
- **X-Axis (Layer)**:
- Left panel (LLaMA-3-8B): Layers 0–30
- Right panel (LLaMA-3-70B): Layers 0–80
- Discrete integer values with no intermediate markers
- **Y-Axis (ΔP)**:
- Range: -100 to 0 (negative values indicate performance degradation)
- Tick intervals: -100, -80, -60, -40, -20, 0
- **Legend**:
- Located at the bottom of both panels
- Line styles and colors:
- **Solid lines**: Q-Anchored configurations
- **Dashed lines**: A-Anchored configurations
- **Color coding**:
- Blue: PopQA
- Green: TriviaQA
- Purple: HotpotQA
- Red: NQ
- Labels:
- Q-Anchored (PopQA)
- Q-Anchored (TriviaQA)
- Q-Anchored (HotpotQA)
- Q-Anchored (NQ)
- A-Anchored (PopQA)
- A-Anchored (TriviaQA)
- A-Anchored (HotpotQA)
- A-Anchored (NQ)
### Detailed Analysis
#### LLaMA-3-8B Panel (Left)
- **Q-Anchored (PopQA)**:
- Starts near 0 at Layer 0, sharply declines to ~-80 by Layer 10, then fluctuates between -60 and -40 until Layer 30.
- **A-Anchored (PopQA)**:
- Begins at ~-10, rises to ~0 by Layer 5, then stabilizes near 0 with minor oscillations.
- **Q-Anchored (TriviaQA)**:
- Drops from 0 to ~-60 by Layer 15, then stabilizes with minor fluctuations.
- **A-Anchored (TriviaQA)**:
- Remains near 0 with slight oscillations throughout.
- **Q-Anchored (HotpotQA)**:
- Declines steeply to ~-80 by Layer 10, then fluctuates between -60 and -40.
- **A-Anchored (HotpotQA)**:
- Starts at ~-10, rises to ~0 by Layer 5, then stabilizes.
- **Q-Anchored (NQ)**:
- Sharp decline to ~-80 by Layer 10, then stabilizes with minor oscillations.
- **A-Anchored (NQ)**:
- Begins at ~-10, rises to ~0 by Layer 5, then stabilizes.
#### LLaMA-3-70B Panel (Right)
- **Q-Anchored (PopQA)**:
- Starts near 0, dips to ~-80 by Layer 20, then fluctuates between -60 and -40.
- **A-Anchored (PopQA)**:
- Begins at ~-10, rises to ~0 by Layer 10, then stabilizes with minor oscillations.
- **Q-Anchored (TriviaQA)**:
- Declines to ~-60 by Layer 30, then stabilizes with fluctuations.
- **A-Anchored (TriviaQA)**:
- Remains near 0 with slight oscillations.
- **Q-Anchored (HotpotQA)**:
- Drops to ~-80 by Layer 20, then fluctuates between -60 and -40.
- **A-Anchored (HotpotQA)**:
- Starts at ~-10, rises to ~0 by Layer 10, then stabilizes.
- **Q-Anchored (NQ)**:
- Sharp decline to ~-100 by Layer 20, then stabilizes with oscillations.
- **A-Anchored (NQ)**:
- Begins at ~-10, rises to ~0 by Layer 10, then stabilizes.
### Key Observations
1. **Model Size Impact**:
- The 70B model shows more pronounced fluctuations in ΔP values compared to the 8B model, particularly in the middle layers (e.g., Layers 20–40).
2. **Anchoring Strategy**:
- A-Anchored configurations generally maintain higher ΔP values (closer to 0) than Q-Anchored configurations across most datasets.
3. **Dataset Variability**:
- NQ (Natural Questions) shows the most severe performance degradation for Q-Anchored models, reaching ~-100 in the 70B model.
4. **Layer-Specific Trends**:
- Performance degradation for Q-Anchored models often occurs in the middle layers (e.g., Layers 10–30 for 8B, Layers 20–40 for 70B).
5. **Stability**:
- A-Anchored models exhibit greater stability, with ΔP values clustering near 0 after initial adjustments.
### Interpretation
The charts suggest that the pathway (Q-Anchored vs. A-Anchored) largely determines how ΔP evolves across layers. A-Anchored curves stay close to zero after the first few layers, while Q-Anchored curves drop sharply, reaching about -60 to -80 on most datasets and about -100 on NQ in the 70B model. In the 70B model the drop occurs at later layers (around layer 20, versus layer 10 in the 8B model), in line with its greater depth.
</details>
<details>
<summary>x21.png Details</summary>

### Visual Description
## Line Graph: ΔP vs. Layer for Mistral-7B-v0.1 and Mistral-7B-v0.3
### Overview
The image contains two side-by-side line graphs comparing the performance (ΔP) of different Q-Anchored and A-Anchored models across layers (0–30) in two versions of the Mistral-7B model (v0.1 and v0.3). Each graph includes multiple data series with distinct line styles and colors, representing different question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ).
---
### Components/Axes
- **X-axis (Layer)**: Ranges from 0 to 30, labeled "Layer".
- **Y-axis (ΔP)**: Ranges from -80 to 0, labeled "ΔP".
- **Legends**:
- **Left Graph (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: A-Anchored (HotpotQA)
- Solid gray: Q-Anchored (NQ)
- Dashed gray: A-Anchored (NQ)
- **Right Graph (v0.3)**:
- Same legend as v0.1.
---
### Detailed Analysis
#### Mistral-7B-v0.1 (Left Graph)
- **Q-Anchored (PopQA)**: Solid blue line starts near 0, drops sharply to ~-60 by layer 10, then fluctuates between -40 and -60.
- **A-Anchored (PopQA)**: Dashed orange line remains relatively stable, oscillating between ~-10 and 0.
- **Q-Anchored (TriviaQA)**: Solid green line starts near 0, drops to ~-50 by layer 10, then stabilizes.
- **A-Anchored (TriviaQA)**: Dashed red line fluctuates between ~-10 and 0.
- **Q-Anchored (HotpotQA)**: Solid purple line starts near 0, drops to ~-50 by layer 10, then stabilizes.
- **A-Anchored (HotpotQA)**: Dashed pink line fluctuates between ~-10 and 0.
- **Q-Anchored (NQ)**: Solid gray line starts near 0, drops to ~-60 by layer 10, then stabilizes.
- **A-Anchored (NQ)**: Dashed gray line fluctuates between ~-10 and 0.
#### Mistral-7B-v0.3 (Right Graph)
- **Q-Anchored (PopQA)**: Solid blue line starts near 0, drops to ~-50 by layer 10, then fluctuates between -30 and -50.
- **A-Anchored (PopQA)**: Dashed orange line remains stable, oscillating between ~-10 and 0.
- **Q-Anchored (TriviaQA)**: Solid green line starts near 0, drops to ~-40 by layer 10, then stabilizes.
- **A-Anchored (TriviaQA)**: Dashed red line fluctuates between ~-10 and 0.
- **Q-Anchored (HotpotQA)**: Solid purple line starts near 0, drops to ~-40 by layer 10, then stabilizes.
- **A-Anchored (HotpotQA)**: Dashed pink line fluctuates between ~-10 and 0.
- **Q-Anchored (NQ)**: Solid gray line starts near 0, drops to ~-50 by layer 10, then stabilizes.
- **A-Anchored (NQ)**: Dashed gray line fluctuates between ~-10 and 0.
---
### Key Observations
1. **Q-Anchored Models**:
- All Q-Anchored lines (PopQA, TriviaQA, HotpotQA, NQ) show a **sharp decline** in ΔP (from ~0 to ~-40 to -60) in the first 10 layers, followed by stabilization.
- In v0.3, the decline is slightly less severe than in v0.1.
2. **A-Anchored Models**:
- All A-Anchored lines (PopQA, TriviaQA, HotpotQA, NQ) remain **relatively stable**, with minor fluctuations around ~-10 to 0.
3. **Version Differences**:
- v0.3 shows **reduced variability** in Q-Anchored models compared to v0.1, suggesting improved stability in later layers.
- A-Anchored models show **no significant change** between versions.
---
### Interpretation
- **Q-Anchored vs. A-Anchored**:
- Q-Anchored curves drop sharply within the first 10 layers (to about -40 to -60) and then level off, so the effect is concentrated in early layers.
- A-Anchored curves stay between about -10 and 0 at all layers, indicating that they are largely unaffected.
- **Version Impact**:
- The drops are about 10 points smaller in v0.3 than in v0.1; the charts alone do not establish the cause.
- **Dataset-Specific Trends**:
- PopQA and NQ show the largest Q-Anchored declines (about -60 in v0.1 and -50 in v0.3).
- TriviaQA and HotpotQA show somewhat smaller declines (about -50 in v0.1 and -40 in v0.3).
---
### Notes on Data Extraction
- All values are approximate, as the graph lacks explicit numerical markers.
- Line styles (solid/dashed) and colors (blue, orange, green, red, purple, pink, gray) are matched to the legend.
- No text or tables are present in the image beyond the axes, legends, and titles.
This analysis highlights the contrast between the Q-Anchored and A-Anchored series across layers: the former drop sharply in early layers, the latter stay near zero, and the drops are slightly smaller in v0.3 than in v0.1.
</details>
Figure 11: $\Delta\mathrm{P}$ under attention knockout, probing mlp activations of the token immediately preceding the exact answer tokens.
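To make the intervention behind these figures concrete, the sketch below illustrates attention knockout in its generic form: selected attention edges are blocked by adding $-\infty$ to their pre-softmax scores, on top of the usual causal mask. It assumes that the blocked edges run from answer positions (queries) to question positions (keys); the token positions, the helper names `knockout_mask` and `masked_softmax_row`, and the uniform scores are hypothetical, and this is not the authors' implementation.

```python
import math

NEG_INF = float("-inf")

def knockout_mask(seq_len, src_positions, dst_positions):
    """Additive mask: 0 where query i may attend to key j, -inf otherwise.

    Starts from a causal mask, then blocks every edge dst -> src.
    Assumes src and dst are disjoint, so each row keeps at least one key.
    """
    mask = [[0.0 if j <= i else NEG_INF for j in range(seq_len)]
            for i in range(seq_len)]
    for i in dst_positions:
        for j in src_positions:
            mask[i][j] = NEG_INF
    return mask

def masked_softmax_row(scores, mask_row):
    """Softmax over one query's scores after adding the mask."""
    z = [s + m for s, m in zip(scores, mask_row)]
    top = max(z)
    w = [math.exp(v - top) for v in z]
    total = sum(w)
    return [v / total for v in w]

# Toy layout (hypothetical): tokens 0-1 are the question, 3-4 the answer.
mask = knockout_mask(5, src_positions=[0, 1], dst_positions=[3, 4])
weights = [masked_softmax_row([0.0] * 5, row) for row in mask]
print(weights[3])  # answer token 3 places no weight on question tokens
```

With the mask applied, the answer positions redistribute their attention over the remaining visible tokens, so any signal they previously read directly from the question is cut off.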
<details>
<summary>x22.png Details</summary>

### Visual Description
## Line Graph: ΔP vs. Layer for Llama-3.2-1B and Llama-3.2-3B Models
### Overview
The image contains two line graphs comparing the performance (ΔP) of different Q-Anchored and A-Anchored models across layers in two versions of the Llama model (3.2-1B and 3.2-3B). The graphs show trends in ΔP values as a function of layer depth, with distinct lines representing different datasets (PopQA, TriviaQA, HotpotQA, NQ) and anchoring strategies (Q-Anchored vs. A-Anchored).
---
### Components/Axes
- **X-axis (Layer)**: Represents the depth of the model layers, ranging from 0 to 15 for Llama-3.2-1B and 0 to 25 for Llama-3.2-3B.
- **Y-axis (ΔP)**: Represents the performance metric (ΔP), with values ranging from -80 to 0.
- **Legends**:
- **Llama-3.2-1B (Left Graph)**:
- **Blue Solid**: Q-Anchored (PopQA)
- **Green Dashed**: Q-Anchored (TriviaQA)
- **Orange Dotted**: A-Anchored (PopQA)
- **Red Dashed**: A-Anchored (TriviaQA)
- **Purple Dotted**: Q-Anchored (HotpotQA)
- **Pink Dashed**: Q-Anchored (NQ)
- **Llama-3.2-3B (Right Graph)**:
- **Blue Solid**: Q-Anchored (PopQA)
- **Green Dashed**: Q-Anchored (TriviaQA)
- **Orange Dotted**: A-Anchored (PopQA)
- **Red Dashed**: A-Anchored (TriviaQA)
- **Purple Dotted**: Q-Anchored (HotpotQA)
- **Pink Dashed**: Q-Anchored (NQ)
---
### Detailed Analysis
#### Llama-3.2-1B (Left Graph)
- **Q-Anchored (PopQA)**: Starts at 0, drops sharply to ~-60 by layer 5, then stabilizes with minor fluctuations.
- **Q-Anchored (TriviaQA)**: Similar to PopQA but with a slightly less steep decline, reaching ~-50 by layer 5.
- **A-Anchored (PopQA)**: Starts at 0, declines to ~-40 by layer 5, then stabilizes.
- **A-Anchored (TriviaQA)**: Similar to A-Anchored (PopQA) but with a slightly less steep decline.
- **Q-Anchored (HotpotQA)**: Starts at 0, drops to ~-50 by layer 5, then stabilizes.
- **Q-Anchored (NQ)**: Remains flat at 0 across all layers.
#### Llama-3.2-3B (Right Graph)
- **Q-Anchored (PopQA)**: Starts at 0, drops sharply to ~-70 by layer 5, then stabilizes with minor fluctuations.
- **Q-Anchored (TriviaQA)**: Similar to PopQA but with a slightly less steep decline, reaching ~-60 by layer 5.
- **A-Anchored (PopQA)**: Starts at 0, declines to ~-50 by layer 5, then stabilizes.
- **A-Anchored (TriviaQA)**: Similar to A-Anchored (PopQA) but with a slightly less steep decline.
- **Q-Anchored (HotpotQA)**: Starts at 0, drops to ~-60 by layer 5, then stabilizes.
- **Q-Anchored (NQ)**: Remains flat at 0 across all layers.
---
### Key Observations
1. **Initial Sharp Decline**: All Q-Anchored models (PopQA, TriviaQA, HotpotQA) show a sharp drop in ΔP within the first 5 layers, followed by stabilization.
2. **A-Anchored Models**: Show similar trends but with less pronounced declines and more gradual stabilization.
3. **Q-Anchored (NQ)**: Remains flat at 0, indicating no measurable change in ΔP across layers.
4. **Layer Depth**: The 3B version (right graph) extends to 25 layers, showing consistent trends but with more variability in later layers (e.g., oscillations in Q-Anchored (HotpotQA) around layer 20).
---
### Interpretation
- **Model Behavior**: The sharp decline in ΔP for the Q-Anchored series within the first 5 layers, followed by a plateau, suggests that the effect is concentrated in early layers.
- **Dataset Differences**: PopQA and TriviaQA show similar trends, while HotpotQA exhibits slightly more variability in later layers.
- **Q-Anchored vs. A-Anchored**: The Q-Anchored series show larger drops than the A-Anchored series (e.g., about -60 vs. -40 on PopQA for Llama-3.2-1B), i.e., they are more strongly affected.
- **NQ (Natural Questions)**: The Q-Anchored (NQ) line stays flat at 0, unlike the other Q-Anchored lines; no A-Anchored (NQ) or A-Anchored (HotpotQA) line appears in the legend as described.
---
### Spatial Grounding
- **Legends**: Positioned at the bottom of each graph, with labels aligned to the left. Colors and line styles match the corresponding data series.
- **Axes**: ΔP (y-axis) is on the left, Layer (x-axis) is at the bottom. Both axes are labeled clearly.
- **Data Series**: Lines are plotted with distinct styles (solid, dashed, dotted) and colors (blue, green, orange, red, purple, pink) as per the legend.
---
### Uncertainties
- Approximate ΔP values are estimated from the graph (e.g., ~-60, ~-50) due to the lack of explicit numerical markers. Minor fluctuations in later layers (e.g., Llama-3.2-3B) may introduce slight variability in trend interpretation.
</details>
<details>
<summary>x23.png Details</summary>

### Visual Description
## Line Graphs: Performance Comparison of Q-Anchored and A-Anchored Methods in LLaMA-3 Models
### Overview
The image contains two side-by-side line graphs comparing the performance degradation (ΔP) of Q-Anchored and A-Anchored methods across different layers in LLaMA-3-8B and LLaMA-3-70B models. The graphs visualize how performance changes (ΔP) vary with model depth for four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. Each line represents a specific anchoring method and dataset combination.
### Components/Axes
- **X-axis (Layer)**: Model depth, ranging from 0 to 30 for LLaMA-3-8B and 0 to 80 for LLaMA-3-70B.
- **Y-axis (ΔP)**: Performance change, measured in arbitrary units (range: -80 to 0).
- **Legends**:
- **LLaMA-3-8B** (left graph):
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: A-Anchored (HotpotQA)
- Solid gray: Q-Anchored (NQ)
- Dashed brown: A-Anchored (NQ)
- **LLaMA-3-70B** (right graph):
- Same color/style coding as above, with lines extending to layer 80.
### Detailed Analysis
#### LLaMA-3-8B (Left Graph)
- **Q-Anchored (PopQA)**: Starts at 0, slopes downward sharply to ~-80 by layer 30 (peak ΔP: -80).
- **A-Anchored (PopQA)**: Remains near 0 with minor fluctuations (ΔP: -2 to 0).
- **Q-Anchored (TriviaQA)**: Drops to ~-60 by layer 30 (ΔP: -60).
- **A-Anchored (TriviaQA)**: Stable near 0 (ΔP: -1 to 0).
- **Q-Anchored (HotpotQA)**: Declines to ~-70 (ΔP: -70).
- **A-Anchored (HotpotQA)**: Stable near 0 (ΔP: -2 to 0).
- **Q-Anchored (NQ)**: Plummets to ~-85 (ΔP: -85).
- **A-Anchored (NQ)**: Stable near 0 (ΔP: -1 to 0).
#### LLaMA-3-70B (Right Graph)
- **Q-Anchored (PopQA)**: Starts at 0, dips to ~-60 by layer 40, then stabilizes (ΔP: -60).
- **A-Anchored (PopQA)**: Fluctuates slightly above 0 (ΔP: -1 to 2).
- **Q-Anchored (TriviaQA)**: Drops to ~-50 by layer 60 (ΔP: -50).
- **A-Anchored (TriviaQA)**: Stable near 0 (ΔP: -1 to 1).
- **Q-Anchored (HotpotQA)**: Declines to ~-75 by layer 80 (ΔP: -75).
- **A-Anchored (HotpotQA)**: Stable near 0 (ΔP: -1 to 1).
- **Q-Anchored (NQ)**: Plummets to ~-90 by layer 80 (ΔP: -90).
- **A-Anchored (NQ)**: Stable near 0 (ΔP: -1 to 1).
### Key Observations
1. **Performance Degradation**: Q-Anchored methods show significant performance drops (ΔP) across all datasets, while A-Anchored methods remain stable (ΔP ≈ 0).
2. **Model Size Impact**: In LLaMA-3-70B the declines unfold over more layers than in LLaMA-3-8B, while the final magnitudes are comparable (e.g., NQ: about -90 vs. -85).
3. **Dataset Variability**: NQ (Natural Questions) shows the steepest decline for Q-Anchored methods, indicating higher sensitivity to anchoring choices.
4. **Layer Stability**: A-Anchored methods maintain near-zero ΔP across all layers, while Q-Anchored methods degrade sharply in early layers.
### Interpretation
The data indicate that the **A-Anchored series remain stable** across layers (ΔP ≈ 0), whereas the **Q-Anchored series drop substantially** in both models. The larger LLaMA-3-70B model does not eliminate the drop: its final values are comparable to those of LLaMA-3-8B, although they are reached over more layers. NQ shows the largest Q-Anchored drop in both models (about -85 and -90).
</details>
<details>
<summary>x24.png Details</summary>

### Visual Description
## Line Chart: ΔP vs Layer for Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two side-by-side line charts comparing Mistral-7B models (v0.1 and v0.3) across different datasets and anchoring pathways. Each chart tracks ΔP (y-axis) across layers 0 to 30 (x-axis). Eight data series are plotted per model version (Q-Anchored and A-Anchored for each of four datasets), differentiated by line style and color.
---
### Components/Axes
- **X-Axis (Layer)**: Ranges from 0 to 30, labeled "Layer".
- **Y-Axis (ΔP)**: Ranges from -80 to 0, labeled "ΔP".
- **Legends**:
- **Left Chart (v0.1)**:
- Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- **Right Chart (v0.3)**:
- Same legend structure as v0.1.
- **Line Styles/Colors**:
- PopQA: Blue (solid), Orange (dashed)
- TriviaQA: Green (solid), Red (dashed)
- HotpotQA: Purple (solid), Brown (dashed)
- NQ: Pink (solid), Gray (dashed)
---
### Detailed Analysis
#### Mistral-7B-v0.1
- **Q-Anchored (PopQA)**: Starts at ~0, drops sharply to ~-60 by layer 10, stabilizes at ~-60.
- **A-Anchored (PopQA)**: Starts at ~0, declines gradually to ~-40 by layer 30.
- **Q-Anchored (TriviaQA)**: Similar to PopQA but with sharper fluctuations (e.g., -50 to -70).
- **A-Anchored (TriviaQA)**: Less steep decline, ends near -50.
- **Q-Anchored (HotpotQA)**: Starts at ~0, drops to ~-70 by layer 20, then stabilizes.
- **A-Anchored (HotpotQA)**: Declines slowly to ~-50.
- **Q-Anchored (NQ)**: Minimal drop (-10 to -20).
- **A-Anchored (NQ)**: Stable near 0.
#### Mistral-7B-v0.3
- **Q-Anchored (PopQA)**: Starts at ~0, drops to ~-50 by layer 10, stabilizes.
- **A-Anchored (PopQA)**: Declines to ~-35 by layer 30.
- **Q-Anchored (TriviaQA)**: Slightly improved stability vs v0.1, ends near -60.
- **A-Anchored (TriviaQA)**: Ends near -45.
- **Q-Anchored (HotpotQA)**: Drops to ~-65 by layer 20, stabilizes.
- **A-Anchored (HotpotQA)**: Ends near -40.
- **Q-Anchored (NQ)**: Minimal drop (-5 to -15).
- **A-Anchored (NQ)**: Stable near 0.
---
### Key Observations
1. **Q-Anchored vs A-Anchored**: Q-Anchored methods consistently show steeper ΔP declines, especially in early layers (0–10).
2. **Model Version Comparison**: v0.3 generally shows ΔP values about 5 to 10 points closer to zero than v0.1 across datasets, i.e., smaller drops.
3. **Dataset Variability**:
- PopQA and TriviaQA show the largest ΔP drops.
- The NQ dataset exhibits minimal changes (Q-Anchored between about -5 and -20; A-Anchored near 0).
4. **Layer Dependency**: All datasets exhibit ΔP stabilization after layer 20, with v0.3 showing earlier convergence.
---
### Interpretation
The data show that the **Q-Anchored series** drop more steeply than the **A-Anchored series**, particularly in early layers (0 to 10). Here the A-Anchored series also decline noticeably on PopQA, TriviaQA, and HotpotQA (to about -35 to -50 by layer 30). Drops are slightly smaller in v0.3 than in v0.1. On NQ both series change little, and all curves level off after about layer 20.
</details>
Figure 12: $\Delta\mathrm{P}$ under attention knockout, probing mlp activations of the last exact answer token.
<details>
<summary>x25.png Details</summary>

### Visual Description
## Line Graphs: ΔP vs Layer for Qwen3-8B and Qwen3-32B Models
### Overview
The image contains two line graphs comparing ΔP across transformer layers for two sizes of the Qwen3 architecture: 8B (left) and 32B (right). Each graph shows eight data series: two pathways (Q-Anchored vs. A-Anchored) for each of four datasets (PopQA, TriviaQA, HotpotQA, NQ). The graphs reveal layer-wise variations in ΔP, with the Q-Anchored series exhibiting far larger fluctuations than their A-Anchored counterparts.
### Components/Axes
- **X-axis (Layer)**:
- Qwen3-8B: 0–30 (discrete increments)
- Qwen3-32B: 0–60 (discrete increments)
- **Y-axis (ΔP)**:
- Range: -100 to +20 (linear scale)
- Units: ΔP (delta-P, unspecified metric)
- **Legends**:
- **Q-Anchored**: Solid lines (blue, green, purple, pink)
- **A-Anchored**: Dashed lines (orange, brown, gray, black)
- **Datasets**:
- PopQA: Blue (solid) / Orange (dashed)
- TriviaQA: Green (solid) / Brown (dashed)
- HotpotQA: Purple (solid) / Gray (dashed)
- NQ: Pink (solid) / Black (dashed)
### Detailed Analysis
#### Qwen3-8B Graph
- **Q-Anchored (PopQA)**: Starts near 0, drops sharply to ~-80 at Layer 10, fluctuates between -60 and -20 until Layer 30.
- **A-Anchored (PopQA)**: Remains near 0 with minor oscillations (±5) throughout.
- **Q-Anchored (TriviaQA)**: Begins at 0, dips to ~-60 at Layer 15, stabilizes between -40 and -20.
- **A-Anchored (TriviaQA)**: Stays near 0 with slight dips to -5.
- **Q-Anchored (HotpotQA)**: Sharp drop to ~-70 at Layer 5, recovers to -30 by Layer 30.
- **A-Anchored (HotpotQA)**: Mild fluctuations between -5 and +5.
- **Q-Anchored (NQ)**: Oscillates between -50 and -10, peaking at -30 at Layer 20.
- **A-Anchored (NQ)**: Stable near 0 with minor ±3 variations.
#### Qwen3-32B Graph
- **Q-Anchored (PopQA)**: Starts at 0, plunges to ~-90 at Layer 10, recovers to -40 by Layer 60.
- **A-Anchored (PopQA)**: Remains near 0 with ±3 fluctuations.
- **Q-Anchored (TriviaQA)**: Drops to ~-70 at Layer 20, stabilizes between -50 and -30.
- **A-Anchored (TriviaQA)**: Stays near 0 with ±2 variations.
- **Q-Anchored (HotpotQA)**: Sharp decline to ~-85 at Layer 15, recovers to -50 by Layer 60.
- **A-Anchored (HotpotQA)**: Stable near 0 with ±1 fluctuations.
- **Q-Anchored (NQ)**: Oscillates between -80 and -20, peaking at -60 at Layer 40.
- **A-Anchored (NQ)**: Stays near 0 with ±2 variations.
### Key Observations
1. **Model Size Impact**: Qwen3-32B shows more extreme ΔP fluctuations than Qwen3-8B, particularly for Q-Anchored models.
2. **Anchoring Strategy**: A-Anchored models maintain near-zero ΔP across all layers and datasets, while Q-Anchored models exhibit significant layer-dependent variations.
3. **Dataset Sensitivity**:
- PopQA shows the deepest Q-Anchored drops (about -80 in Qwen3-8B and -90 in Qwen3-32B), followed by HotpotQA (about -70 and -85).
- NQ shows the largest amplitude fluctuations in Q-Anchored models.
4. **Layer Dynamics**:
- Early layers (0–10) show the most dramatic ΔP changes.
- Later layers (20–30/60) exhibit stabilization or partial recovery.
### Interpretation
The data suggest that the pathway (Q-Anchored vs. A-Anchored) largely determines how ΔP responds: the A-Anchored series stay within a few points of zero at every layer, while the Q-Anchored series drop by up to 80 to 90 points in early-to-middle layers and only partially recover. The drops are somewhat deeper in Qwen3-32B than in Qwen3-8B. Dataset-specific differences are visible but secondary to the pathway effect.
</details>
<details>
<summary>x26.png Details</summary>

### Visual Description
## Line Graphs: ΔP Trends Across Layers in Qwen3-8B and Qwen3-32B Models
### Overview
The image contains two side-by-side line graphs comparing the performance of Q-Anchored and A-Anchored methods across different layers in Qwen3-8B and Qwen3-32B models. The y-axis represents ΔP (change in performance), and the x-axis represents Layer numbers. Each graph includes multiple data series with distinct line styles and colors, representing different anchoring methods and datasets.
### Components/Axes
- **Y-Axis**: ΔP (Performance Change), ranging from -80 to 0 in both graphs.
- **X-Axis**: Layer, with Qwen3-8B spanning 0–30 layers and Qwen3-32B spanning 0–60 layers.
- **Legends**:
- **Qwen3-8B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: Q-Anchored (NQ)
- **Qwen3-32B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: Q-Anchored (NQ)
- **Shaded Areas**: Bands around each line, presumably confidence intervals; their exact width is not stated in the figure.
### Detailed Analysis
#### Qwen3-8B Graph
- **Q-Anchored (PopQA)**: Starts near 0, drops sharply to ~-80 by Layer 30 (blue solid line).
- **A-Anchored (PopQA)**: Remains near 0 throughout (orange dashed line).
- **Q-Anchored (TriviaQA)**: Starts at ~-20, declines to ~-70 by Layer 30 (green solid line).
- **A-Anchored (TriviaQA)**: Stays near 0 (red dashed line).
- **Q-Anchored (HotpotQA)**: Begins at ~-10, decreases to ~-75 by Layer 30 (purple solid line).
- **Q-Anchored (NQ)**: Starts at ~-15, declines to ~-70 by Layer 30 (pink dashed line).
#### Qwen3-32B Graph
- **Q-Anchored (PopQA)**: Starts near 0, drops to ~-80 by Layer 60 (blue solid line).
- **A-Anchored (PopQA)**: Remains near 0 (orange dashed line).
- **Q-Anchored (TriviaQA)**: Starts at ~-20, declines to ~-75 by Layer 60 (green solid line).
- **A-Anchored (TriviaQA)**: Stays near 0 (red dashed line).
- **Q-Anchored (HotpotQA)**: Begins at ~-10, decreases to ~-80 by Layer 60 (purple solid line).
- **Q-Anchored (NQ)**: Starts at ~-15, declines to ~-85 by Layer 60 (pink dashed line).
### Key Observations
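1. **Q-Anchored Methods**: All Q-Anchored lines trend downward across layers and end between ~-70 and ~-85. Final values are slightly lower in Qwen3-32B (~-75 to ~-85) than in Qwen3-8B (~-70 to ~-80), although the larger model reaches them over 60 layers rather than 30.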
2. **A-Anchored Methods**: All A-Anchored lines remain stable near 0, indicating minimal performance change.
3. **Confidence Intervals**: Shaded regions are wider for Q-Anchored methods, suggesting higher variability in performance measurements.
4. **Dataset-Specific Trends**:
- In Qwen3-8B, PopQA ends lowest (~-80), while TriviaQA and NQ end highest (~-70).
- In Qwen3-32B, NQ ends lowest (~-85), while TriviaQA ends highest (~-75).
### Interpretation
The data suggest that **Q-Anchored methods** are sensitive to layer depth, with ΔP becoming more negative as layers progress. Final values are slightly lower in the larger model (Qwen3-32B), where ΔP reaches ~-75 to ~-85. In contrast, **A-Anchored methods** stay near ΔP ≈ 0 at every layer. The wider shaded regions for Q-Anchored methods indicate greater variability in their measurements. Differences among datasets on the Q-Anchored lines are small (about 10 points at the final layer) and are not consistent across the two model sizes.
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
## Line Graph: ΔP vs. Layer for Qwen3 Models (8B and 32B)
### Overview
The image contains two side-by-side line graphs comparing ΔP for question-anchored (Q-Anchored) and answer-anchored (A-Anchored) series across layers in two Qwen3 variants: **Qwen3-8B** (left) and **Qwen3-32B** (right). The y-axis represents ΔP (change in performance), and the x-axis represents the layer number. Each graph includes multiple data series with distinct line styles and colors, as defined in the legend.
---
### Components/Axes
- **X-Axis (Layer)**:
- Labeled "Layer" for both subplots.
- Ranges from 0 to 30 (8B) and 0 to 60 (32B).
- **Y-Axis (ΔP)**:
- Labeled "ΔP" for both subplots.
- Ranges from -80 to 0.
- **Legends**:
- **Left Subplot (8B)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Dotted green: Q-Anchored (TriviaQA)
- Dash-dot red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: Q-Anchored (NQ)
- **Right Subplot (32B)**:
- Same legend as the left subplot (8B).
---
### Detailed Analysis
#### Qwen3-8B (Left Subplot)
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at 0, drops sharply to ~-60 by layer 10, then fluctuates between -60 and -40.
- Confidence interval (shaded area) widens slightly after layer 20.
2. **A-Anchored (PopQA)** (dashed orange):
- Remains near 0 throughout, with minimal fluctuation.
3. **Q-Anchored (TriviaQA)** (dotted green):
- Starts at ~-20, dips to ~-70 by layer 20, then stabilizes.
4. **A-Anchored (TriviaQA)** (dash-dot red):
- Starts at ~-10, dips to ~-50 by layer 20, then stabilizes.
5. **Q-Anchored (HotpotQA)** (solid purple):
- Starts at ~-10, dips to ~-50 by layer 20, then stabilizes.
6. **Q-Anchored (NQ)** (dashed pink):
- Starts at ~-10, dips to ~-70 by layer 20, then fluctuates between -70 and -50.
#### Qwen3-32B (Right Subplot)
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at 0, drops to ~-50 by layer 20, then stabilizes.
2. **A-Anchored (PopQA)** (dashed orange):
- Remains near 0 throughout.
3. **Q-Anchored (TriviaQA)** (dotted green):
- Starts at ~-30, dips to ~-70 by layer 40, then stabilizes.
4. **A-Anchored (TriviaQA)** (dash-dot red):
- Starts at ~-20, dips to ~-60 by layer 40, then stabilizes.
5. **Q-Anchored (HotpotQA)** (solid purple):
- Starts at ~-20, dips to ~-60 by layer 40, then stabilizes.
6. **Q-Anchored (NQ)** (dashed pink):
- Starts at ~-10, dips to ~-80 by layer 60, then fluctuates between -80 and -60.
---
### Key Observations
1. **Stability of A-Anchored Models**:
- The A-Anchored (PopQA) series remains near 0 across layers in both subplots. The A-Anchored (TriviaQA) series, by contrast, is listed above as dipping to ~-50 (8B) and ~-60 (32B).
2. **Volatility of Q-Anchored Models**:
- Q-Anchored models exhibit significant ΔP fluctuations, especially on NQ (Natural Questions).
3. **Layer-Specific Trends**:
- Layers 10–20 (8B) and 20–40 (32B) show the most pronounced performance drops for Q-Anchored models.
4. **Confidence Intervals**:
- Shaded areas around lines indicate uncertainty, which increases for Q-Anchored models in deeper layers.
---
### Interpretation
- **Anchoring Method Impact**:
- A-Anchored models (answer-focused) demonstrate stability, suggesting they are less sensitive to layer-specific variations.
- Q-Anchored models (question-focused) show higher variability, possibly due to the complexity of question-answering tasks.
- **Model Size Effects**:
- The 32B model exhibits more pronounced fluctuations than the 8B model, indicating that larger models may amplify the impact of anchoring methods.
- **NQ Task Challenges**:
- The Q-Anchored (NQ) line in both subplots shows the most erratic behavior, highlighting difficulties in handling open-ended questions.
- **Confidence Intervals**:
- Wider shaded regions for Q-Anchored models suggest greater uncertainty in performance measurements, particularly in deeper layers.
---
### Spatial Grounding
- **Legends**: Positioned at the bottom of each subplot, with clear color/style mappings.
- **Data Series**: Lines are plotted directly above their corresponding legend entries, with no overlap in color/style.
- **Axis Alignment**: Both subplots share identical axis labels and scales, enabling direct comparison.
---
### Content Details
- **Numerical Approximations**:
- ΔP values are estimated from the graph's scale (e.g., ~-60, ~-70) with ±5 uncertainty due to visual estimation.
- Layer numbers are exact (0–30 for 8B, 0–60 for 32B).
- **Text Embedding**: No additional text is present in the diagram beyond axis labels and legends.
---
### Final Notes
The graph emphasizes the trade-off between anchoring methods and model performance stability. A-Anchored models prioritize consistency, while Q-Anchored models trade stability for potential gains in specific tasks. The 32B model's increased layer count amplifies these trends, suggesting architectural complexity influences anchoring effectiveness.
</details>
Figure 13: $\Delta\mathrm{P}$ under attention knockout for reasoning models. Probing attention activations for the final token (top), the token immediately preceding the exact answer tokens (middle), and the last exact answer token (bottom).
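As a reading aid for the knockout figures, the following is a minimal sketch of attention knockout on a single attention head, assuming the intervention is realized by masking selected attention edges before the softmax. The function name `knockout_attention`, the toy scores, and the question and answer positions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def knockout_attention(scores, src_positions, tgt_positions):
    """Causal softmax attention with the edges tgt -> src removed.

    scores: (T, T) array of pre-softmax attention scores (row = query token).
    src_positions: token indices whose information is blocked (e.g. question tokens).
    tgt_positions: query token indices that may no longer attend to the sources.
    """
    T = scores.shape[-1]
    masked = scores.astype(float).copy()

    # Standard causal mask: a token cannot attend to later tokens.
    causal = np.tril(np.ones((T, T), dtype=bool))
    masked[~causal] = -np.inf

    # Knockout: cut every edge from a target query to a source key.
    masked[np.ix_(tgt_positions, src_positions)] = -np.inf

    # Numerically stable softmax over keys; exp(-inf) evaluates to 0.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


# Toy sequence of 6 tokens: positions 1-2 play the role of the question,
# positions 4-5 the role of the answer (hypothetical layout).
rng = np.random.default_rng(0)
scores = rng.normal(size=(6, 6))
question = [1, 2]
answer = [4, 5]
attn = knockout_attention(scores, question, answer)
```

After the knockout, the answer rows of `attn` place zero weight on the question columns while every row still sums to one, so the remaining attention mass is redistributed over the unblocked tokens.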
<details>
<summary>x28.png Details</summary>

### Visual Description
## Line Chart: ΔP Values Across Layers for Qwen3-8B and Qwen3-32B Models
### Overview
The image contains two side-by-side line charts comparing the ΔP (change in performance) values across layers for two model sizes: Qwen3-8B (left) and Qwen3-32B (right). Each chart tracks eight series (Q-Anchored and A-Anchored variants on four datasets: PopQA, TriviaQA, HotpotQA, and NQ) across model layers. The charts use color-coded lines with distinct styles to differentiate the series.
### Components/Axes
- **X-axis**: "Layer" (0 to 30 for Qwen3-8B, 0 to 60 for Qwen3-32B)
- **Y-axis**: "ΔP" (ranging from -100 to 0)
- **Legends**:
- **Qwen3-8B**:
- Blue solid: Q-Anchored (PopQA)
- Orange dashed: A-Anchored (PopQA)
- Green dash-dot: Q-Anchored (TriviaQA)
- Red dotted: A-Anchored (TriviaQA)
- Purple solid: Q-Anchored (HotpotQA)
- Pink dashed: A-Anchored (HotpotQA)
- Gray dash-dot: Q-Anchored (NQ)
- Brown dotted: A-Anchored (NQ)
- **Qwen3-32B**:
- Same legend as Qwen3-8B.
### Detailed Analysis
#### Qwen3-8B Chart
- **Q-Anchored (PopQA)** (blue solid): Starts near 0, drops sharply to ~-100 by layer 30, with oscillations.
- **A-Anchored (PopQA)** (orange dashed): Remains near 0 with minor fluctuations.
- **Q-Anchored (TriviaQA)** (green dash-dot): Gradual decline from 0 to ~-60, with volatility.
- **A-Anchored (TriviaQA)** (red dotted): Stable near 0.
- **Q-Anchored (HotpotQA)** (purple solid): Sharp drop to ~-80 by layer 20, then stabilizes.
- **A-Anchored (HotpotQA)** (pink dashed): Slight decline to ~-20, then stabilizes.
- **Q-Anchored (NQ)** (gray dash-dot): Oscillates between -20 and 0.
- **A-Anchored (NQ)** (brown dotted): Stable near 0.
#### Qwen3-32B Chart
- **Q-Anchored (PopQA)** (blue solid): Starts near 0, declines to ~-80 by layer 60, with oscillations.
- **A-Anchored (PopQA)** (orange dashed): Stable near 0.
- **Q-Anchored (TriviaQA)** (green dash-dot): Gradual decline to ~-60, with volatility.
- **A-Anchored (TriviaQA)** (red dotted): Stable near 0.
- **Q-Anchored (HotpotQA)** (purple solid): Sharp drop to ~-80 by layer 20, then stabilizes.
- **A-Anchored (HotpotQA)** (pink dashed): Slight decline to ~-20, then stabilizes.
- **Q-Anchored (NQ)** (gray dash-dot): Steady decline from 0 to ~-60.
- **A-Anchored (NQ)** (brown dotted): Stable near 0.
### Key Observations
1. **Model Size Impact**: The effect of model size on Q-Anchored methods is mixed. NQ declines further in Qwen3-32B (~-60, versus oscillating between -20 and 0 in Qwen3-8B), PopQA declines less (~-80 versus ~-100), and TriviaQA and HotpotQA end at similar values in both models.
2. **Dataset Sensitivity**:
- PopQA and HotpotQA datasets exhibit the largest ΔP drops for Q-Anchored methods.
- A-Anchored lines stay near 0 on PopQA, TriviaQA, and NQ; HotpotQA shows a slight decline to ~-20.
3. **Anchoring Effect**: A-Anchored methods (dashed and dotted lines) generally maintain ΔP values closer to 0 than Q-Anchored methods (solid and dash-dot lines).
4. **Layer Progression**: ΔP trends become more stable in deeper layers (layers >20) for both models.
### Interpretation
The data suggest that the anchoring method (Q-Anchored vs. A-Anchored) strongly influences ΔP, with Q-Anchored methods showing far larger declines across layers. Model size does not change this pattern uniformly: only NQ declines more in Qwen3-32B, moving from oscillation near 0 in Qwen3-8B to a steady decline to ~-60, which may reflect dataset-specific characteristics. A-Anchored methods remain close to 0 across layers and model sizes.
</details>
<details>
<summary>x29.png Details</summary>

### Visual Description
## Line Graphs: ΔP Trends Across Layers for Qwen3-8B and Qwen3-32B Models
### Overview
The image contains two line graphs comparing the performance of Qwen3-8B and Qwen3-32B models across layers (0–30 and 0–60, respectively) using different anchoring strategies (Q-Anchored vs. A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis represents ΔP (change in performance), and the x-axis represents model layers. Shaded regions indicate variability/confidence intervals.
---
### Components/Axes
- **X-Axis (Layer)**:
- Qwen3-8B: 0 to 30 (intervals of 10)
- Qwen3-32B: 0 to 60 (intervals of 20)
- **Y-Axis (ΔP)**:
- Range: -80 to 0 (negative values indicate performance degradation)
- Units: Not explicitly labeled, but ΔP implies relative change.
- **Legends**:
- **Qwen3-8B**:
- Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- **Qwen3-32B**:
- Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- Colors:
- Blue: PopQA
- Green: TriviaQA
- Purple: HotpotQA
- Red: NQ
---
### Detailed Analysis
#### Qwen3-8B Graph
- **Q-Anchored (Solid Lines)**:
- **PopQA**: Starts near 0, drops sharply to ~-80 by layer 30.
- **TriviaQA**: Begins at ~-20, declines to ~-70.
- **HotpotQA**: Starts at ~-10, falls to ~-75.
- **NQ**: Starts at ~-5, declines to ~-70.
- **A-Anchored (Dashed Lines)**:
- **PopQA**: Starts at 0, declines to ~-60.
- **TriviaQA**: Begins at ~-10, drops to ~-65.
- **HotpotQA**: Starts at ~-5, falls to ~-60.
- **NQ**: Starts at ~-2, declines to ~-60.
#### Qwen3-32B Graph
- **Q-Anchored (Solid Lines)**:
- **PopQA**: Starts near 0, drops to ~-80 by layer 60.
- **TriviaQA**: Begins at ~-20, declines to ~-75.
- **HotpotQA**: Starts at ~-10, falls to ~-70.
- **NQ**: Starts at ~-5, declines to ~-70.
- **A-Anchored (Dashed Lines)**:
- **PopQA**: Starts at 0, declines to ~-60.
- **TriviaQA**: Begins at ~-10, drops to ~-65.
- **HotpotQA**: Starts at ~-5, falls to ~-60.
- **NQ**: Starts at ~-2, declines to ~-60.
---
### Key Observations
1. **Q-Anchored vs. A-Anchored**:
- Q-Anchored models (solid lines) show steeper declines in ΔP across layers compared to A-Anchored (dashed lines), suggesting stronger dependency on question anchoring for performance.
- A-Anchored models exhibit more gradual declines, indicating greater stability in answer anchoring.
2. **Dataset Variability**:
- **PopQA** (blue) consistently shows the steepest decline for Q-Anchored models, implying higher sensitivity to question anchoring.
- **NQ** (red; Natural Questions) shows moderate declines, suggesting intermediate reliance on anchoring strategies.
3. **Model Size**:
- Qwen3-32B (larger model) exhibits similar trends to Qwen3-8B but with slightly less variability in ΔP, possibly due to increased capacity to mitigate anchoring effects.
4. **Shaded Regions**:
- Wider shaded areas in Qwen3-8B suggest higher uncertainty in smaller models, while Qwen3-32B shows tighter confidence intervals.
---
### Interpretation
- **Anchoring Strategy Impact**: Q-Anchored models degrade more rapidly with increasing layers, highlighting their reliance on question-level context. A-Anchored models, which anchor to answers, show more consistent performance, suggesting answer-level grounding is more robust.
- **Dataset Differences**: PopQA shows the steepest Q-Anchored decline (to ~-80), while NQ declines to ~-70. The gap may reflect differences in how the datasets are constructed, though the figure alone does not establish the cause.
- **Model Scaling**: Larger models (Qwen3-32B) maintain performance better across layers, indicating that increased parameter count helps stabilize anchoring effects. However, the fundamental trend (Q-Anchored > A-Anchored decline) persists, emphasizing architectural trade-offs in grounding strategies.
---
### Spatial Grounding & Cross-Reference
- **Legend Position**: Bottom of both graphs, aligned with x-axis.
- **Color Consistency**:
- Q-Anchored: Solid lines (blue, green, purple, red).
- A-Anchored: Dashed lines (blue, green, purple, red).
- Dataset colors match across both graphs (e.g., blue = PopQA in both 8B and 32B).
---
### Conclusion
The graphs demonstrate that anchoring strategy (Q vs. A) significantly influences layer-wise performance degradation, with Q-Anchored models being more sensitive. Dataset complexity and model size further modulate these effects, providing insights into the design of question-answering architectures.
</details>
<details>
<summary>x30.png Details</summary>

### Visual Description
## Line Graph: ΔP vs Layer for Qwen3-8B and Qwen3-32B Models
### Overview
The image contains two side-by-side line graphs comparing the performance of Q-Anchored and A-Anchored methods across different datasets (PopQA, TriviaQA, HotpotQA, NQ) for two versions of the Qwen3 model (8B and 32B parameters). The y-axis represents ΔP (change in performance), and the x-axis represents model layers. Each graph shows multiple colored lines with shaded confidence intervals.
### Components/Axes
- **Left Chart**: Qwen3-8B model
- **Right Chart**: Qwen3-32B model
- **Y-Axis**: ΔP (range: -80 to 0)
- **X-Axis**: Layer (0 to 30 for 8B, 0 to 60 for 32B)
- **Legend**: Located at the bottom, with eight entries:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted red: Q-Anchored (HotpotQA)
- Solid orange: A-Anchored (PopQA)
- Dashed purple: A-Anchored (TriviaQA)
- Dotted pink: A-Anchored (HotpotQA)
- Solid gray: A-Anchored (NQ)
- Dashed gray: Q-Anchored (NQ)
### Detailed Analysis
#### Qwen3-8B Chart
- **Q-Anchored Lines**:
- PopQA (solid blue): Starts at 0, declines sharply to ~-80 by layer 30 with oscillations.
- TriviaQA (dashed green): Similar trend to PopQA but less steep (-60 to -70 by layer 30).
- HotpotQA (dotted red): Gradual decline to ~-60 by layer 30.
- NQ (dashed gray): Sharpest drop to ~-90 by layer 30.
- **A-Anchored Lines**:
- PopQA (solid orange): Remains near 0 throughout.
- TriviaQA (dashed purple): Slight decline to ~-10 by layer 30.
- HotpotQA (dotted pink): Minimal change (~-5 by layer 30).
- NQ (solid gray): Stable near 0.
#### Qwen3-32B Chart
- **Q-Anchored Lines**:
- PopQA (solid blue): Starts at 0, drops to ~-80 by layer 60 with volatility.
- TriviaQA (dashed green): Declines to ~-70 by layer 60.
- HotpotQA (dotted red): Gradual decline to ~-60 by layer 60.
- NQ (dashed gray): Sharp drop to ~-90 by layer 60.
- **A-Anchored Lines**:
- PopQA (solid orange): Stable near 0.
- TriviaQA (dashed purple): Slight decline to ~-10 by layer 60.
- HotpotQA (dotted pink): Minimal change (~-5 by layer 60).
- NQ (solid gray): Stable near 0.
### Key Observations
1. **Q-Anchored vs A-Anchored**: Q-Anchored methods show significant ΔP degradation across layers, while A-Anchored methods remain stable.
2. **Model Size Impact**: Q-Anchored lines end at similar ΔP values in both models (~-60 to ~-90); the 32B model reaches them over 60 layers rather than 30.
3. **Dataset Sensitivity**: NQ dataset shows the steepest ΔP decline for Q-Anchored methods in both models.
4. **Confidence Intervals**: Shaded regions indicate variability, with Q-Anchored methods showing wider intervals in deeper layers.
### Interpretation
The data suggest that Q-Anchored methods are sensitive to layer depth, with ΔP becoming more negative as layers progress. The pattern is similar in both model sizes, with comparable final values. A-Anchored methods remain stable, with ΔP staying within about 10 points of 0. NQ consistently shows the largest Q-Anchored decline in both models. The results may reflect differences in how the two anchoring strategies interact with dataset characteristics.
</details>
Figure 14: $\Delta\mathrm{P}$ under attention knockout for reasoning models. Probing MLP activations for the final token (top), the token immediately preceding the exact answer tokens (middle), and the last exact answer token (bottom).
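To make the plotted quantity concrete, the sketch below computes a per-layer ΔP under one possible reading: the change, in percentage points, of probe accuracy when the knockout is applied relative to the clean run. This definition, the function name `delta_p`, and the toy predictions are assumptions made for illustration; the exact metric behind the figures may differ.

```python
import numpy as np


def delta_p(labels, clean_preds, knockout_preds):
    """Per-layer change in probe accuracy, in percentage points.

    labels: (N,) ground-truth truthfulness labels (0 or 1).
    clean_preds: (L, N) probe predictions per layer without intervention.
    knockout_preds: (L, N) probe predictions per layer under knockout.
    Assumed definition: 100 * (accuracy_knockout - accuracy_clean).
    """
    labels = np.asarray(labels)
    clean_acc = (np.asarray(clean_preds) == labels).mean(axis=-1)
    knockout_acc = (np.asarray(knockout_preds) == labels).mean(axis=-1)
    return 100.0 * (knockout_acc - clean_acc)


# Toy example with 2 layers and 4 samples (hypothetical values).
labels = [1, 0, 1, 1]
clean = [[1, 0, 1, 1],
         [1, 0, 1, 1]]
knocked = [[1, 0, 1, 1],   # layer 0: the probe is unaffected
           [0, 1, 1, 1]]   # layer 1: two predictions flip to wrong
dp = delta_p(labels, clean, knocked)
```

Under this reading, a line that stays near 0 means the probe is unaffected by the intervention at that layer, while strongly negative values mean the probe depends on the blocked information flow.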
<details>
<summary>x31.png Details</summary>

### Visual Description
## Line Graphs: LLaMA-3-2B-Instruct and LLaMA-3-8B-Instruct Performance Comparison
### Overview
The image contains two side-by-side line graphs comparing the performance of different anchoring methods (Q-Anchored and A-Anchored) across datasets (PopQA, TriviaQA, HotpotQA, NQ) in LLaMA-3-2B-Instruct and LLaMA-3-8B-Instruct models. The y-axis represents ΔP (change in performance), and the x-axis represents model layers. Each graph includes shaded regions indicating confidence intervals.
---
### Components/Axes
- **Left Graph (LLaMA-3-2B-Instruct)**:
- **X-axis**: Layer (0 to 25)
- **Y-axis**: ΔP (range: -100 to 0)
- **Legend**:
- Blue: Q-Anchored (PopQA)
- Green: Q-Anchored (TriviaQA)
- Red: Q-Anchored (HotpotQA)
- Purple: Q-Anchored (NQ)
- Dashed Orange: A-Anchored (PopQA)
- Dashed Green: A-Anchored (TriviaQA)
- Dashed Red: A-Anchored (HotpotQA)
- Dashed Purple: A-Anchored (NQ)
- **Right Graph (LLaMA-3-8B-Instruct)**:
- **X-axis**: Layer (0 to 30)
- **Y-axis**: ΔP (range: -100 to 0)
- **Legend**: Same as the left graph.
---
### Detailed Analysis
#### LLaMA-3-2B-Instruct (Left Graph)
1. **Q-Anchored (PopQA)** (Blue):
- Starts at 0 (layer 0), drops sharply to -60 by layer 25.
- Fluctuates between -40 and -60 in mid-layers (layers 5–15).
2. **Q-Anchored (TriviaQA)** (Green):
- Starts at 0, declines to -50 by layer 25.
- Shows moderate fluctuations (-30 to -50) in mid-layers.
3. **Q-Anchored (HotpotQA)** (Red):
- Starts at 0, drops to -40 by layer 25.
- Fluctuates between -20 and -40 in mid-layers.
4. **Q-Anchored (NQ)** (Purple):
- Starts at 0, declines to -70 by layer 25.
- Sharp drop to -70 in early layers (layers 5–10), then stabilizes.
5. **A-Anchored (PopQA)** (Dashed Orange):
- Starts at 0, ends at -20 by layer 25.
- Minimal fluctuations (-10 to -20).
6. **A-Anchored (TriviaQA)** (Dashed Green):
- Starts at 0, ends at -30 by layer 25.
- Slight dip to -25 in mid-layers.
7. **A-Anchored (HotpotQA)** (Dashed Red):
- Starts at 0, ends at -25 by layer 25.
- Stable with minor fluctuations (-15 to -25).
8. **A-Anchored (NQ)** (Dashed Purple):
- Starts at 0, ends at -40 by layer 25.
- Gradual decline with minor fluctuations (-20 to -40).
#### LLaMA-3-8B-Instruct (Right Graph)
1. **Q-Anchored (PopQA)** (Blue):
- Starts at 0, drops sharply to -100 by layer 30.
- Steep decline in early layers (layers 5–15), then stabilizes.
2. **Q-Anchored (TriviaQA)** (Green):
- Starts at 0, declines to -80 by layer 30.
- Sharp drop to -60 in early layers, then stabilizes.
3. **Q-Anchored (HotpotQA)** (Red):
- Starts at 0, drops to -60 by layer 30.
- Moderate decline (-40 to -60) in mid-layers.
4. **Q-Anchored (NQ)** (Purple):
- Starts at 0, drops to -90 by layer 30.
- Steep decline to -70 in early layers, then stabilizes.
5. **A-Anchored (PopQA)** (Dashed Orange):
- Starts at 0, ends at -40 by layer 30.
- Gradual decline (-20 to -40).
6. **A-Anchored (TriviaQA)** (Dashed Green):
- Starts at 0, ends at -50 by layer 30.
- Slight dip to -35 in mid-layers.
7. **A-Anchored (HotpotQA)** (Dashed Red):
- Starts at 0, ends at -35 by layer 30.
- Stable with minor fluctuations (-25 to -35).
8. **A-Anchored (NQ)** (Dashed Purple):
- Starts at 0, ends at -60 by layer 30.
- Gradual decline (-30 to -60).
---
### Key Observations
1. **Model Size Impact**:
- The 8B model shows steeper ΔP declines compared to the 2B model, especially for Q-Anchored methods.
- Example: Q-Anchored (NQ) in 8B drops to -90 vs. -70 in 2B.
2. **Anchoring Method Differences**:
- **Q-Anchored** methods exhibit larger ΔP drops, particularly for NQ and PopQA.
- **A-Anchored** methods show smaller ΔP drops across layers.
3. **Dataset Variability**:
- NQ shows the largest Q-Anchored drop in the 2B model (-70), while PopQA shows the largest in the 8B model (-100).
- HotpotQA has the smallest Q-Anchored drops in both models (-40 and -60).
4. **Confidence Intervals**:
- Shaded regions indicate variability in ΔP measurements. Larger models (8B) show wider confidence intervals, especially in Q-Anchored methods.
---
### Interpretation
- **Model Size and Performance**: The 8B model’s larger ΔP drops suggest that increased model size amplifies the impact of anchoring methods, particularly for complex datasets like NQ.
- **Anchoring Robustness**: A-Anchored methods demonstrate greater stability, implying they may be more effective in maintaining performance across layers.
- **Dataset Sensitivity**: NQ's large ΔP drops in both models suggest that it is especially sensitive to the intervention.
- **Layer-Specific Trends**: Early layers (0–10) show the most significant ΔP changes, indicating that anchoring methods have a stronger effect in initial processing stages.
This analysis underscores the importance of anchoring strategies in model performance, with A-Anchored methods offering potential advantages in stability and robustness.
</details>
<details>
<summary>x32.png Details</summary>

### Visual Description
## Line Chart: ΔP Across Layers for Mistral-7B-Instruct Models
### Overview
The image contains two side-by-side line charts comparing the ΔP metric across 30 layers of the Mistral-7B-Instruct model (versions v0.1 and v0.3). Each chart includes seven data series representing different anchoring methods (Q-Anchored and A-Anchored) across four datasets (PopQA, TriviaQA, HotpotQA, NQ). The charts show significant variation in ΔP values, with most lines trending downward as layer depth increases.
### Components/Axes
- **X-axis**: Layer (0 to 30, integer increments)
- **Y-axis**: ΔP (ranging from -80 to 0, with gridlines at -20, -40, -60)
- **Legends**:
- **Left Panel (v0.1)**:
- Blue solid: Q-Anchored (PopQA)
- Orange dashed: A-Anchored (PopQA)
- Green dotted: Q-Anchored (TriviaQA)
- Red dotted: A-Anchored (TriviaQA)
- Purple dash-dot: Q-Anchored (HotpotQA)
- Brown dashed: A-Anchored (HotpotQA)
- Pink dotted: Q-Anchored (NQ)
- **Right Panel (v0.3)**:
- Same legend as v0.1.
### Detailed Analysis
**Left Panel (v0.1)**:
1. **Q-Anchored (PopQA)**: Starts at 0, drops sharply to -60 by layer 10, then stabilizes with minor fluctuations.
2. **A-Anchored (PopQA)**: Begins at ~-5, declines gradually to -55 by layer 30.
3. **Q-Anchored (TriviaQA)**: Peaks at ~-10, declines to -65 by layer 30.
4. **A-Anchored (TriviaQA)**: Starts at ~-15, declines to -60 by layer 30.
5. **Q-Anchored (HotpotQA)**: Sharp drop to -70 by layer 10, then stabilizes.
6. **A-Anchored (HotpotQA)**: Gradual decline from ~-10 to -55.
7. **Q-Anchored (NQ)**: Starts at 0, declines to -75 by layer 30.
**Right Panel (v0.3)**:
1. **Q-Anchored (PopQA)**: Starts at 0, declines to -50 by layer 20, then stabilizes.
2. **A-Anchored (PopQA)**: Begins at ~-5, declines to -45 by layer 30.
3. **Q-Anchored (TriviaQA)**: Peaks at ~-10, declines to -55 by layer 30.
4. **A-Anchored (TriviaQA)**: Starts at ~-15, declines to -50 by layer 30.
5. **Q-Anchored (HotpotQA)**: Sharp drop to -60 by layer 10, then stabilizes.
6. **A-Anchored (HotpotQA)**: Gradual decline from ~-10 to -45.
7. **Q-Anchored (NQ)**: Starts at 0, declines to -65 by layer 30.
### Key Observations
1. **Version Comparison**: v0.3 shows less drastic declines in ΔP across most datasets compared to v0.1.
2. **Dataset Sensitivity**: HotpotQA consistently shows the steepest declines, suggesting higher sensitivity to anchoring methods.
3. **Anchoring Method Differences**: Q-Anchored methods generally exhibit steeper declines than A-Anchored counterparts.
4. **Layer Stability**: Both versions show stabilization after layer 20, with reduced volatility in deeper layers.
### Interpretation
The data suggests that anchoring methods (Q vs. A) significantly impact ΔP performance, with Q-Anchored approaches showing more pronounced declines. The v0.3 model demonstrates improved stability across datasets, particularly for HotpotQA, indicating potential architectural improvements. The consistent pattern of Q-Anchored methods underperforming compared to A-Anchored methods across datasets may reflect differences in how question vs. answer anchoring interacts with model architecture. The stabilization observed in deeper layers (post-layer 20) suggests that model performance converges or becomes less sensitive to anchoring strategies in later layers.
</details>
Figure 15: $\Delta\mathrm{P}$ under attention knockout for instruct models.
<details>
<summary>x33.png Details</summary>

### Visual Description
## Line Graphs: Llama-3.2-1B and Llama-3.2-3B Performance Across Layers
### Overview
The image contains two line graphs comparing the performance of different Q-Anchored and A-Anchored models across layers for two versions of the Llama-3.2 architecture (1B and 3B parameters). Each graph shows ΔP (change in performance) on the y-axis and Layer on the x-axis. Multiple data series are represented with distinct line styles and colors, with shaded regions indicating confidence intervals.
### Components/Axes
- **X-axis (Layer)**:
- Llama-3.2-1B: Layers 0–15 (discrete intervals).
- Llama-3.2-3B: Layers 0–25 (discrete intervals).
- **Y-axis (ΔP)**:
- Range: -15 to 0 (for Llama-3.2-3B) and -10 to 0 (for Llama-3.2-1B).
- **Legend**:
- Positioned at the bottom of both graphs.
- Entries:
- **Q-Anchored (PopQA)**: Solid blue.
- **A-Anchored (PopQA)**: Dashed orange.
- **Q-Anchored (TriviaQA)**: Solid green.
- **A-Anchored (TriviaQA)**: Dashed red.
- **Q-Anchored (HotpotQA)**: Solid purple.
- **A-Anchored (HotpotQA)**: Dashed brown.
- **Q-Anchored (NQ)**: Solid pink.
- **A-Anchored (NQ)**: Dashed gray.
### Detailed Analysis
#### Llama-3.2-1B Graph
- **Q-Anchored (PopQA)**: Starts near 0, peaks at ~2 (Layer 5), then drops to ~-5 (Layer 15).
- **A-Anchored (PopQA)**: Starts near 0, fluctuates between -1 and 1, ending near -2.
- **Q-Anchored (TriviaQA)**: Starts near 0, dips to ~-3 (Layer 10), then rises to ~-1.
- **A-Anchored (TriviaQA)**: Starts near 0, fluctuates between -1 and 1, ending near -1.
- **Q-Anchored (HotpotQA)**: Starts near 0, peaks at ~3 (Layer 5), then drops to ~-4 (Layer 15).
- **A-Anchored (HotpotQA)**: Starts near 0, fluctuates between -1 and 1, ending near -1.
- **Q-Anchored (NQ)**: Starts near 0, dips to ~-2 (Layer 10), then rises to ~-1.
- **A-Anchored (NQ)**: Starts near 0, fluctuates between -1 and 1, ending near -1.
#### Llama-3.2-3B Graph
- **Q-Anchored (PopQA)**: Starts near 0, peaks at ~2 (Layer 5), then drops to ~-5 (Layer 25).
- **A-Anchored (PopQA)**: Starts near 0, fluctuates between -1 and 1, ending near -2.
- **Q-Anchored (TriviaQA)**: Starts near 0, dips to ~-4 (Layer 10), then rises to ~-1.
- **A-Anchored (TriviaQA)**: Starts near 0, fluctuates between -1 and 1, ending near -1.
- **Q-Anchored (HotpotQA)**: Starts near 0, peaks at ~3 (Layer 5), then drops to ~-6 (Layer 25).
- **A-Anchored (HotpotQA)**: Starts near 0, fluctuates between -1 and 1, ending near -1.
- **Q-Anchored (NQ)**: Starts near 0, dips to ~-3 (Layer 10), then rises to ~-1.
- **A-Anchored (NQ)**: Starts near 0, fluctuates between -1 and 1, ending near -1.
### Key Observations
1. **Model Size Impact**:
- Llama-3.2-3B shows more pronounced fluctuations in ΔP compared to Llama-3.2-1B, particularly in HotpotQA and TriviaQA.
- The 3B model’s ΔP values are generally lower (more negative) in later layers.
2. **Dataset-Specific Trends**:
- **HotpotQA**: Q-Anchored models exhibit the largest drops in ΔP for both 1B and 3B, with sharper declines in the 3B version.
- **NQ**: Q-Anchored models show moderate dips, while A-Anchored models remain relatively stable.
- **PopQA/TriviaQA**: A-Anchored models demonstrate smaller variations compared to Q-Anchored counterparts.
3. **Confidence Intervals**:
- Shaded regions (likely 95% confidence intervals) are widest in the 3B model’s HotpotQA and TriviaQA series, indicating higher uncertainty in later layers.
### Interpretation
The data suggest that Q-Anchored series experience larger ΔP drops in deeper layers than A-Anchored series across all four datasets. This trend is more pronounced in the larger 3B model, particularly for HotpotQA. The stability of A-Anchored series across layers implies that they are largely unaffected by the intervention at either model size. However, the widening confidence intervals in the 3B model point to greater variability in the measurements. The consistent drops in the Q-Anchored HotpotQA series suggest that this dataset is the most sensitive of the four.
</details>
<details>
<summary>x34.png Details</summary>

### Visual Description
## Line Graphs: Llama-3-8B and Llama-3-70B Performance Trends
### Overview
The image contains two line graphs comparing performance degradation (ΔP) across transformer layers for two Llama-3 model variants (8B and 70B parameters). Each graph tracks multiple data series representing different question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ) and anchoring strategies (Q-Anchored vs. A-Anchored). The graphs show layer-wise performance shifts, with ΔP values plotted against layer indices.
### Components/Axes
- **X-axis (Layer)**:
- Left graph: 0–30 (Llama-3-8B)
- Right graph: 0–80 (Llama-3-70B)
- **Y-axis (ΔP)**:
- Left graph: -15 to 0
- Right graph: -30 to 0
- **Legend**:
- Colors/styles map to:
- Q-Anchored (PopQA): Solid blue
- Q-Anchored (TriviaQA): Dashed green
- Q-Anchored (HotpotQA): Dotted purple
- Q-Anchored (NQ): Dash-dot pink
- A-Anchored (PopQA): Solid red
- A-Anchored (TriviaQA): Dashed orange
- A-Anchored (HotpotQA): Dotted brown
- A-Anchored (NQ): Dash-dot gray
- **Placement**: Legends at bottom; left graph on left, right graph on right.
### Detailed Analysis
#### Llama-3-8B (Left Graph)
- **Trends**:
- All lines start near ΔP=0 at layer 0.
- Gradual decline to ΔP≈-10 by layer 30, with oscillations.
- Q-Anchored (HotpotQA) shows steepest drop (-12 to -15 range).
- A-Anchored (NQ) remains closest to 0 (-2 to -4 range).
- **Key Data Points**:
- Q-Anchored (PopQA): 0 → -8 (layer 30)
- A-Anchored (HotpotQA): 0 → -5 (layer 30)
#### Llama-3-70B (Right Graph)
- **Trends**:
- Steeper overall decline than 8B variant.
- ΔP reaches -25 to -30 in later layers (60–80).
- Q-Anchored (HotpotQA) drops most sharply (-30 at layer 80).
- A-Anchored (NQ) stabilizes at ΔP≈-10.
- **Key Data Points**:
- Q-Anchored (PopQA): 0 → -20 (layer 80)
- A-Anchored (HotpotQA): 0 → -25 (layer 80)
### Key Observations
1. **Model Size Impact**: The larger model (70B) reaches ΔP magnitudes about 2× or more those of the 8B model (e.g., Q-Anchored PopQA ends near -20 vs. -8; the y-axis spans -30 to 0 vs. -15 to 0).
2. **Anchoring Strategy**: Q-Anchored series generally lie below A-Anchored series, with HotpotQA showing the most negative Q-Anchored values (about -30 at layer 80 in 70B).
3. **Dataset Sensitivity**: HotpotQA induces the steepest performance degradation, followed by TriviaQA and NQ.
4. **Layer Dynamics**: Performance degradation accelerates in deeper layers (layers 40–80 in 70B).
### Interpretation
The data demonstrates that:
- **Model Scale Amplifies Degradation**: Larger models (70B) suffer more severe performance drops, particularly in complex reasoning tasks (HotpotQA).
- **Anchoring Matters**: A-Anchored samples show smaller drops than Q-Anchored samples, suggesting that the two groups respond differently to the knockout.
- **Task Complexity Correlation**: HotpotQA’s steep decline implies it relies more heavily on early-layer representations, which degrade faster in deeper models.
- **Stability in NQ**: Minimal ΔP for NQ suggests it depends less on layer-specific features, possibly relying on surface-level patterns.
This analysis highlights trade-offs between model scale, architectural design, and task-specific performance in transformer-based QA systems.
</details>
<details>
<summary>x35.png Details</summary>

### Visual Description
## Line Graph: ΔP vs. Layer for Mistral-7B Models v0.1 and v0.3
### Overview
The image contains two side-by-side line graphs comparing the performance of different anchoring methods (Q-Anchored and A-Anchored) across layers (0–30) in two versions of the Mistral-7B model (v0.1 and v0.3). The y-axis represents ΔP (change in performance), and the x-axis represents model layers. Each graph includes multiple data series with distinct line styles and colors, representing combinations of anchoring methods and datasets (e.g., PopQA, TriviaQA, HotpotQA, NQ).
### Components/Axes
- **Y-Axis**: ΔP (change in performance), ranging from -20 to 0.
- **X-Axis**: Layer (0–30), representing model depth.
- **Legends**:
- **Left Panel (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted red: A-Anchored (PopQA)
- Dash-dot purple: A-Anchored (TriviaQA)
- **Right Panel (v0.3)**:
- Solid blue: Q-Anchored (HotpotQA)
- Dashed green: Q-Anchored (NQ)
- Dotted red: A-Anchored (HotpotQA)
- Dash-dot purple: A-Anchored (NQ)
- **Shaded Regions**: Error margins or confidence intervals around each line.
### Detailed Analysis
#### Left Panel (Mistral-7B-v0.1):
1. **Q-Anchored (PopQA)**: Solid blue line starts near 0, dips sharply to ~-15 at layer 15, then fluctuates upward.
2. **Q-Anchored (TriviaQA)**: Dashed green line remains relatively stable, with minor dips to ~-5.
3. **A-Anchored (PopQA)**: Dotted red line shows gradual decline to ~-10, with a sharp drop at layer 25.
4. **A-Anchored (TriviaQA)**: Dash-dot purple line fluctuates minimally, staying near 0.
#### Right Panel (Mistral-7B-v0.3):
1. **Q-Anchored (HotpotQA)**: Solid blue line starts near 0, dips to ~-10 at layer 10, then stabilizes.
2. **Q-Anchored (NQ)**: Dashed green line shows erratic fluctuations, peaking at ~-5 and dropping to ~-15 at layer 30.
3. **A-Anchored (HotpotQA)**: Dotted red line declines steadily to ~-15, with a sharp drop at layer 25.
4. **A-Anchored (NQ)**: Dash-dot purple line remains stable, with minor dips to ~-5.
### Key Observations
- **Layer-Specific Variability**: Sharp drops (e.g., layer 15 in v0.1, layer 25 in v0.3) suggest critical layer interactions affecting ΔP.
- **Dataset Impact**: Methods using HotpotQA and NQ datasets exhibit larger ΔP fluctuations compared to PopQA and TriviaQA.
- **Model Version Differences**: v0.3 shows more pronounced dips in A-Anchored methods, indicating architectural changes.
- **Error Margins**: Shaded regions highlight inconsistency in Q-Anchored (NQ) and A-Anchored (HotpotQA) across layers.
### Interpretation
The data suggests that anchoring methods significantly influence ΔP, with dataset choice and model version amplifying these effects. For example:
- **Q-Anchored (PopQA)** in v0.1 shows the most drastic performance drop, possibly due to layer-specific dependencies.
- **A-Anchored (HotpotQA)** in v0.3 exhibits the largest cumulative ΔP decline, hinting at architectural sensitivity in deeper layers.
- The stability of TriviaQA and NQ in A-Anchored methods suggests robustness in certain configurations.
The shaded regions indicate that performance variability is dataset-dependent, with HotpotQA and NQ showing higher uncertainty. These trends may reflect differences in question complexity or answer diversity across datasets. Further investigation into layer-specific mechanisms (e.g., attention patterns) could clarify these effects.
</details>
Figure 16: $\Delta\mathrm{P}$ under attention knockout with randomly masked question tokens. Unlike selectively blocking the exact question tokens, both Q-Anchored and A-Anchored samples exhibit similar patterns, with substantially smaller probability changes when question tokens are masked at random. This suggests that exact question tokens play a critical role in conveying the semantic information of core frame elements.
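The contrast described in the caption, blocking the exact question tokens versus blocking randomly chosen tokens, can be illustrated with a minimal sketch. The helper names (`causal_mask`, `knock_out`, `sample_random_positions`, `delta_p`), the half-open span convention, and the definition of $\Delta\mathrm{P}$ as the probability after knockout minus the probability before are all assumptions made for illustration; they are not taken from the paper's implementation.

```python
import random

def causal_mask(seq_len):
    """Boolean attention mask: entry [tgt][src] is True if tgt may attend to src."""
    return [[src <= tgt for src in range(seq_len)] for tgt in range(seq_len)]

def knock_out(mask, target_positions, blocked_sources):
    """Return a copy of mask in which each target can no longer attend to the blocked sources."""
    new_mask = [row[:] for row in mask]
    for tgt in target_positions:
        for src in blocked_sources:
            if src != tgt:  # keep self-attention intact
                new_mask[tgt][src] = False
    return new_mask

def sample_random_positions(span, k, seed=0):
    """Pick k distinct positions from the half-open span [start, end) (hypothetical control)."""
    start, end = span
    rng = random.Random(seed)
    return sorted(rng.sample(range(start, end), k))

def delta_p(p_clean, p_knockout):
    """Assumed definition: probability after knockout minus probability before."""
    return p_knockout - p_clean

# Toy sequence of 6 tokens; positions 1-2 stand in for the exact question tokens.
exact = knock_out(causal_mask(6), target_positions=[5], blocked_sources=[1, 2])
control = knock_out(causal_mask(6), target_positions=[5],
                    blocked_sources=sample_random_positions((0, 5), k=2))
```

Using the same number of blocked positions in both conditions keeps the comparison fair: any difference in $\Delta\mathrm{P}$ can then be attributed to which tokens were blocked rather than to how many.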
Appendix D Token Patching
<details>
<summary>x36.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3.2 Models
### Overview
The image presents a comparative bar chart analyzing prediction flip rates for two versions of the Llama-3.2 model (1B and 3B parameter sizes) across four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. Two anchoring strategies are compared: Q-Anchored (exact_question) and A-Anchored (exact_question), represented by red and gray bars respectively.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right)
- **Y-Axis (Prediction Flip Rate)**: Scaled from 0 to 80
- **Legend**:
- Red = Q-Anchored (exact_question)
- Gray = A-Anchored (exact_question)
- **Model Versions**:
- Left section = Llama-3.2-1B
- Right section = Llama-3.2-3B
### Detailed Analysis
#### Llama-3.2-1B (Left Section)
| Dataset | Q-Anchored (exact_question) | A-Anchored (exact_question) |
|--------------|-----------------------------|-----------------------------|
| PopQA | ~78 | ~12 |
| TriviaQA | ~68 | ~28 |
| HotpotQA | ~40 | ~4 |
| NQ | ~48 | ~6 |
#### Llama-3.2-3B (Right Section)
| Dataset | Q-Anchored (exact_question) | A-Anchored (exact_question) |
|--------------|-----------------------------|-----------------------------|
| PopQA | ~60 | ~12 |
| TriviaQA | ~76 | ~26 |
| HotpotQA | ~66 | ~12 |
| NQ | ~76 | ~35 |
### Key Observations
1. **Q-Anchored Dominance**: Q-Anchored (red) flip rates exceed A-Anchored (gray) rates for every dataset and both models, by a factor of roughly 2 to 10 (e.g., ~76 vs. ~35 for NQ on 3B; ~40 vs. ~4 for HotpotQA on 1B).
2. **Model Size Impact**: From 1B to 3B, Q-Anchored rates rise for TriviaQA (~68 → ~76), HotpotQA (~40 → ~66), and NQ (~48 → ~76), but fall for PopQA (~78 → ~60).
3. **A-Anchored Anomaly**: The A-Anchored flip rate on NQ rises from ~6 (1B) to ~35 (3B), roughly a sixfold increase, while A-Anchored rates on PopQA and TriviaQA barely change.
4. **Dataset Variance**: The gap between the two groups is largest for PopQA on 1B (~66 points) and for HotpotQA on 3B (~54 points).
### Interpretation
The data show that token patching flips predictions far more often for Q-Anchored samples than for A-Anchored samples. As the legend label (exact_question) suggests, the patch targets the exact question tokens, so this pattern is consistent with Q-Anchored predictions depending on question information. Model size has no uniform effect: Q-Anchored rates rise from 1B to 3B on TriviaQA, HotpotQA, and NQ but fall on PopQA. The jump in the A-Anchored rate on NQ for the 3B model (~6 → ~35) is the main exception to the otherwise low A-Anchored rates and may reflect a dataset-specific interaction.
</details>
<details>
<summary>x37.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3-8B and Llama-3-70B Models
### Overview
The image is a grouped bar chart comparing prediction flip rates for two language models (Llama-3-8B and Llama-3-70B) across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). Two anchoring strategies are compared: Q-Anchored (exact_question) and A-Anchored (exact_question), represented by red and gray bars respectively.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (repeated for both models)
- **Y-Axis (Prediction Flip Rate)**: 0–80% scale
- **Legend**:
- Red = Q-Anchored (exact_question)
- Gray = A-Anchored (exact_question)
- **Model Labels**:
- Top-left: Llama-3-8B
- Top-right: Llama-3-70B
### Detailed Analysis
#### Llama-3-8B Section
- **Q-Anchored (red)**:
- PopQA: ~70%
- TriviaQA: ~85% (highest)
- HotpotQA: ~45%
- NQ: ~70%
- **A-Anchored (gray)**:
- PopQA: ~15%
- TriviaQA: ~50%
- HotpotQA: ~5%
- NQ: ~20%
#### Llama-3-70B Section
- **Q-Anchored (red)**:
- PopQA: ~80%
- TriviaQA: ~70%
- HotpotQA: ~25%
- NQ: ~85% (highest)
- **A-Anchored (gray)**:
- PopQA: ~20%
- TriviaQA: ~40%
- HotpotQA: ~2%
- NQ: ~45%
### Key Observations
1. **Q-Anchored Dominance**: Q-Anchored flip rates exceed A-Anchored rates across all datasets and models (e.g., Llama-3-70B NQ: 85% vs 45%).
2. **Model Size Impact**: The effect of model size is mixed. Llama-3-70B has higher Q-Anchored rates than Llama-3-8B on PopQA (80% vs 70%) and NQ (85% vs 70%), but lower rates on TriviaQA (70% vs 85%) and HotpotQA (25% vs 45%).
3. **Dataset Variability**:
- TriviaQA and NQ yield the highest Q-Anchored rates.
- HotpotQA has the lowest Q-Anchored rates (25% for Llama-3-70B).
4. **A-Anchored Performance**: A-Anchored rates are generally low (at most ~50%), with HotpotQA near 0% for Llama-3-70B.
### Interpretation
The data show that **Q-Anchored (exact_question)** samples have substantially higher prediction flip rates than **A-Anchored (exact_question)** samples across both model sizes. This suggests that the truthfulness predictions for Q-Anchored samples depend strongly on the patched question tokens, whereas those for A-Anchored samples largely do not. The contrast is sharpest on HotpotQA, where the A-Anchored rate is near 0% for Llama-3-70B. Model size has no uniform effect: the Q-Anchored rate rises from 8B to 70B on PopQA and NQ but falls on TriviaQA and HotpotQA.
</details>
<details>
<summary>x38.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Mistral-7B Models
### Overview
The image presents a comparative bar chart analyzing prediction flip rates for two versions of the Mistral-7B language model (v0.1 and v0.3) across four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. The chart contrasts two anchoring strategies: Q-Anchored (exact_question) and A-Anchored (exact_question), visualized through red and gray bars respectively.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right)
- **Y-Axis (Prediction Flip Rate)**: Scaled from 0 to 100
- **Legend**:
- Red bars: Q-Anchored (exact_question)
- Gray bars: A-Anchored (exact_question)
- **Model Versions**:
- Left section: Mistral-7B-v0.1
- Right section: Mistral-7B-v0.3
### Detailed Analysis
#### Mistral-7B-v0.1
- **PopQA**:
- Q-Anchored: ~85
- A-Anchored: ~35
- **TriviaQA**:
- Q-Anchored: ~85
- A-Anchored: ~50
- **HotpotQA**:
- Q-Anchored: ~60
- A-Anchored: ~10
- **NQ**:
- Q-Anchored: ~85
- A-Anchored: ~55
#### Mistral-7B-v0.3
- **PopQA**:
- Q-Anchored: ~75
- A-Anchored: ~45
- **TriviaQA**:
- Q-Anchored: ~90
- A-Anchored: ~50
- **HotpotQA**:
- Q-Anchored: ~70
- A-Anchored: ~10
- **NQ**:
- Q-Anchored: ~85
- A-Anchored: ~35
### Key Observations
1. **Consistent Q-Anchored Superiority**: Q-Anchored (red) bars are higher than A-Anchored (gray) bars across all datasets and models, with differences ranging from about 30 to 60 percentage points.
2. **Version-Specific Trends**:
- **TriviaQA**: v0.3 shows a 5-point increase in the Q-Anchored rate (85→90) compared to v0.1.
- **HotpotQA**: v0.3 raises the Q-Anchored rate by 10 points (60→70) while the A-Anchored rate stays the same (10).
- **NQ**: v0.3 shows a 20-point drop in the A-Anchored rate (55→35) while the Q-Anchored rate stays at ~85.
3. **Dataset Variability**:
- HotpotQA exhibits the largest gap between the two groups in v0.3 (~70 vs. ~10) and ties PopQA for the largest in v0.1 (~50 points each).
- NQ shows the smallest gap in v0.1 (~85 vs. ~55); in v0.3 the smallest gap is on PopQA (~75 vs. ~45).
### Interpretation
The data show that Q-Anchored (exact_question) samples consistently have higher prediction flip rates than A-Anchored (exact_question) samples across both model versions. Between v0.1 and v0.3, Q-Anchored rates rise by about 5 points on TriviaQA (~85 → ~90) and 10 points on HotpotQA (~60 → ~70), fall by about 10 points on PopQA (~85 → ~75), and are unchanged on NQ (~85). A-Anchored rates rise on PopQA (~35 → ~45) and fall on NQ (~55 → ~35). These shifts are not uniform in direction, so the chart alone does not indicate a systematic effect of the model version.
</details>
Figure 17: Prediction flip rate under token patching, probing attention activations of the final token.
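The metric in this caption can be made concrete with a small sketch. It assumes that the prediction flip rate is the percentage of samples whose binary probe prediction differs before and after patching; the function name `prediction_flip_rate` and the toy labels are hypothetical and only illustrate the computation.

```python
def prediction_flip_rate(before, after):
    """Percentage of samples whose predicted label changes after patching.

    `before` and `after` are equal-length sequences of predicted labels
    (e.g., 1 = truthful, 0 = hallucinated) for the same samples.
    """
    if len(before) != len(after):
        raise ValueError("prediction lists must have the same length")
    if not before:
        raise ValueError("at least one sample is required")
    flips = sum(b != a for b, a in zip(before, after))
    return 100.0 * flips / len(before)

# Toy example: two of four predictions change after patching.
rate = prediction_flip_rate([1, 0, 1, 1], [0, 0, 1, 0])
print(rate)  # 50.0
```

Under this assumed definition, a high flip rate means the probe's prediction depends heavily on the patched tokens, which is how the Q-Anchored and A-Anchored bars in the figure are compared.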
<details>
<summary>x39.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3.2 Models
### Overview
The image presents two side-by-side bar charts comparing prediction flip rates for two versions of the Llama-3.2 model (1B and 3B parameter sizes) across four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. Each dataset is evaluated under two anchoring methods: Q-Anchored (exact_question) and A-Anchored (exact_question), represented by red and gray bars respectively.
### Components/Axes
- **X-axis**: Datasets (PopQA, TriviaQA, HotpotQA, NQ)
- **Y-axis**: Prediction Flip Rate (0–40 scale)
- **Legend**:
- Red bars: Q-Anchored (exact_question)
- Gray bars: A-Anchored (exact_question)
- **Chart Titles**:
- Left: "Llama-3.2-1B"
- Right: "Llama-3.2-3B"
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **Q-Anchored (red)**:
- PopQA: ~45
- TriviaQA: ~40
- HotpotQA: ~30
- NQ: ~15
- **A-Anchored (gray)**:
- PopQA: ~10
- TriviaQA: ~12
- HotpotQA: ~5
- NQ: ~2
#### Llama-3.2-3B (Right Chart)
- **Q-Anchored (red)**:
- PopQA: ~25
- TriviaQA: ~40
- HotpotQA: ~40
- NQ: ~45
- **A-Anchored (gray)**:
- PopQA: ~5
- TriviaQA: ~22
- HotpotQA: ~10
- NQ: ~28
### Key Observations
1. **Model Size Impact**: The effect of model size is mixed. From 1B to 3B, flip rates rise on HotpotQA and NQ for both groups and on TriviaQA for A-Anchored (~12 → ~22), while the Q-Anchored rate on TriviaQA is unchanged (~40) and both PopQA rates fall (~45 → ~25 and ~10 → ~5).
2. **Anchoring Method Performance**: Q-Anchored (red) exceeds A-Anchored (gray) in both models, with the largest gaps on PopQA for 1B (~45 vs ~10) and on HotpotQA for 3B (~40 vs ~10).
3. **Dataset Variability**:
- NQ has the highest Q-Anchored flip rate in the 3B model (~45) but the lowest in the 1B model (~15).
- A-Anchored reaches its highest rates in the 3B model, on TriviaQA (~22) and NQ (~28).
4. **Trend Patterns**:
- For Llama-3.2-1B, Q-Anchored rates decrease from PopQA to NQ, while A-Anchored rates peak at TriviaQA.
- For Llama-3.2-3B, Q-Anchored rates increase from PopQA to NQ, with A-Anchored peaking at NQ.
### Interpretation
The data suggests that:
- Model size (3B vs 1B) has no uniform effect on flip rates: they rise on HotpotQA and NQ but fall on PopQA.
- Q-Anchored (exact_question) samples consistently show higher flip rates than A-Anchored (exact_question) samples, suggesting that their predictions depend more on the patched question tokens.
- NQ shows the strongest dependence on model size, with the Q-Anchored rate rising from ~15 (1B) to ~45 (3B).
- The A-Anchored rate on NQ is also unexpectedly high for the 3B model (~28), so A-Anchored samples are not entirely insensitive to the patch in larger models.
This analysis highlights the importance of anchoring strategy and model scale in question-answering systems, with implications for optimizing model performance across different datasets.
</details>
<details>
<summary>x40.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3-8B and Llama-3-70B Models
### Overview
The image contains two side-by-side bar charts comparing prediction flip rates for two language models (Llama-3-8B and Llama-3-70B) across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). Each model is evaluated using two anchoring methods: Q-Anchored (exact_question) and A-Anchored (exact_question), represented by distinct colors (red for Q-Anchored, gray for A-Anchored).
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (categorical, evenly spaced).
- **Y-Axis (Prediction Flip Rate)**: Percentage scale from 0 to 80 (linear, increments of 20).
- **Legend**: Located at the bottom center, with red bars labeled "Q-Anchored (exact_question)" and gray bars labeled "A-Anchored (exact_question)".
- **Model Labels**: "Llama-3-8B" (top-left chart) and "Llama-3-70B" (top-right chart).
### Detailed Analysis
#### Llama-3-8B Model
- **Q-Anchored (red)**:
- PopQA: ~40%
- TriviaQA: ~70% (highest value)
- HotpotQA: ~40%
- NQ: ~45%
- **A-Anchored (gray)**:
- PopQA: ~10%
- TriviaQA: ~50%
- HotpotQA: ~5% (lowest value)
- NQ: ~15%
#### Llama-3-70B Model
- **Q-Anchored (red)**:
- PopQA: ~40%
- TriviaQA: ~90% (highest value)
- HotpotQA: ~60%
- NQ: ~40%
- **A-Anchored (gray)**:
- PopQA: ~30%
- TriviaQA: ~65%
- HotpotQA: ~15%
- NQ: ~25%
### Key Observations
1. **Q-Anchored vs. A-Anchored**: Q-Anchored consistently shows higher prediction flip rates than A-Anchored for both models across all datasets.
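2. **Model Size Impact**: Llama-3-70B has higher Q-Anchored rates than Llama-3-8B on TriviaQA (90% vs. 70%) and HotpotQA (60% vs. 40%), an equal rate on PopQA (~40%), and a lower rate on NQ (40% vs. 45%). A-Anchored rates are higher for 70B on all four datasets (e.g., TriviaQA: 65% vs. 50%).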
3. **Dataset Variability**: TriviaQA exhibits the highest flip rates for both models, while HotpotQA has the lowest A-Anchored rates.
4. **NQ Dataset**: Shows moderate performance, with Llama-3-70B achieving ~40% (Q-Anchored) vs. ~25% (A-Anchored).
### Interpretation
The data suggest that **Q-Anchored** samples (exact_question) have higher prediction flip rates than A-Anchored samples, i.e., their predictions depend more on the patched question tokens. The larger Llama-3-70B model shows higher Q-Anchored rates on TriviaQA and HotpotQA but not on PopQA or NQ, so model size has no uniform effect. A-Anchored rates are lower in every case, although they reach ~50–65% on TriviaQA. Since the ordering between the two groups holds for both model sizes, the sample type appears to matter more for flip rates than model capacity does. TriviaQA shows the highest flip rates for both groups and both models.
</details>
<details>
<summary>x41.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Mistral-7B Models (v0.1 vs v0.3)
### Overview
The chart compares prediction flip rates (in percentage) for two versions of the Mistral-7B language model (v0.1 and v0.3) across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). Two anchoring methods are evaluated: **Q-Anchored (exact_question)** (red bars) and **A-Anchored (exact_question)** (gray bars). The y-axis ranges from 0% to 60%, with error bars indicating uncertainty.
---
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right).
- **Y-Axis (Prediction Flip Rate)**: Percentage scale (0–60%).
- **Legend**:
- Red = Q-Anchored (exact_question)
- Gray = A-Anchored (exact_question)
- **Model Versions**:
- Left group = Mistral-7B-v0.1
- Right group = Mistral-7B-v0.3
---
### Detailed Analysis
#### Mistral-7B-v0.1
- **PopQA**:
- Q-Anchored: ~65% (±2%)
- A-Anchored: ~20% (±3%)
- **TriviaQA**:
- Q-Anchored: ~63% (±1%)
- A-Anchored: ~30% (±2%)
- **HotpotQA**:
- Q-Anchored: ~55% (±3%)
- A-Anchored: ~10% (±1%)
- **NQ**:
- Q-Anchored: ~58% (±2%)
- A-Anchored: ~42% (±3%)
#### Mistral-7B-v0.3
- **PopQA**:
- Q-Anchored: ~58% (±2%)
- A-Anchored: ~20% (±2%)
- **TriviaQA**:
- Q-Anchored: ~63% (±1%)
- A-Anchored: ~28% (±2%)
- **HotpotQA**:
- Q-Anchored: ~62% (±1%)
- A-Anchored: ~20% (±1%)
- **NQ**:
- Q-Anchored: ~58% (±2%)
- A-Anchored: ~47% (±3%)
---
### Key Observations
1. **Q-Anchored Consistency**:
- Q-Anchored rates are unchanged in v0.3 for TriviaQA (~63%) and NQ (~58%), decrease for PopQA (65% → 58%), and increase for HotpotQA (55% → 62%).
- NQ shows no change in Q-Anchored performance between versions (~58% in both).
2. **A-Anchored Variability**:
- A-Anchored rates increase in v0.3 for NQ (+5 points, to 47%) and HotpotQA (+10 points, to 20%), stay at ~20% for PopQA, and decrease slightly for TriviaQA (30% → 28%).
3. **Dataset-Specific Trends**:
- **NQ** exhibits the highest A-Anchored flip rates in both versions (~42% in v0.1, ~47% in v0.3), suggesting it is more sensitive to anchoring methods.
- **HotpotQA** shows the largest gap between the two groups in v0.3 (~62% Q vs. ~20% A) and ties PopQA for the largest in v0.1 (~45 points each).
---
### Interpretation
The data show that **Q-Anchored (exact_question)** samples consistently have higher flip rates than A-Anchored (exact_question) samples across both model versions. Q-Anchored rates change little between versions, apart from a decrease on PopQA (~65% → ~58%) and an increase on HotpotQA (~55% → ~62%). A-Anchored rates rise on HotpotQA (+10 points) and NQ (+5 points) in v0.3. NQ has the smallest gap between the two groups in both versions (~16 points in v0.1, ~11 points in v0.3), so A-Anchored samples on NQ are the most affected by the patch.
</details>
Figure 18: Prediction flip rate under token patching, probing attention activations of the token immediately preceding the exact answer tokens.
<details>
<summary>x42.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3.2 Models
### Overview
The image contains two side-by-side bar charts comparing prediction flip rates for two versions of the Llama-3.2 language model (1B and 3B parameter sizes) across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). Each chart compares two anchoring methods: Q-Anchored (exact_question) and A-Anchored (exact_question), represented by distinct colors.
### Components/Axes
- **X-axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (categorical)
- **Y-axis (Prediction Flip Rate)**: 0–60 (linear scale)
- **Legends**:
- Red bars: Q-Anchored (exact_question)
- Gray bars: A-Anchored (exact_question)
- **Model Labels**: Llama-3.2-1B (left chart), Llama-3.2-3B (right chart)
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **PopQA**: Q-Anchored ≈42, A-Anchored ≈3
- **TriviaQA**: Q-Anchored ≈58, A-Anchored ≈30
- **HotpotQA**: Q-Anchored ≈62, A-Anchored ≈7
- **NQ**: Q-Anchored ≈44, A-Anchored ≈12
#### Llama-3.2-3B (Right Chart)
- **PopQA**: Q-Anchored ≈56, A-Anchored ≈20
- **TriviaQA**: Q-Anchored ≈65, A-Anchored ≈28
- **HotpotQA**: Q-Anchored ≈59, A-Anchored ≈8
- **NQ**: Q-Anchored ≈52, A-Anchored ≈15
### Key Observations
1. **Q-Anchored Dominance**: Q-Anchored flip rates exceed A-Anchored rates across all datasets and models, by factors ranging from about 2× (TriviaQA) to more than 10× (PopQA on 1B, ≈42 vs. ≈3).
2. **Model Size Impact**: From 1B to 3B, Q-Anchored flip rates rise on PopQA (≈42 → ≈56), TriviaQA (≈58 → ≈65), and NQ (≈44 → ≈52), and fall slightly on HotpotQA (≈62 → ≈59).
3. **Dataset Variance**: TriviaQA and HotpotQA exhibit the highest flip rates, while PopQA and NQ show lower performance.
4. **A-Anchored Limitations**: A-Anchored methods rarely exceed 30% flip rate, with PopQA/A-Anchored at ~3% (lowest observed).
### Interpretation
The data show that Q-Anchored samples flip far more often under patching than A-Anchored samples, and that the 3B model has higher Q-Anchored flip rates than the 1B model on three of the four datasets. The disparity suggests that predictions for Q-Anchored samples depend strongly on the patched question tokens. TriviaQA and HotpotQA show the highest Q-Anchored flip rates. The low A-Anchored rates (as low as ≈3 for PopQA on 1B) indicate that predictions for these samples are largely insensitive to the patch.
</details>
<details>
<summary>x43.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3-8B and Llama-3-70B Models
### Overview
The image presents a comparative bar chart analyzing prediction flip rates for two language models (Llama-3-8B and Llama-3-70B) across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). Two anchoring strategies are compared: Q-Anchored (exact_question) and A-Anchored (exact_question), represented by red and gray bars respectively.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (categorical, evenly spaced)
- **Y-Axis (Prediction Flip Rate)**: 0–80 scale (linear, increments of 20)
- **Legend**:
- Red: Q-Anchored (exact_question)
- Gray: A-Anchored (exact_question)
- **Model Sections**:
- Left: Llama-3-8B
- Right: Llama-3-70B
### Detailed Analysis
#### Llama-3-8B (Left Section)
- **Q-Anchored (Red)**:
- PopQA: ~60
- TriviaQA: ~70
- HotpotQA: ~50
- NQ: ~60
- **A-Anchored (Gray)**:
- PopQA: ~30
- TriviaQA: ~40
- HotpotQA: ~10
- NQ: ~20
#### Llama-3-70B (Right Section)
- **Q-Anchored (Red)**:
- PopQA: ~70
- TriviaQA: ~80
- HotpotQA: ~60
- NQ: ~55
- **A-Anchored (Gray)**:
- PopQA: ~40
- TriviaQA: ~50
- HotpotQA: ~10
- NQ: ~15
### Key Observations
1. **Q-Anchored Consistently Outperforms A-Anchored**:
- For both models, Q-Anchored rates are about 1.6–6x higher than A-Anchored across all datasets.
- The largest gap is on HotpotQA for Llama-3-70B (60 vs 10); for Llama-3-8B, HotpotQA (50 vs 10) and NQ (60 vs 20) tie at about 40 points.
2. **Model Size Impact**:
- Llama-3-70B has higher Q-Anchored rates than Llama-3-8B on PopQA, TriviaQA, and HotpotQA (e.g., TriviaQA: 80 vs 70).
- NQ is the only dataset where the Q-Anchored rate is lower for Llama-3-70B (55 vs 60).
3. **Dataset Variability**:
- TriviaQA shows the highest rates for both groups and both models.
- HotpotQA has the lowest A-Anchored rates (~10 for both models); the lowest Q-Anchored rate is on HotpotQA for 8B (~50) and on NQ for 70B (~55).
### Interpretation
The data show that Q-Anchored samples have much higher prediction flip rates under patching than A-Anchored samples, which suggests that question-level information matters more for the Q-Anchored predictions. The gap is widest on HotpotQA, where A-Anchored rates stay near 10 for both models. The larger model (70B) has higher Q-Anchored rates than the 8B model on every dataset except NQ (55 vs 60).
</details>
<details>
<summary>x44.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Mistral-7B Models (v0.1 vs v0.3)
### Overview
The image presents a side-by-side comparison of prediction flip rates for two versions of the Mistral-7B model (v0.1 and v0.3) across four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. Two categories are compared for each dataset: "Q-Anchored (exact_question)" (red bars) and "A-Anchored (exact_question)" (gray bars). The y-axis represents prediction flip rate (0–80), while the x-axis lists datasets.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (repeated for both model versions).
- **Y-Axis (Prediction Flip Rate)**: Scaled from 0 to 80 in increments of 20.
- **Legend**:
- Red: Q-Anchored (exact_question)
- Gray: A-Anchored (exact_question)
- **Model Versions**:
- Left chart: Mistral-7B-v0.1
- Right chart: Mistral-7B-v0.3
### Detailed Analysis
#### Mistral-7B-v0.1
- **PopQA**:
- Q-Anchored: ~70
- A-Anchored: ~15
- **TriviaQA**:
- Q-Anchored: ~65
- A-Anchored: ~45
- **HotpotQA**:
- Q-Anchored: ~75
- A-Anchored: ~10
- **NQ**:
- Q-Anchored: ~72
- A-Anchored: ~30
#### Mistral-7B-v0.3
- **PopQA**:
- Q-Anchored: ~60
- A-Anchored: ~25
- **TriviaQA**:
- Q-Anchored: ~78
- A-Anchored: ~50
- **HotpotQA**:
- Q-Anchored: ~70
- A-Anchored: ~12
- **NQ**:
- Q-Anchored: ~68
- A-Anchored: ~32
### Key Observations
1. **Q-Anchored Consistently Outperforms A-Anchored**: Across all datasets and models, Q-Anchored (red) bars are significantly taller than A-Anchored (gray) bars, indicating higher prediction flip rates for exact-question anchoring.
2. **Model Version Differences**:
- v0.3 shows lower Q-Anchored rates than v0.1 in PopQA (~60 vs ~70), HotpotQA (~70 vs ~75), and NQ (~68 vs ~72), but higher in TriviaQA (~78 vs ~65).
- A-Anchored rates increase modestly in v0.3 (e.g., TriviaQA: ~50 vs ~45).
3. **Dataset-Specific Trends**:
- **HotpotQA**: Lowest A-Anchored rates (~10–12) suggest greater sensitivity to anchoring methods.
- **TriviaQA**: Highest A-Anchored rate in v0.3 (~50), indicating improved performance with this anchoring strategy for this dataset.
### Interpretation
The data demonstrates that anchoring predictions to exact questions (Q-Anchored) generally yields higher flip rates than anchoring to answers (A-Anchored), likely due to the specificity of question-based context. The marginal differences between model versions (v0.1 vs v0.3) suggest that updates to Mistral-7B had limited impact on this metric, though TriviaQA performance improved notably in v0.3. The stark contrast in A-Anchored rates across datasets (e.g., HotpotQA vs TriviaQA) highlights dataset-specific challenges, possibly tied to question complexity or answer ambiguity. These findings underscore the importance of anchoring strategy in model evaluation and the need for dataset-aware tuning.
</details>
Figure 19: Prediction flip rate under token patching, probing attention activations of the last exact answer token.
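The flip rate reported in these figures can be read as the percentage of samples whose probe prediction changes once patching is applied. The snippet below is a minimal sketch of that bookkeeping only, assuming binary probe outputs; the function name `prediction_flip_rate` and the example predictions are hypothetical and are not taken from the paper's code or data.

```python
from typing import Sequence


def prediction_flip_rate(before: Sequence[int], after: Sequence[int]) -> float:
    """Return the percentage of samples whose binary probe prediction changes.

    `before` holds probe predictions on the unmodified input and `after`
    holds predictions for the same samples once token patching is applied.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    if len(before) == 0:
        raise ValueError("at least one sample is required")
    flips = sum(b != a for b, a in zip(before, after))
    return 100.0 * flips / len(before)


# Hypothetical probe outputs (1 = predicted truthful, 0 = predicted hallucinated).
clean_preds = [1, 0, 1, 1, 0, 1, 0, 0]
patched_preds = [0, 0, 1, 0, 0, 1, 1, 0]

rate = prediction_flip_rate(clean_preds, patched_preds)
print(f"flip rate: {rate:.1f}%")
```

Under this reading, a higher flip rate means the probe's predictions are more sensitive to the patched tokens; it is not a measure of answer accuracy.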
<details>
<summary>x45.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3.2 Models
### Overview
The image presents a comparative bar chart analyzing prediction flip rates for two versions of the Llama-3.2 language model (1B and 3B parameter variants) across four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. The chart contrasts two anchoring methods: Q-Anchored (exact_question) and A-Anchored (exact_question), with distinct color coding for each method.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right)
- **Y-Axis (Prediction Flip Rate)**: Scaled from 0 to 80 in increments of 20
- **Legend**:
- Red bars: Q-Anchored (exact_question)
- Gray bars: A-Anchored (exact_question)
- **Model Labels**:
- Left chart: Llama-3.2-1B
- Right chart: Llama-3.2-3B
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **PopQA**:
- Q-Anchored: ~55%
- A-Anchored: ~2%
- **TriviaQA**:
- Q-Anchored: ~70%
- A-Anchored: ~30%
- **HotpotQA**:
- Q-Anchored: ~50%
- A-Anchored: ~8%
- **NQ**:
- Q-Anchored: ~75%
- A-Anchored: ~12%
#### Llama-3.2-3B (Right Chart)
- **PopQA**:
- Q-Anchored: ~60%
- A-Anchored: ~22%
- **TriviaQA**:
- Q-Anchored: ~65%
- A-Anchored: ~28%
- **HotpotQA**:
- Q-Anchored: ~55%
- A-Anchored: ~12%
- **NQ**:
- Q-Anchored: ~78%
- A-Anchored: ~32%
### Key Observations
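1. **Q-Anchored Dominance**: Q-Anchored flip rates exceed A-Anchored flip rates across all datasets and both models, typically by a factor of ~2–6x; the gap is most extreme for Llama-3.2-1B on PopQA (~55% vs ~2%).
2. **Model Size Comparison**: Llama-3.2-3B shows higher flip rates than Llama-3.2-1B in most cases (e.g., NQ Q-Anchored rises from ~75% to ~78%), but TriviaQA is lower for both Q-Anchored (~70% to ~65%) and A-Anchored (~30% to ~28%).
3. **Dataset Variance**: NQ has the highest Q-Anchored flip rate in both models. For A-Anchored, the highest rate is on TriviaQA for 1B (~30%) and on NQ for 3B (~32%), while the lowest is on PopQA for 1B (~2%) and on HotpotQA for 3B (~12%).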
4. **A-Anchored Limitations**: A-Anchored rates remain below 35% in all cases, suggesting weaker effectiveness compared to Q-Anchored.
### Interpretation
The data show that Q-Anchored samples flip far more often under patching than A-Anchored samples. The effect of model size is mixed: the 3B model has higher Q-Anchored flip rates than the 1B model on PopQA, HotpotQA, and NQ, but a slightly lower rate on TriviaQA. NQ has the highest Q-Anchored flip rates (~75–78%), although the chart alone does not explain why. The contrast between the two sample types suggests that predictions for Q-Anchored samples depend more strongly on the patched question tokens than predictions for A-Anchored samples.
</details>
<details>
<summary>x46.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3-8B and Llama-3-70B Models
### Overview
The image presents a grouped bar chart comparing prediction flip rates for two language models (Llama-3-8B and Llama-3-70B) across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). Two anchoring methods are compared: Q-Anchored (exact_question) and A-Anchored (exact_question), represented by red and gray bars respectively.
### Components/Axes
- **X-Axis**: Datasets (PopQA, TriviaQA, HotpotQA, NQ)
- **Y-Axis**: Prediction Flip Rate (%) ranging from 0 to 100
- **Legend**:
- Red: Q-Anchored (exact_question)
- Gray: A-Anchored (exact_question)
- **Model Labels**:
- Top-left: Llama-3-8B
- Top-right: Llama-3-70B
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **PopQA**:
- Q-Anchored: ~65% (red)
- A-Anchored: ~22% (gray)
- **TriviaQA**:
- Q-Anchored: ~88% (red)
- A-Anchored: ~55% (gray)
- **HotpotQA**:
- Q-Anchored: ~50% (red)
- A-Anchored: ~10% (gray)
- **NQ**:
- Q-Anchored: ~75% (red)
- A-Anchored: ~20% (gray)
#### Llama-3-70B (Right Chart)
- **PopQA**:
- Q-Anchored: ~90% (red)
- A-Anchored: ~50% (gray)
- **TriviaQA**:
- Q-Anchored: ~70% (red)
- A-Anchored: ~22% (gray)
- **HotpotQA**:
- Q-Anchored: ~60% (red)
- A-Anchored: ~12% (gray)
- **NQ**:
- Q-Anchored: ~40% (red)
- A-Anchored: ~15% (gray)
### Key Observations
1. **Q-Anchored Consistently Outperforms A-Anchored**:
- Across all datasets and models, Q-Anchored (red) bars are significantly taller than A-Anchored (gray) bars.
- Example: In Llama-3-8B TriviaQA, Q-Anchored reaches ~88% vs. A-Anchored at ~55%.
2. **Model Size Impact**:
- Scaling from 8B to 70B has no uniform effect: Q-Anchored flip rates rise on PopQA (~65% to ~90%) and HotpotQA (~50% to ~60%) but fall on TriviaQA (~88% to ~70%) and NQ (~75% to ~40%).
3. **Dataset-Specific Trends**:
- **TriviaQA** has the highest Q-Anchored flip rate for Llama-3-8B (~88%), while **PopQA** has the highest for Llama-3-70B (~90%).
- **HotpotQA** shows the lowest A-Anchored rate in both models (~10% and ~12%).
4. **Anchoring Method Effect**:
- Q-Anchored (exact_question) samples show higher flip rates, indicating that their predictions change more often when the exact question tokens are patched.
- A-Anchored (exact_question) rates are mostly at or below ~22%, except for Llama-3-8B TriviaQA (~55%) and Llama-3-70B PopQA (~50%).
### Interpretation
The data show that **Q-Anchored (exact_question)** flip rates exceed A-Anchored (exact_question) flip rates across all datasets and both model sizes, indicating that predictions for Q-Anchored samples are more sensitive to patching the exact question tokens. Model size has no uniform effect: Llama-3-70B has higher Q-Anchored rates than Llama-3-8B on PopQA and HotpotQA but lower rates on TriviaQA and NQ. A-Anchored rates are lowest on HotpotQA for both models. Because the Q-versus-A ordering holds at both scales while the size effect changes sign across datasets, sample type appears to matter more than model size for flip rate.
</details>
<details>
<summary>x47.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two side-by-side bar charts comparing prediction flip rates for two versions of the Mistral-7B language model (v0.1 and v0.3) across four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. Each dataset is evaluated using two anchoring methods: **Q-Anchored (exact_question)** (red bars) and **A-Anchored (exact_question)** (gray bars). The y-axis represents prediction flip rate (0–80%), and the x-axis lists datasets.
---
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right).
- **Y-Axis (Prediction Flip Rate)**: Scaled from 0 to 80% in increments of 20.
- **Legend**: Located at the bottom of both charts. Red = Q-Anchored (exact_question), Gray = A-Anchored (exact_question).
- **Chart Titles**:
- Left chart: "Mistral-7B-v0.1"
- Right chart: "Mistral-7B-v0.3"
---
### Detailed Analysis
#### Mistral-7B-v0.1 (Left Chart)
- **Q-Anchored (red)**:
- PopQA: ~75%
- TriviaQA: ~82%
- HotpotQA: ~72%
- NQ: ~81%
- **A-Anchored (gray)**:
- PopQA: ~40%
- TriviaQA: ~55%
- HotpotQA: ~18%
- NQ: ~45%
#### Mistral-7B-v0.3 (Right Chart)
- **Q-Anchored (red)**:
- PopQA: ~76%
- TriviaQA: ~85%
- HotpotQA: ~65%
- NQ: ~77%
- **A-Anchored (gray)**:
- PopQA: ~35%
- TriviaQA: ~52%
- HotpotQA: ~12%
- NQ: ~32%
---
### Key Observations
1. **Q-Anchored Consistently Outperforms A-Anchored**:
- Across all datasets and models, Q-Anchored rates are significantly higher than A-Anchored (e.g., TriviaQA v0.1: 82% vs. 55%).
2. **HotpotQA Anomaly**:
- A-Anchored performance drops sharply in v0.3 (18% → 12%), while Q-Anchored also declines (72% → 65%).
3. **TriviaQA Dominance**:
- TriviaQA shows the highest Q-Anchored rates for both models (82% and 85%).
4. **NQ Dataset**:
- NQ has mid-range A-Anchored rates (~45% and ~32%; HotpotQA is lowest at ~18% and ~12%) and the second-highest Q-Anchored rate in both versions (~81% and ~77%).
---
### Interpretation
- **Sensitivity of Q-Anchored Samples**: The consistently higher Q-Anchored flip rates indicate that probe predictions for these samples change more often when the exact question tokens are patched. Flip rate measures sensitivity to patching, not answer accuracy.
- **Model Version Impact**:
- v0.3 shows lower flip rates on HotpotQA for both sample types; the chart alone does not reveal the cause.
- A-Anchored rates are lower in v0.3 on all four datasets, most visibly on NQ (~45% to ~32%).
- **Dataset Sensitivity**: TriviaQA has the highest flip rates for both sample types in both versions, while HotpotQA has the lowest.
---
### Spatial Grounding & Verification
- **Legend Position**: Bottom center, clearly associating colors with anchoring methods.
- **Bar Alignment**: Red (Q-Anchored) bars are consistently taller than gray (A-Anchored) bars across all datasets and models.
- **Trend Verification**:
- Q-Anchored rises slightly for PopQA and TriviaQA in v0.3, while HotpotQA and NQ decline.
- A-Anchored rates decline on all four datasets in v0.3, including HotpotQA (~18% to ~12%).
---
### Conclusion
The data show that Q-Anchored (exact_question) samples have substantially higher flip rates than A-Anchored samples, meaning their probe predictions are more sensitive to patching. The decline in HotpotQA flip rates in v0.3 warrants further investigation. TriviaQA has the highest flip rates for both sample types in both versions.
</details>
Figure 20: Prediction flip rate under token patching, probing MLP activations of the final token.
<details>
<summary>x48.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3.2 Models
### Overview
The image presents a comparative bar chart analyzing prediction flip rates for two versions of the Llama-3.2 language model (1B and 3B parameter sizes) across four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. Two anchoring methods are compared: Q-Anchored (exact_question) and A-Anchored (exact_question), represented by red and gray bars respectively.
### Components/Axes
- **X-Axis (Datasets)**:
- PopQA (leftmost)
- TriviaQA
- HotpotQA
- NQ (rightmost)
- **Y-Axis (Prediction Flip Rate)**:
- Scale: 0 to 50 (increments of 10)
- **Legend**:
- Red: Q-Anchored (exact_question)
- Gray: A-Anchored (exact_question)
- **Model Versions**:
- Left section: Llama-3.2-1B
- Right section: Llama-3.2-3B
### Detailed Analysis
#### Llama-3.2-1B (Left Section)
- **PopQA**:
- Q-Anchored: ~50
- A-Anchored: ~5
- **TriviaQA**:
- Q-Anchored: ~45
- A-Anchored: ~20
- **HotpotQA**:
- Q-Anchored: ~30
- A-Anchored: ~3
- **NQ**:
- Q-Anchored: ~40
- A-Anchored: ~15
#### Llama-3.2-3B (Right Section)
- **PopQA**:
- Q-Anchored: ~30
- A-Anchored: ~13
- **TriviaQA**:
- Q-Anchored: ~50
- A-Anchored: ~17
- **HotpotQA**:
- Q-Anchored: ~35
- A-Anchored: ~13
- **NQ**:
- Q-Anchored: ~47
- A-Anchored: ~19
### Key Observations
1. **Q-Anchored Dominance**:
- Q-Anchored flip rates exceed A-Anchored flip rates across all datasets and model sizes, by roughly 2x to 10x.
2. **Model Size Impact**:
- Relative to 1B, Llama-3.2-3B shows a lower Q-Anchored rate on PopQA (~50 to ~30, about -40%) but higher rates on TriviaQA (~45 to ~50, +11%), HotpotQA (~30 to ~35, +17%), and NQ (~40 to ~47, +18%).
3. **A-Anchored Variability**:
- A-Anchored rates rise from 1B to 3B on PopQA (~5 to ~13), HotpotQA (~3 to ~13), and NQ (~15 to ~19), and dip slightly on TriviaQA (~20 to ~17).
4. **Dataset-Specific Trends**:
- The largest gap between the two sample types is on PopQA for 1B (~45 points) and on TriviaQA for 3B (~33 points); the NQ gap is ~25 points for 1B and ~28 points for 3B.
### Interpretation
The data show that Q-Anchored (exact_question) samples flip far more often under patching than A-Anchored (exact_question) samples; a higher flip rate indicates less stable predictions, not more stable ones. Model size has no uniform effect: Llama-3.2-3B has a lower Q-Anchored rate than 1B on PopQA but higher rates on the other three datasets. A-Anchored rates stay at or below ~20 for both model sizes, although they rise noticeably on PopQA and HotpotQA in the 3B model.
</details>
<details>
<summary>x49.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3-8B and Llama-3-70B Models
### Overview
The image is a grouped bar chart comparing prediction flip rates (in percentage) for two language models, **Llama-3-8B** and **Llama-3-70B**, across four datasets: **PopQA**, **TriviaQA**, **HotpotQA**, and **NQ**. Two anchoring methods are compared: **Q-Anchored (exact_question)** (red bars) and **A-Anchored (exact_question)** (gray bars).
### Components/Axes
- **X-axis**: Datasets (PopQA, TriviaQA, HotpotQA, NQ).
- **Y-axis**: Prediction Flip Rate (%) ranging from 0 to 70% in 20% increments.
- **Legend**:
- Red: Q-Anchored (exact_question)
- Gray: A-Anchored (exact_question)
- **Models**:
- Llama-3-8B (left chart)
- Llama-3-70B (right chart)
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **PopQA**:
- Q-Anchored: ~55%
- A-Anchored: ~10%
- **TriviaQA**:
- Q-Anchored: ~65%
- A-Anchored: ~40%
- **HotpotQA**:
- Q-Anchored: ~40%
- A-Anchored: ~10%
- **NQ**:
- Q-Anchored: ~65%
- A-Anchored: ~20%
#### Llama-3-70B (Right Chart)
- **PopQA**:
- Q-Anchored: ~65%
- A-Anchored: ~15%
- **TriviaQA**:
- Q-Anchored: ~55%
- A-Anchored: ~20%
- **HotpotQA**:
- Q-Anchored: ~50%
- A-Anchored: ~15%
- **NQ**:
- Q-Anchored: ~45%
- A-Anchored: ~25%
### Key Observations
1. **Q-Anchored Consistently Outperforms A-Anchored**:
- Across all datasets and models, Q-Anchored flip rates are significantly higher than A-Anchored rates.
- Example: Llama-3-8B on NQ shows a 65% (Q) vs. 20% (A) gap.
2. **Model Size Impact**:
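- Moving from Llama-3-8B to Llama-3-70B has mixed effects: Q-Anchored rates rise on **PopQA** (~55% to ~65%) and **HotpotQA** (~40% to ~50%) but fall on **TriviaQA** (~65% to ~55%) and **NQ** (~65% to ~45%).
3. **Dataset Variability**:
- For Llama-3-8B, **TriviaQA** and **NQ** share the highest Q-Anchored rate (~65%); for Llama-3-70B, **PopQA** is highest (~65%) and **NQ** is lowest (~45%).
- For Llama-3-8B, the gap between Q and A anchoring is largest on **PopQA** and **NQ** (~45 points each); **HotpotQA** has a ~30-point gap.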
### Interpretation
- **Anchoring Method Effect**: Q-Anchored (exact_question) samples show higher flip rates, meaning their probe predictions change more often under patching; a higher flip rate reflects greater sensitivity, not greater stability.
- **Model Scaling**: The gap between Q-Anchored and A-Anchored rates changes unevenly with scale. It widens on PopQA (~45 to ~50 points), TriviaQA (~25 to ~35 points), and HotpotQA (~30 to ~35 points), and narrows on NQ (~45 to ~20 points).
- **Dataset-Specific Behavior**: The **NQ** Q-Anchored rate is among the highest for Llama-3-8B (~65%) but the lowest of the four datasets for Llama-3-70B (~45%), so the dataset effect is not stable across model sizes.
### Spatial Grounding & Trend Verification
- **Legend Placement**: Bottom-left, clearly labeled with color-coded anchors.
- **Bar Trends**:
- Q-Anchored bars are taller than A-Anchored bars for every dataset in both charts.
- Llama-3-70B’s bars are not uniformly shorter or taller than Llama-3-8B’s; the direction of change depends on the dataset.
- **Color Consistency**: Red (Q) and gray (A) bars match legend labels without ambiguity.
### Content Details
- **Approximate Values**:
- Llama-3-8B:
- PopQA: Q=55%, A=10%
- TriviaQA: Q=65%, A=40%
- HotpotQA: Q=40%, A=10%
- NQ: Q=65%, A=20%
- Llama-3-70B:
- PopQA: Q=65%, A=15%
- TriviaQA: Q=55%, A=20%
- HotpotQA: Q=50%, A=15%
- NQ: Q=45%, A=25%
### Notable Outliers
- **Llama-3-8B on TriviaQA**: A-Anchored rate (~40%) is unusually high compared to other datasets, suggesting dataset-specific model behavior.
- **Llama-3-70B on NQ**: Q-Anchored rate (~45%) is notably lower than Llama-3-8B’s (~65%), whereas on PopQA and HotpotQA the larger model's rate is higher.
### Final Notes
The chart shows Q-Anchored flip rates exceeding A-Anchored flip rates in every scenario. Model scale changes the individual rates in dataset-dependent directions but does not eliminate the gap between the two sample types.
</details>
<details>
<summary>x50.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Mistral-7B Models (v0.1 and v0.3)
### Overview
The image compares prediction flip rates for two versions of the Mistral-7B language model (v0.1 and v0.3) across four datasets (PopQA, TriviaQA, HotpotQA, NQ). Two anchoring strategies are evaluated: **Q-Anchored (exact_question)** and **A-Anchored (exact_question)**, represented by red and gray bars respectively. The y-axis measures prediction flip rate as a percentage.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right).
- **Y-Axis (Prediction Flip Rate)**: 0% to 80% in 20% increments.
- **Legend**:
- Red = Q-Anchored (exact_question)
- Gray = A-Anchored (exact_question)
- **Model Versions**:
- Left subplot = Mistral-7B-v0.1
- Right subplot = Mistral-7B-v0.3
### Detailed Analysis
#### Mistral-7B-v0.1
- **PopQA**:
- Q-Anchored: ~70% (red)
- A-Anchored: ~25% (gray)
- **TriviaQA**:
- Q-Anchored: ~60% (red)
- A-Anchored: ~50% (gray)
- **HotpotQA**:
- Q-Anchored: ~40% (red)
- A-Anchored: ~10% (gray)
- **NQ**:
- Q-Anchored: ~70% (red)
- A-Anchored: ~20% (gray)
#### Mistral-7B-v0.3
- **PopQA**:
- Q-Anchored: ~70% (red)
- A-Anchored: ~15% (gray)
- **TriviaQA**:
- Q-Anchored: ~70% (red)
- A-Anchored: ~40% (gray)
- **HotpotQA**:
- Q-Anchored: ~50% (red)
- A-Anchored: ~10% (gray)
- **NQ**:
- Q-Anchored: ~60% (red)
- A-Anchored: ~45% (gray)
### Key Observations
1. **Q-Anchored Dominance**: Across all datasets and both versions, Q-Anchored flip rates exceed A-Anchored rates, by factors ranging from ~1.2× (v0.1 TriviaQA) to ~5× (v0.3 HotpotQA).
2. **Version-Specific Trends**:
- **v0.1**: Largest gap between anchoring strategies in NQ (70% vs. 20%), followed by PopQA (70% vs. 25%); smallest gap in TriviaQA (60% vs. 50%).
- **v0.3**: The gap narrows sharply in NQ (60% vs. 45%) but widens in TriviaQA (70% vs. 40%) and PopQA (70% vs. 15%).
3. **Dataset Variability**:
- PopQA has the highest Q-Anchored flip rate (~70%) in both versions, tied with NQ in v0.1 and with TriviaQA in v0.3.
- NQ exhibits the largest A-Anchored increase in v0.3 (~20% to ~45%, +25 percentage points).
### Interpretation
The data show that **Q-Anchored (exact_question)** samples flip more often than A-Anchored samples in every dataset and both versions; a higher flip rate indicates greater sensitivity to patching rather than greater confidence. In **Mistral-7B-v0.3**, the A-Anchored rate on NQ rises from ~20% to ~45%, shrinking the gap between the two sample types from ~50 to ~15 points, whereas the gap on TriviaQA widens from ~10 to ~30 points. The chart alone does not establish what drives these version differences.
</details>
Figure 21: Prediction flip rate under token patching, probing MLP activations of the token immediately preceding the exact answer tokens.
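Statements about "gaps" and "ratios" between Q-Anchored and A-Anchored bars are easy to get wrong by eye, so a small tabulation can help. The sketch below uses the approximate bar heights listed above for the Llama-3.2 panel of Figure 21; the values are visual estimates from the figure description rather than exact results, and the helper names `summarize` and `widest_gap` are hypothetical.

```python
# Approximate (Q-Anchored, A-Anchored) flip rates read from the Llama-3.2
# panel of Figure 21. These are visual estimates, not exact measurements.
flip_rates = {
    "Llama-3.2-1B": {"PopQA": (50, 5), "TriviaQA": (45, 20),
                     "HotpotQA": (30, 3), "NQ": (40, 15)},
    "Llama-3.2-3B": {"PopQA": (30, 13), "TriviaQA": (50, 17),
                     "HotpotQA": (35, 13), "NQ": (47, 19)},
}


def summarize(rates):
    """Compute the absolute gap (points) and ratio between Q and A rates."""
    return {
        model: {
            dataset: {"gap": q - a, "ratio": round(q / a, 1)}
            for dataset, (q, a) in per_dataset.items()
        }
        for model, per_dataset in rates.items()
    }


def widest_gap(summary, model):
    """Return the dataset with the largest Q-minus-A gap for one model."""
    per_dataset = summary[model]
    return max(per_dataset, key=lambda dataset: per_dataset[dataset]["gap"])


summary = summarize(flip_rates)
for model in summary:
    dataset = widest_gap(summary, model)
    print(model, dataset, summary[model][dataset])
```

With these estimates, the widest gap falls on PopQA for the 1B model and on TriviaQA for the 3B model.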
<details>
<summary>x51.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3.2 Models
### Overview
The image contains two side-by-side bar charts comparing prediction flip rates for Llama-3.2-1B and Llama-3.2-3B models across four datasets: PopQA, TriviaQA, HotpotQA, and NQ. Four bar colors represent different anchoring strategies: Q-Anchored (exact_question), Q-Anchored (random), A-Anchored (exact_question), and A-Anchored (random). The y-axis shows prediction flip rate percentages (0-80%), while the x-axis lists datasets.
### Components/Axes
- **X-axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right)
- **Y-axis (Prediction Flip Rate)**: 0-80% in 20% increments
- **Legend (bottom)**:
- Pink: Q-Anchored (exact_question)
- Red: Q-Anchored (random)
- Gray: A-Anchored (exact_question)
- Black: A-Anchored (random)
### Detailed Analysis
#### Llama-3.2-1B Chart
- **PopQA**:
- Q-Anchored (exact_question): ~50% (pink)
- Q-Anchored (random): ~10% (red)
- A-Anchored (exact_question): ~25% (gray)
- A-Anchored (random): ~2% (black)
- **TriviaQA**:
- Q-Anchored (exact_question): ~65% (pink)
- Q-Anchored (random): ~12% (red)
- A-Anchored (exact_question): ~28% (gray)
- A-Anchored (random): ~3% (black)
- **HotpotQA**:
- Q-Anchored (exact_question): ~75% (pink)
- Q-Anchored (random): ~15% (red)
- A-Anchored (exact_question): ~10% (gray)
- A-Anchored (random): ~1% (black)
- **NQ**:
- Q-Anchored (exact_question): ~30% (pink)
- Q-Anchored (random): ~2% (red)
- A-Anchored (exact_question): ~8% (gray)
- A-Anchored (random): ~1% (black)
#### Llama-3.2-3B Chart
- **PopQA**:
- Q-Anchored (exact_question): ~60% (pink)
- Q-Anchored (random): ~15% (red)
- A-Anchored (exact_question): ~20% (gray)
- A-Anchored (random): ~3% (black)
- **TriviaQA**:
- Q-Anchored (exact_question): ~70% (pink)
- Q-Anchored (random): ~18% (red)
- A-Anchored (exact_question): ~22% (gray)
- A-Anchored (random): ~4% (black)
- **HotpotQA**:
- Q-Anchored (exact_question): ~78% (pink)
- Q-Anchored (random): ~20% (red)
- A-Anchored (exact_question): ~15% (gray)
- A-Anchored (random): ~5% (black)
- **NQ**:
- Q-Anchored (exact_question): ~50% (pink)
- Q-Anchored (random): ~8% (red)
- A-Anchored (exact_question): ~15% (gray)
- A-Anchored (random): ~2% (black)
### Key Observations
1. **Model Size Impact**: Llama-3.2-3B shows higher flip rates than Llama-3.2-1B for Q-Anchored (exact_question) and for both random conditions across all datasets. A-Anchored (exact_question) is lower on PopQA (~25% to ~20%) and TriviaQA (~28% to ~22%) and higher on HotpotQA and NQ.
2. **Anchoring Strategy Trends**:
- Q-Anchored (exact_question) has the highest flip rates (~30–78%).
- Q-Anchored (random) shows low rates (~2–20%).
- A-Anchored (exact_question) ranges from ~8% to ~28%, and A-Anchored (random) is lowest (~1–5%).
3. **Dataset Variance**:
- HotpotQA has the highest flip rates for Q-Anchored (exact_question) in both models.
- NQ has the lowest flip rates across all strategies.
### Interpretation
The data suggests that:
- The larger model (3B vs 1B) shows higher flip rates for Q-Anchored samples, while the change for A-Anchored (exact_question) depends on the dataset.
- Q-Anchored (exact_question) produces the most flips, consistent with these samples' predictions depending strongly on the exact question tokens.
- Random-token patching (both Q and A) flips few predictions, indicating that the effect is specific to the exact question tokens rather than a generic result of perturbing the input.
- The NQ dataset (Natural Questions) shows the lowest Q-Anchored flip rates; the chart alone does not reveal why.
The charts highlight how strongly predictions for Q-Anchored samples depend on the exact question tokens. The consistently low flip rates under random patching show that where the patch is applied matters far more than the mere presence of a perturbation.
</details>
<details>
<summary>x52.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Llama-3-8B and Llama-3-70B Models
### Overview
The image compares prediction flip rates across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ) for two Llama-3 models (8B and 70B parameters). Four anchoring strategies are visualized: Q-Anchored (exact_question), A-Anchored (exact_question), Q-Anchored (random), and A-Anchored (random). The y-axis represents prediction flip rate (0-80%), while the x-axis categorizes datasets.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right)
- **Y-Axis (Prediction Flip Rate)**: 0-80% in 20% increments
- **Legend (Bottom Center)**:
- Red: Q-Anchored (exact_question)
- Gray: A-Anchored (exact_question)
- Dark Red: Q-Anchored (random)
- Dark Gray: A-Anchored (random)
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **PopQA**:
- Q-Anchored (exact): ~75%
- A-Anchored (exact): ~38%
- Q-Anchored (random): ~8%
- A-Anchored (random): ~2%
- **TriviaQA**:
- Q-Anchored (exact): ~78%
- A-Anchored (exact): ~35%
- Q-Anchored (random): ~10%
- A-Anchored (random): ~3%
- **HotpotQA**:
- Q-Anchored (exact): ~70%
- A-Anchored (exact): ~12%
- Q-Anchored (random): ~9%
- A-Anchored (random): ~4%
- **NQ**:
- Q-Anchored (exact): ~72%
- A-Anchored (exact): ~20%
- Q-Anchored (random): ~5%
- A-Anchored (random): ~1%
#### Llama-3-70B (Right Chart)
- **PopQA**:
- Q-Anchored (exact): ~75%
- A-Anchored (exact): ~30%
- Q-Anchored (random): ~6%
- A-Anchored (random): ~1%
- **TriviaQA**:
- Q-Anchored (exact): ~78%
- A-Anchored (exact): ~35%
- Q-Anchored (random): ~18%
- A-Anchored (random): ~4%
- **HotpotQA**:
- Q-Anchored (exact): ~72%
- A-Anchored (exact): ~10%
- Q-Anchored (random): ~19%
- A-Anchored (random): ~5%
- **NQ**:
- Q-Anchored (exact): ~58% (↓ 14% vs 8B)
- A-Anchored (exact): ~22%
- Q-Anchored (random): ~15%
- A-Anchored (random): ~6%
### Key Observations
1. **Q-Anchored (exact_question)** consistently shows the highest flip rates across all datasets and models.
2. **Model Size Impact**: For Q-Anchored (exact), Llama-3-70B matches Llama-3-8B on PopQA and TriviaQA and is slightly higher on HotpotQA (~72% vs ~70%); on NQ it is ~14 points lower (~58% vs ~72%).
3. **Random Anchoring**: Both Q and A random anchoring methods show much lower flip rates (<20%), indicating that patching random tokens rarely changes the prediction.
4. **A-Anchored (exact_question)** is higher than the random conditions but trails Q-Anchored (exact) by roughly 36–62 percentage points.
5. **NQ Dataset Anomaly**: Llama-3-70B shows a notable 14% drop in Q-Anchored (exact) performance compared to 8B, contrary to expectations for larger models.
### Interpretation
The data demonstrates that:
- **Anchoring Strategy Matters More Than Model Size**: Q-Anchored (exact_question) yields the highest flip rates at both model sizes, so the sample type and the patched tokens account for far more of the variation than parameter count does.
- **No Clear Gain From Scale**: The 70B model's Q-Anchored (exact) rates are close to the 8B model's on three datasets and lower on NQ; the chart alone does not identify the cause.
- **Random Patching Has Little Effect**: Both random conditions stay below ~20%, indicating that the large flip rates come from patching the exact question tokens specifically.
- **Dataset-Specific Behavior**: The NQ drop at 70B is dataset-specific and would need further analysis of the dataset and model to explain.
This analysis underscores the role of the exact question tokens in the probe's predictions for Q-Anchored samples.
</details>
<details>
<summary>x53.png Details</summary>

### Visual Description
## Bar Chart: Prediction Flip Rate Comparison for Mistral-7B-v0.1 and v0.3
### Overview
The image contains two side-by-side bar charts comparing prediction flip rates across four datasets (PopQA, TriviaQA, HotpotQA, NQ) for two versions of the Mistral-7B model (v0.1 and v0.3). Each chart uses color-coded bars to represent four anchoring methods: Q-Anchored (exact_question), Q-Anchored (random), A-Anchored (exact_question), and A-Anchored (random). The y-axis measures prediction flip rate (0–80), while the x-axis lists datasets.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (left to right).
- **Y-Axis (Prediction Flip Rate)**: Scale from 0 to 80.
- **Legend**: Located at the bottom, with four color-coded categories:
- Pink: Q-Anchored (exact_question)
- Dark Red: Q-Anchored (random)
- Gray: A-Anchored (exact_question)
- Dark Gray: A-Anchored (random)
### Detailed Analysis
#### Mistral-7B-v0.1
- **PopQA**:
- Q-Anchored (exact_question): ~80
- Q-Anchored (random): ~5
- A-Anchored (exact_question): ~35
- A-Anchored (random): ~1
- **TriviaQA**:
- Q-Anchored (exact_question): ~75
- Q-Anchored (random): ~10
- A-Anchored (exact_question): ~25
- A-Anchored (random): ~3
- **HotpotQA**:
- Q-Anchored (exact_question): ~80
- Q-Anchored (random): ~12
- A-Anchored (exact_question): ~5
- A-Anchored (random): ~4
- **NQ**:
- Q-Anchored (exact_question): ~78
- Q-Anchored (random): ~8
- A-Anchored (exact_question): ~45
- A-Anchored (random): ~2
#### Mistral-7B-v0.3
- **PopQA**:
- Q-Anchored (exact_question): ~70
- Q-Anchored (random): ~7
- A-Anchored (exact_question): ~20
- A-Anchored (random): ~1
- **TriviaQA**:
- Q-Anchored (exact_question): ~80
- Q-Anchored (random): ~8
- A-Anchored (exact_question): ~25
- A-Anchored (random): ~2
- **HotpotQA**:
- Q-Anchored (exact_question): ~75
- Q-Anchored (random): ~10
- A-Anchored (exact_question): ~10
- A-Anchored (random): ~3
- **NQ**:
- Q-Anchored (exact_question): ~75
- Q-Anchored (random): ~9
- A-Anchored (exact_question): ~25
- A-Anchored (random): ~2
### Key Observations
1. **Dominance of Q-Anchored (exact_question)**: Across all datasets and versions, Q-Anchored (exact_question) consistently shows the highest prediction flip rates, ranging from ~70 to ~80.
2. **Random Anchoring Performance**: Q-Anchored (random) and A-Anchored (random) methods have the lowest flip rates, at or below ~12.
3. **Version Comparison**: Q-Anchored (exact_question) is lower in v0.3 on PopQA, HotpotQA, and NQ but higher on TriviaQA; A-Anchored (exact_question) falls on PopQA and NQ, is unchanged on TriviaQA, and rises on HotpotQA.
4. **NQ Dataset Anomaly**: In v0.1, A-Anchored (exact_question) for NQ reaches ~45, the highest among A-Anchored methods. In v0.3, TriviaQA’s Q-Anchored (exact_question) peaks at ~80.
### Interpretation
The data indicate that which tokens are patched strongly affects the prediction flip rate. Patching the exact question tokens yields much higher flip rates than patching random tokens, clearly so for Q-Anchored samples and, on most datasets, for A-Anchored samples, implying sensitivity to those specific tokens rather than to perturbation in general. Differences between v0.1 and v0.3 are mixed in direction and stay within ~20 points. In v0.1, NQ has the highest A-Anchored (exact_question) rate (~45), though it remains well below the corresponding Q-Anchored rate (~78).
</details>
Figure 22: Prediction flip rate under token patching, probing MLP activations of the last exact answer token.
Appendix E Answer-Only Input
<details>
<summary>x54.png Details</summary>

### Visual Description
## Bar Chart: Performance Comparison of Llama-3.2-1B and Llama-3.2-3B Models
### Overview
The image contains two side-by-side bar charts comparing the performance of two language models (Llama-3.2-1B and Llama-3.2-3B) across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). Performance is measured using the metric "-ΔP" (negative delta P), with separate bars for "Q-Anchored" (red) and "A-Anchored" (gray) approaches. The charts highlight differences in performance between model sizes and anchoring strategies.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (categorical, left to right).
- **Y-Axis (-ΔP)**: Numerical scale from 0 to 60, with increments of 20.
- **Legend**: Located at the bottom center, with red representing "Q-Anchored" and gray representing "A-Anchored".
- **Model Labels**:
- Left chart: "Llama-3.2-1B" (top-left).
- Right chart: "Llama-3.2-3B" (top-right).
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **PopQA**:
- Q-Anchored: ~45
- A-Anchored: ~2
- **TriviaQA**:
- Q-Anchored: ~58
- A-Anchored: ~16
- **HotpotQA**:
- Q-Anchored: ~62
- A-Anchored: ~16
- **NQ**:
- Q-Anchored: ~22
- A-Anchored: ~8
#### Llama-3.2-3B (Right Chart)
- **PopQA**:
- Q-Anchored: ~23
- A-Anchored: ~5
- **TriviaQA**:
- Q-Anchored: ~63
- A-Anchored: ~9
- **HotpotQA**:
- Q-Anchored: ~57
- A-Anchored: ~18
- **NQ**:
- Q-Anchored: ~32
- A-Anchored: ~10
### Key Observations
1. **Q-Anchored Dominance**:
- Q-Anchored consistently outperforms A-Anchored across all datasets and models.
- Llama-3.2-1B achieves the highest Q-Anchored performance on HotpotQA (~62), while Llama-3.2-3B excels on TriviaQA (~63).
2. **Model Size Impact**:
- Llama-3.2-3B generally outperforms Llama-3.2-1B in Q-Anchored for TriviaQA and NQ but underperforms on PopQA and HotpotQA.
- A-Anchored values change only slightly with the larger model (e.g., HotpotQA: 16 → 18).
3. **NQ and PopQA**:
- On NQ, Llama-3.2-3B’s Q-Anchored value (~32) is higher than Llama-3.2-1B’s (~22). The sharpest departure from a larger-is-higher trend is PopQA, where the value falls from ~45 to ~23.
4. **A-Anchored Variability**:
- A-Anchored values are consistently low (at most ~18). They rise slightly with the larger model on PopQA (2 → 5), HotpotQA (16 → 18), and NQ (8 → 10), but fall on TriviaQA (16 → 9).
### Interpretation
The data suggests that **-ΔP is substantially larger for Q-Anchored than for A-Anchored instances** in both models, with magnitudes that depend heavily on the dataset. Model size has no uniform effect: Q-Anchored values rise from 1B to 3B on TriviaQA and NQ but fall on PopQA and HotpotQA. A-Anchored values stay low (roughly 2–18) across all datasets and both models, leaving a persistent gap relative to Q-Anchored.
</details>
<details>
<summary>x55.png Details</summary>

### Visual Description
## Bar Chart: Performance Comparison of Llama-3-8B and Llama-3-70B Models
### Overview
The image contains two side-by-side bar charts comparing the performance of two language models (Llama-3-8B and Llama-3-70B) across four question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ). Performance is measured using the metric "-ΔP" (negative delta P), with two anchoring methods: Q-Anchored (red bars) and A-Anchored (gray bars). The charts highlight differences in performance between model sizes and anchoring strategies.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (categorical, evenly spaced).
- **Y-Axis (-ΔP)**: Numerical scale from 0 to 60 (linear increments of 20).
- **Legend**:
- Red bars = Q-Anchored
- Gray bars = A-Anchored
- **Chart Titles**:
- Left: "Llama-3-8B"
- Right: "Llama-3-70B"
### Detailed Analysis
#### Llama-3-8B Chart
- **PopQA**:
- Q-Anchored ≈ 55
- A-Anchored ≈ 5
- **TriviaQA**:
- Q-Anchored ≈ 65
- A-Anchored ≈ 15
- **HotpotQA**:
- Q-Anchored ≈ 55
- A-Anchored ≈ 20
- **NQ**:
- Q-Anchored ≈ 25
- A-Anchored ≈ 5
#### Llama-3-70B Chart
- **PopQA**:
- Q-Anchored ≈ 50
- A-Anchored ≈ 3
- **TriviaQA**:
- Q-Anchored ≈ 65
- A-Anchored ≈ 25
- **HotpotQA**:
- Q-Anchored ≈ 45
- A-Anchored ≈ 22
- **NQ**:
- Q-Anchored ≈ 45
- A-Anchored ≈ 5
### Key Observations
1. **Q-Anchored Dominance**: Q-Anchored consistently outperforms A-Anchored across all datasets and models (e.g., TriviaQA: 65 vs. 15 for Llama-3-8B).
2. **Model Size Impact**: Llama-3-70B matches Llama-3-8B on TriviaQA (~65 in both) and is clearly higher on NQ (~25 → ~45), but is lower on PopQA (~55 → ~50) and HotpotQA (~55 → ~45).
3. **A-Anchored Variability**: A-Anchored values vary significantly by dataset, with TriviaQA showing the highest value (~25 for Llama-3-70B).
4. **NQ Dataset**: NQ has the lowest Q-Anchored value for Llama-3-8B (~25), whereas for Llama-3-70B it rises to ~45, level with HotpotQA.
### Interpretation
The data shows larger -ΔP for Q-Anchored than for A-Anchored instances in both models, with TriviaQA giving the largest Q-Anchored values (~65). Model size has a mixed effect: Llama-3-70B is higher than Llama-3-8B on NQ, equal on TriviaQA, and lower on PopQA and HotpotQA. A-Anchored values are more dataset-dependent: they stay near 3–5 on PopQA and NQ but reach ~15–25 on TriviaQA and HotpotQA, with the largest increase from 8B to 70B on TriviaQA (~15 → ~25).
</details>
<details>
<summary>x56.png Details</summary>

### Visual Description
## Bar Chart: Mistral-7B Model Performance Comparison (v0.1 vs v0.3)
### Overview
The image contains two side-by-side bar charts comparing the performance of the Mistral-7B model (versions v0.1 and v0.3) across four question-answering datasets: PopQA, TriviaQA, HotpotQA, and NQ. Performance is measured using -ΔP (negative delta P) values, with separate bars for Q-Anchored and A-Anchored methods. The charts highlight differences in performance between model versions and anchoring approaches.
### Components/Axes
- **X-Axis (Datasets)**: PopQA, TriviaQA, HotpotQA, NQ (categorical, left to right)
- **Y-Axis (-ΔP)**: Numerical scale from 0 to 80 (linear)
- **Legend**:
- Red bars = Q-Anchored
- Gray bars = A-Anchored
- **Chart Titles**:
- Left: "Mistral-7B-v0.1"
- Right: "Mistral-7B-v0.3"
### Detailed Analysis
#### Mistral-7B-v0.1
- **Q-Anchored**:
- PopQA: ~78
- TriviaQA: ~72
- HotpotQA: ~45
- NQ: ~44
- **A-Anchored**:
- PopQA: ~22
- TriviaQA: ~20
- HotpotQA: ~20
- NQ: ~3
#### Mistral-7B-v0.3
- **Q-Anchored**:
- PopQA: ~78
- TriviaQA: ~58
- HotpotQA: ~47
- NQ: ~54
- **A-Anchored**:
- PopQA: ~18
- TriviaQA: ~5
- HotpotQA: ~22
- NQ: ~4
### Key Observations
1. **Q-Anchored Dominance**: Q-Anchored consistently outperforms A-Anchored in both model versions across all datasets.
2. **Version-Specific Trends**:
- **v0.1**: Q-Anchored shows strong performance (72-78 range) in TriviaQA and PopQA.
- **v0.3**: Q-Anchored performance drops in TriviaQA (72 → 58) but improves in NQ (44 → 54).
3. **A-Anchored Variability**:
- TriviaQA shows a drastic drop (20 → 5) between versions.
- HotpotQA A-Anchored improves slightly (20 → 22) in v0.3.
4. **NQ Dataset**: A-Anchored performs poorly (<5) in both versions, suggesting limited effectiveness for this dataset.
### Interpretation
The data shows consistently larger -ΔP for Q-Anchored than for A-Anchored instances across datasets and model versions. Q-Anchored values on PopQA are nearly identical in both versions (~78), while TriviaQA drops from ~72 to ~58 and NQ rises from ~44 to ~54 in v0.3. A-Anchored values are smallest on NQ (≤5) in both versions, and the HotpotQA A-Anchored increase in v0.3 (20 → 22) is small.
</details>
Figure 23: $-\Delta\mathrm{P}$ with only the LLM-generated answer. Q-Anchored instances exhibit substantial shifts, whereas A-Anchored instances remain stable, confirming that A-Anchored truthfulness encoding relies on information in the LLM-generated answer itself.
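The quantity plotted in Figure 23 can be illustrated with a minimal sketch. It assumes, since this section does not define it, that $\Delta\mathrm{P}$ is the change in a probe's predicted truthfulness probability when the input is reduced to the LLM-generated answer alone. The function names and all probability values below are hypothetical and are not taken from the paper's results.

```python
from statistics import mean


def neg_delta_p(p_full: float, p_answer_only: float) -> float:
    """Return -ΔP in percentage points for a single instance.

    Assumption: ΔP = p_answer_only - p_full, where both arguments are
    probe-predicted probabilities in [0, 1].
    """
    delta_p = p_answer_only - p_full
    return -100.0 * delta_p


def mean_neg_delta_p(pairs):
    """Average -ΔP over a list of (p_full, p_answer_only) pairs."""
    return mean([neg_delta_p(p_full, p_ans) for p_full, p_ans in pairs])


# Hypothetical probe outputs, chosen only to illustrate the computation.
q_anchored = [(0.90, 0.35), (0.80, 0.30)]  # large shift without the question
a_anchored = [(0.85, 0.80), (0.75, 0.72)]  # nearly unchanged without the question

print("Q-Anchored mean -ΔP:", mean_neg_delta_p(q_anchored))
print("A-Anchored mean -ΔP:", mean_neg_delta_p(a_anchored))
```

Under this reading, a large value means the probe's prediction depends on the question, and a value near zero means the answer alone carries the signal.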
Appendix F Answer Accuracy
<details>
<summary>x57.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Llama-3.2 Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer model layers for two Llama-3.2 variants (1B and 3B parameter sizes). Each graph shows multiple data series representing different question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ) and anchoring methods (Q-Anchored vs A-Anchored). The graphs use color-coded lines with distinct line styles to differentiate datasets and anchoring approaches.
### Components/Axes
- **X-axis (Layer)**:
- Left chart: 0–15 (Llama-3.2-1B)
- Right chart: 0–25 (Llama-3.2-3B)
- Discrete integer values representing transformer layers
- **Y-axis (Answer Accuracy)**:
- Scale: 0–100% (both charts)
- Continuous percentage values
- **Legends**:
- Positioned at bottom of each chart
- Color-coded with line styles:
- **Solid lines**: Q-Anchored methods
- **Dashed lines**: A-Anchored methods
- Datasets:
- PopQA (blue/orange)
- TriviaQA (green/brown)
- HotpotQA (purple/gray)
- NQ (pink/red)
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **Q-Anchored (PopQA)**:
- Solid blue line
- Starts at ~85% accuracy (Layer 0), dips to ~40% (Layer 5), then fluctuates between 50–70%
- **A-Anchored (PopQA)**:
- Dashed orange line
- Starts at ~50%, peaks at ~65% (Layer 10), then declines to ~40%
- **Q-Anchored (TriviaQA)**:
- Solid green line
- Starts at ~60%, peaks at ~80% (Layer 12), then declines to ~50%
- **A-Anchored (TriviaQA)**:
- Dashed brown line
- Starts at ~40%, peaks at ~60% (Layer 8), then declines to ~30%
- **Q-Anchored (HotpotQA)**:
- Solid purple line
- Starts at ~70%, peaks at ~90% (Layer 6), then declines to ~60%
- **A-Anchored (HotpotQA)**:
- Dashed gray line
- Starts at ~50%, peaks at ~70% (Layer 14), then declines to ~40%
- **Q-Anchored (NQ)**:
- Solid pink line
- Starts at ~55%, peaks at ~75% (Layer 3), then declines to ~50%
- **A-Anchored (NQ)**:
- Dashed red line
- Starts at ~45%, peaks at ~65% (Layer 7), then declines to ~40%
#### Llama-3.2-3B (Right Chart)
- **Q-Anchored (PopQA)**:
- Solid blue line
- Starts at ~75%, peaks at ~95% (Layer 10), then declines to ~65%
- **A-Anchored (PopQA)**:
- Dashed orange line
- Starts at ~55%, peaks at ~75% (Layer 15), then declines to ~50%
- **Q-Anchored (TriviaQA)**:
- Solid green line
- Starts at ~65%, peaks at ~85% (Layer 20), then declines to ~60%
- **A-Anchored (TriviaQA)**:
- Dashed brown line
- Starts at ~45%, peaks at ~70% (Layer 22), then declines to ~40%
- **Q-Anchored (HotpotQA)**:
- Solid purple line
- Starts at ~80%, peaks at ~100% (Layer 18), then declines to ~70%
- **A-Anchored (HotpotQA)**:
- Dashed gray line
- Starts at ~60%, peaks at ~80% (Layer 24), then declines to ~50%
- **Q-Anchored (NQ)**:
- Solid pink line
- Starts at ~60%, peaks at ~85% (Layer 12), then declines to ~55%
- **A-Anchored (NQ)**:
- Dashed red line
- Starts at ~50%, peaks at ~70% (Layer 16), then declines to ~45%
### Key Observations
1. **Model Size Impact**:
- 3B model shows higher peak accuracies (up to 100% vs 90% in 1B)
- 3B model reaches its peaks at deeper layers (e.g., HotpotQA Q-Anchored peaks at Layer 18 vs Layer 6 in 1B)
2. **Anchoring Method Trends**:
- Q-Anchored methods consistently outperform A-Anchored across datasets
- A-Anchored methods show more gradual declines after initial peaks
3. **Dataset Variability**:
- HotpotQA Q-Anchored shows most dramatic peaks (100% in 3B model)
- PopQA Q-Anchored shows the sharpest early dip in the 1B model (~85% → ~40% at Layer 5)
4. **Layer-Specific Patterns**:
- Early layers (0–5) show higher variability in both models
- Later layers (10–15 for 1B; 15–20 for 3B) show more stable performance
### Interpretation
The data suggests that:
1. **Model Size Enhances Performance**: The 3B model achieves higher peak accuracies than the 1B model, and its peaks occur at deeper layers.
2. **Q-Anchored Superiority**: Q-Anchored methods consistently outperform A-Anchored across all datasets, suggesting question-specific anchoring provides better context retention.
3. **Dataset-Specific Behavior**:
- HotpotQA benefits most from Q-Anchored methods (reaching 100% accuracy in 3B model)
- NQ dataset shows the most unstable performance, possibly due to its open-ended nature
4. **Layer Dynamics**: Early layers (0–5) may represent initial context processing, while later layers (10–15 for 1B; 15–20 for 3B) show more stable accuracy.
The graphs highlight the importance of anchoring strategy and model scale in transformer-based QA systems, with larger models offering higher potential but requiring careful layer management.
</details>
<details>
<summary>x58.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Model Layers for Llama-3-8B and Llama-3-70B
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer model layers for two Llama-3 variants (8B and 70B parameters). Each graph tracks eight distinct data series, one per combination of anchoring method (Q-Anchored/A-Anchored) and dataset (PopQA, TriviaQA, HotpotQA, NQ). The graphs show significant variability in accuracy across layers, with notable differences between model sizes.
### Components/Axes
- **X-axis (Layer)**:
- Llama-3-8B: 0–30 (discrete increments)
- Llama-3-70B: 0–80 (discrete increments)
- **Y-axis (Answer Accuracy)**: 0–100% (continuous scale)
- **Legends**:
- Position: Bottom of both charts
- Entries (color/style):
1. Q-Anchored (PopQA): Solid blue
2. Q-Anchored (TriviaQA): Dotted green
3. Q-Anchored (HotpotQA): Dashed purple
4. Q-Anchored (NQ): Dotted pink
5. A-Anchored (PopQA): Dashed orange
6. A-Anchored (TriviaQA): Dotted gray
7. A-Anchored (HotpotQA): Dashed red
8. A-Anchored (NQ): Dotted brown
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **Q-Anchored (PopQA)**: Peaks at ~90% accuracy in layer 10, drops to ~40% by layer 30 with high volatility.
- **A-Anchored (PopQA)**: Starts at ~50%, fluctuates between 30–60% with sharp dips (e.g., layer 15: ~20%).
- **Q-Anchored (TriviaQA)**: Stable ~70–80% until layer 20, then erratic (60–90%).
- **A-Anchored (TriviaQA)**: Gradual decline from ~60% to ~40% with oscillations.
- **Q-Anchored (HotpotQA)**: Sharp initial drop from ~80% to ~50% by layer 10, then stabilizes ~60–70%.
- **A-Anchored (HotpotQA)**: Starts ~70%, declines to ~40% by layer 30 with jagged patterns.
- **Q-Anchored (NQ)**: Peaks ~95% at layer 5, crashes to ~30% by layer 30.
- **A-Anchored (NQ)**: Starts ~60%, declines to ~20% with irregular fluctuations.
#### Llama-3-70B (Right Chart)
- **Q-Anchored (PopQA)**: Starts ~85%, stabilizes ~70–80% after layer 20.
- **A-Anchored (PopQA)**: Starts ~60%, stabilizes ~50–60% after layer 40.
- **Q-Anchored (TriviaQA)**: Peaks ~90% at layer 10, declines to ~70% by layer 80.
- **A-Anchored (TriviaQA)**: Starts ~70%, declines to ~50% with moderate volatility.
- **Q-Anchored (HotpotQA)**: Starts ~80%, stabilizes ~65–75% after layer 30.
- **A-Anchored (HotpotQA)**: Starts ~75%, declines to ~50% with oscillations.
- **Q-Anchored (NQ)**: Peaks ~95% at layer 5, declines to ~60% by layer 80.
- **A-Anchored (NQ)**: Starts ~65%, declines to ~40% with irregular drops.
### Key Observations
1. **Model Size Impact**: Llama-3-70B shows more stable accuracy trends compared to Llama-3-8B, particularly in later layers.
2. **Anchoring Method Differences**:
- Q-Anchored methods generally outperform A-Anchored in early layers but show diminishing returns in larger models.
- A-Anchored methods exhibit greater volatility in smaller models (8B).
3. **Dataset Variability**:
- NQ (Natural Questions) shows the most dramatic drops in accuracy across both models.
- In the 70B model, Q-Anchored PopQA stabilizes at ~70–80% in later layers, slightly above HotpotQA (~65–75%); in the 8B model it instead drops to ~40% by layer 30.
4. **Layer-Specific Anomalies**:
- Layer 5 in Llama-3-8B (Q-Anchored NQ) shows an outlier peak at ~95%.
- Layer 15 in Llama-3-8B (A-Anchored PopQA) has a sharp dip to ~20%.
### Interpretation
The data suggests that model size (70B vs. 8B) correlates with improved stability in answer accuracy across layers, particularly for Q-Anchored methods. However, anchoring effectiveness varies significantly by dataset:
- **PopQA** benefits most from Q-Anchoring in smaller models but shows diminishing returns in larger models.
- **NQ** exhibits the most instability, with Q-Anchored methods collapsing in later layers despite initial promise.
- The A-Anchored methods' volatility in the 8B model implies architectural limitations in smaller transformers for maintaining consistent reasoning chains.
These patterns highlight the importance of dataset-specific anchoring strategies and suggest that larger models may better preserve Q-Anchored performance but still struggle with dataset heterogeneity. The sharp declines in NQ accuracy across all layers warrant further investigation into question complexity and model attention mechanisms.
</details>
<details>
<summary>x59.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Mistral-7B Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across the layers of the Mistral-7B model (versions v0.1 and v0.3). Each graph plots answer accuracy (0–100%) against layer index (x-axis range 0–30). The data is segmented by QA datasets (PopQA, TriviaQA, HotpotQA, NQ) and anchoring methods (Q-Anchored vs. A-Anchored).
---
### Components/Axes
- **X-axis**: "Layer" (0–30), representing model layers.
- **Y-axis**: "Answer Accuracy" (0–100%), with gridlines at 20, 40, 60, 80, 100.
- **Legends**:
- **Left Chart (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed gray: A-Anchored (HotpotQA)
- Solid pink: Q-Anchored (NQ)
- Dashed black: A-Anchored (NQ)
- **Right Chart (v0.3)**: Same legend as left chart.
---
### Detailed Analysis
#### Left Chart (Mistral-7B-v0.1)
- **Q-Anchored (PopQA)**: Starts at ~80% accuracy, dips to ~40% at layer 10, then fluctuates between 50–70%.
- **A-Anchored (PopQA)**: Begins at ~30%, peaks at ~60% at layer 10, then drops to ~20% by layer 30.
- **Q-Anchored (TriviaQA)**: Starts at ~70%, dips to ~30% at layer 10, then rises to ~60% by layer 30.
- **A-Anchored (TriviaQA)**: Begins at ~20%, peaks at ~50% at layer 10, then declines to ~10% by layer 30.
- **Q-Anchored (HotpotQA)**: Starts at ~75%, dips to ~40% at layer 10, then stabilizes at ~60%.
- **A-Anchored (HotpotQA)**: Begins at ~25%, peaks at ~55% at layer 10, then drops to ~20%.
- **Q-Anchored (NQ)**: Highly erratic, with sharp drops (e.g., ~90% → ~10% at layer 5) and peaks (e.g., ~80% at layer 20).
- **A-Anchored (NQ)**: Smoother than Q-Anchored, with a peak of ~40% at layer 10 and a decline to ~20% by layer 30.
#### Right Chart (Mistral-7B-v0.3)
- **Q-Anchored (PopQA)**: Starts at ~85%, dips to ~45% at layer 10, then fluctuates between 50–75%.
- **A-Anchored (PopQA)**: Begins at ~35%, peaks at ~65% at layer 10, then drops to ~25%.
- **Q-Anchored (TriviaQA)**: Starts at ~75%, dips to ~35% at layer 10, then rises to ~65% by layer 30.
- **A-Anchored (TriviaQA)**: Begins at ~25%, peaks at ~55% at layer 10, then declines to ~15%.
- **Q-Anchored (HotpotQA)**: Starts at ~80%, dips to ~45% at layer 10, then stabilizes at ~70%.
- **A-Anchored (HotpotQA)**: Begins at ~30%, peaks at ~60% at layer 10, then drops to ~25%.
- **Q-Anchored (NQ)**: Similar erratic pattern to v0.1, with a sharp drop to ~10% at layer 5 and a peak of ~85% at layer 20.
- **A-Anchored (NQ)**: Smoother than Q-Anchored, with a peak of ~45% at layer 10 and a decline to ~25%.
---
### Key Observations
1. **Q-Anchored vs. A-Anchored**:
- Q-Anchored methods generally show higher peak accuracy but greater volatility (e.g., NQ dataset drops from ~90% to ~10% in v0.1).
- A-Anchored methods are more stable but consistently lower in accuracy (e.g., A-Anchored (PopQA) peaks at ~60% vs. Q-Anchored’s ~80%).
2. **Model Version Differences**:
- v0.3 shows slightly higher baseline accuracy for Q-Anchored methods (e.g., PopQA starts at ~85% vs. v0.1’s ~80%).
- A-Anchored methods in v0.3 have marginally higher peaks (e.g., A-Anchored (PopQA) peaks at ~65% vs. v0.1’s ~60%).
3. **NQ Dataset Anomalies**:
- Q-Anchored (NQ) exhibits extreme fluctuations, suggesting instability in handling this dataset.
- A-Anchored (NQ) is less volatile but still underperforms compared to other datasets.
---
### Interpretation
The data suggests that **Q-Anchored curves** reach higher accuracy in specific layers but are prone to instability, particularly on the NQ dataset. **A-Anchored curves** are more consistent but lower overall. The slightly higher values in v0.3 (e.g., higher starting accuracy for Q-Anchored) indicate only minor differences between the two model versions. The NQ dataset’s erratic behavior highlights challenges in generalizing across diverse QA tasks.
**Notable Trends**:
- Q-Anchored curves typically dip around layer 10 (e.g., PopQA: ~80% → ~40% in v0.1), the same depth at which A-Anchored curves peak.
- A-Anchored methods show a "peak-and-decline" pattern, possibly due to overfitting or layer-specific limitations.
This analysis underscores the trade-off between accuracy and stability in model design, with anchoring methods playing a pivotal role in performance.
</details>
Figure 24: Comparisons of answer accuracy between pathways, probing attention activations of the final token.
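The per-pathway comparison behind Figures 24 and 25 amounts to computing answer accuracy separately for each group of instances. The following is a minimal sketch under the assumption that every instance already carries a pathway label and a correctness flag. How instances are assigned to a pathway at each layer is not specified in this section, so the labels, the `accuracy_by_pathway` helper, and the example records are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_pathway(records):
    """Map each pathway label to its answer accuracy in percent.

    records: iterable of (pathway, is_correct) pairs, where `pathway` is a
    label such as "Q-Anchored" and `is_correct` is a bool.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for pathway, is_correct in records:
        totals[pathway] += 1
        correct[pathway] += int(is_correct)
    return {p: 100.0 * correct[p] / totals[p] for p in totals}


# Hypothetical instances for a single layer, for illustration only.
records = [
    ("Q-Anchored", True), ("Q-Anchored", True),
    ("Q-Anchored", True), ("Q-Anchored", False),
    ("A-Anchored", True), ("A-Anchored", False),
    ("A-Anchored", False), ("A-Anchored", False),
]

for pathway, acc in accuracy_by_pathway(records).items():
    print(f"{pathway}: {acc:.1f}% answer accuracy")
```

Repeating this for every layer and dataset would give one accuracy curve per pathway, which is the kind of series the figures plot.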
<details>
<summary>x60.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Llama-3.2 Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer model layers for two Llama-3.2 variants (1B and 3B parameters). Each graph shows multiple data series representing different question-answering datasets (PopQA, TriviaQA, HotpotQA) and anchoring methods (Q-Anchored vs A-Anchored). The graphs use color-coded lines with shaded confidence intervals to visualize performance trends.
### Components/Axes
- **X-axis (Layer)**:
- Left chart: 0–15 (Llama-3.2-1B)
- Right chart: 0–25 (Llama-3.2-3B)
- **Y-axis (Answer Accuracy)**: 0–100% (both charts)
- **Legends**:
- Positioned at bottom of each chart
- Line styles/colors:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted orange: Q-Anchored (HotpotQA)
- Solid red: A-Anchored (PopQA)
- Dashed gray: A-Anchored (TriviaQA)
- Dotted purple: A-Anchored (HotpotQA)
- Dashed black: Q-Anchored (NoQA)
- Dotted gray: A-Anchored (NoQA)
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **Q-Anchored (PopQA)**: Blue line shows peak accuracy ~85% at layer 10, with sharp drops at layers 5 and 15. Confidence interval (shaded blue) widens significantly at layer 15.
- **A-Anchored (PopQA)**: Orange dashed line remains stable at ~50–60% accuracy, with minimal fluctuations.
- **Q-Anchored (TriviaQA)**: Green dashed line peaks at ~70% at layer 8, then declines sharply to ~30% by layer 15.
- **A-Anchored (TriviaQA)**: Gray dashed line shows gradual decline from ~60% to ~40% across layers.
- **Q-Anchored (HotpotQA)**: Dotted orange line peaks at ~75% at layer 12, with erratic fluctuations.
- **A-Anchored (HotpotQA)**: Dotted purple line shows moderate performance (~50–60%) with a notable dip at layer 10.
- **NoQA Baselines**:
- Q-Anchored (NoQA): Black dashed line hovers ~40–50%.
- A-Anchored (NoQA): Gray dotted line remains flat at ~30%.
#### Llama-3.2-3B (Right Chart)
- **Q-Anchored (PopQA)**: Blue line stays at ~80–90% accuracy for most layers, with a sharp drop to ~60% at layer 20.
- **A-Anchored (PopQA)**: Orange dashed line shows gradual decline from ~65% to ~40%.
- **Q-Anchored (TriviaQA)**: Green dashed line peaks at ~75% at layer 10, then declines to ~50% by layer 25.
- **A-Anchored (TriviaQA)**: Gray dashed line remains stable at ~50–60%.
- **Q-Anchored (HotpotQA)**: Dotted orange line peaks at ~80% at layer 15, with significant volatility.
- **A-Anchored (HotpotQA)**: Dotted purple line shows erratic performance (~40–70%) with a sharp drop at layer 20.
- **NoQA Baselines**:
- Q-Anchored (NoQA): Black dashed line hovers ~50–60%.
- A-Anchored (NoQA): Gray dotted line remains flat at ~35%.
### Key Observations
1. **Model Size Impact**: Llama-3.2-3B generally shows higher baseline accuracy than Llama-3.2-1B, particularly in Q-Anchored configurations.
2. **Dataset Sensitivity**:
- PopQA performs best with Q-Anchored methods in both models.
- HotpotQA shows the most volatility, especially in the 3B model.
3. **Layer Dependency**:
- Accuracy peaks cluster around layers 8–15 for 1B and 10–15 for 3B.
- Performance declines sharply after layer 15 in the 3B model.
4. **Anchoring Method**: Q-Anchored generally outperforms A-Anchored across datasets, including the NoQA series (~40–60% vs. ~30–35%).
### Interpretation
The data suggests that:
- **Q-Anchored methods** leverage model capacity more effectively, particularly for complex datasets like HotpotQA.
- **Larger models (3B)** maintain higher accuracy but show greater sensitivity to layer depth, with performance drops in later layers.
- **NoQA baselines** indicate that anchoring methods provide meaningful improvements over random guessing, especially for TriviaQA and HotpotQA.
- The sharp declines in accuracy at specific layers (e.g., layer 15 in 1B, layer 20 in 3B) may reflect architectural bottlenecks or dataset-specific challenges in deeper layers.
*Note: All values are approximate due to the absence of gridlines and exact numerical labels. Confidence intervals suggest measurement uncertainty, particularly in volatile regions.*
</details>
<details>
<summary>x61.png Details</summary>

### Visual Description
## Line Graphs: Answer Accuracy Across Layers in Llama-3 Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer model layers for different question-answering (QA) datasets. The left graph represents the Llama-3-8B model (x-axis range 0–30), while the right graph represents the Llama-3-70B model (x-axis range 0–80). Each graph shows multiple data series with distinct line styles and colors, representing different QA datasets and anchoring methods.
### Components/Axes
- **X-axis (Layer)**:
- Left graph: 0–30 (Llama-3-8B)
- Right graph: 0–80 (Llama-3-70B)
- **Y-axis (Answer Accuracy)**: 0–100% (both graphs)
- **Legends**:
- **Q-Anchored (PopQA)**: Solid blue line
- **A-Anchored (PopQA)**: Dashed orange line
- **Q-Anchored (TriviaQA)**: Solid green line
- **A-Anchored (TriviaQA)**: Dashed brown line
- **Q-Anchored (HotpotQA)**: Solid purple line
- **A-Anchored (HotpotQA)**: Dashed red line
- **Q-Anchored (NQ)**: Solid pink line
- **A-Anchored (NQ)**: Dashed gray line
### Detailed Analysis
#### Llama-3-8B (Left Graph)
- **Q-Anchored (PopQA)**: Starts at ~80% accuracy, dips to ~40% at layer 10, then fluctuates between 50–70%.
- **A-Anchored (PopQA)**: Begins at ~60%, drops to ~20% at layer 10, then stabilizes near 40–60%.
- **Q-Anchored (TriviaQA)**: Peaks at ~90% at layer 5, then declines to ~50% by layer 30.
- **A-Anchored (TriviaQA)**: Starts at ~70%, dips to ~30% at layer 15, then recovers to ~50–70%.
- **Q-Anchored (HotpotQA)**: Oscillates between 60–80%, with sharp drops at layers 10 and 25.
- **A-Anchored (HotpotQA)**: Starts at ~50%, fluctuates between 30–60%, with a notable spike at layer 20.
- **Q-Anchored (NQ)**: Peaks at ~75% at layer 15, then declines to ~40% by layer 30.
- **A-Anchored (NQ)**: Starts at ~55%, dips to ~25% at layer 10, then stabilizes near 40–60%.
#### Llama-3-70B (Right Graph)
- **Q-Anchored (PopQA)**: Starts at ~75%, dips to ~30% at layer 20, then fluctuates between 50–75%.
- **A-Anchored (PopQA)**: Begins at ~65%, drops to ~25% at layer 25, then stabilizes near 40–65%.
- **Q-Anchored (TriviaQA)**: Peaks at ~85% at layer 10, then declines to ~45% by layer 80.
- **A-Anchored (TriviaQA)**: Starts at ~75%, dips to ~35% at layer 30, then recovers to ~50–75%.
- **Q-Anchored (HotpotQA)**: Oscillates between 55–75%, with sharper drops at layers 40 and 60.
- **A-Anchored (HotpotQA)**: Starts at ~55%, fluctuates between 35–65%, with a spike at layer 50.
- **Q-Anchored (NQ)**: Peaks at ~70% at layer 30, then declines to ~35% by layer 80.
- **A-Anchored (NQ)**: Starts at ~50%, dips to ~20% at layer 20, then stabilizes near 40–60%.
### Key Observations
1. **Fluctuating Accuracy**: All datasets show significant variability across layers, with no consistent upward or downward trend.
2. **Q-Anchored vs. A-Anchored**: Q-Anchored methods generally outperform A-Anchored in early layers but exhibit similar variability in later layers.
3. **Model Size Impact**: The 70B model shows more pronounced fluctuations in later layers (e.g., layer 60–80) compared to the 8B model.
4. **Dataset-Specific Patterns**:
- **PopQA**: Sharp early-layer drops in both models.
- **TriviaQA**: High early peaks followed by declines.
- **HotpotQA**: Persistent mid-range fluctuations.
- **NQ**: Gradual decline in later layers.
### Interpretation
The data suggests that anchoring methods (Q-Anchored vs. A-Anchored) influence answer accuracy differently across datasets and model sizes. Q-Anchored methods often show stronger performance in early layers but fail to maintain consistency, while A-Anchored methods exhibit more stability in later layers. The Llama-3-70B model’s increased layer count amplifies variability, indicating that larger models may struggle with layer-specific task alignment. The fluctuations highlight the complexity of transformer architectures, where certain layers may specialize in specific tasks (e.g., TriviaQA’s early-layer peaks suggest specialization in factual recall). These patterns underscore the need for dataset-specific tuning and layer-wise analysis to optimize QA performance.
</details>
<details>
<summary>x62.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy Across Layers for Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two line charts comparing answer accuracy across the layers of the Mistral-7B model (versions v0.1 and v0.3). Each chart displays multiple data series representing different question-answering (QA) datasets and anchoring methods (Q-Anchored vs. A-Anchored). The y-axis measures answer accuracy (0–100%), while the x-axis represents layer index (0–30). The charts highlight variability in performance across layers and datasets.
---
### Components/Axes
- **X-axis (Layer)**: Labeled "Layer" with ticks at 0, 10, 20, 30.
- **Y-axis (Answer Accuracy)**: Labeled "Answer Accuracy" with ticks at 0, 20, 40, 60, 80, 100.
- **Legends**:
- **Left Chart (v0.1)**:
- Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ).
- Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ).
- **Right Chart (v0.3)**:
- Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ).
- Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ).
- **Titles**:
- Left: "Mistral-7B-v0.1"
- Right: "Mistral-7B-v0.3"
---
### Detailed Analysis
#### Left Chart (Mistral-7B-v0.1)
- **Q-Anchored (PopQA)**: Starts at ~80% accuracy, dips to ~40% at layer 10, then fluctuates between ~50–70% (peak ~75% at layer 20).
- **A-Anchored (PopQA)**: Starts at ~60%, dips to ~30% at layer 10, then stabilizes around ~40–50%.
- **Q-Anchored (TriviaQA)**: Peaks at ~90% at layer 5, drops to ~30% at layer 15, then recovers to ~70% at layer 30.
- **A-Anchored (TriviaQA)**: Starts at ~50%, dips to ~20% at layer 10, then fluctuates between ~30–50%.
- **Q-Anchored (HotpotQA)**: Peaks at ~85% at layer 10, drops to ~40% at layer 20, then recovers to ~70% at layer 30.
- **A-Anchored (HotpotQA)**: Starts at ~55%, dips to ~25% at layer 15, then stabilizes around ~40–50%.
- **Q-Anchored (NQ)**: Peaks at ~95% at layer 5, drops to ~30% at layer 15, then recovers to ~75% at layer 30.
- **A-Anchored (NQ)**: Starts at ~65%, dips to ~20% at layer 10, then fluctuates between ~30–50%.
#### Right Chart (Mistral-7B-v0.3)
- **Q-Anchored (PopQA)**: Starts at ~70%, dips to ~40% at layer 10, then fluctuates between ~50–70% (peak ~75% at layer 20).
- **A-Anchored (PopQA)**: Starts at ~60%, dips to ~30% at layer 10, then stabilizes around ~40–50%.
- **Q-Anchored (TriviaQA)**: Peaks at ~85% at layer 5, drops to ~35% at layer 15, then recovers to ~70% at layer 30.
- **A-Anchored (TriviaQA)**: Starts at ~50%, dips to ~25% at layer 10, then fluctuates between ~30–50%.
- **Q-Anchored (HotpotQA)**: Peaks at ~80% at layer 10, drops to ~45% at layer 20, then recovers to ~70% at layer 30.
- **A-Anchored (HotpotQA)**: Starts at ~55%, dips to ~25% at layer 15, then stabilizes around ~40–50%.
- **Q-Anchored (NQ)**: Peaks at ~90% at layer 5, drops to ~35% at layer 15, then recovers to ~75% at layer 30.
- **A-Anchored (NQ)**: Starts at ~65%, dips to ~20% at layer 10, then fluctuates between ~30–50%.
---
### Key Observations
1. **Layer-Specific Variability**: Accuracy fluctuates significantly across layers, with sharp drops and recoveries (e.g., TriviaQA Q-Anchored in v0.1 drops ~60 percentage points, from ~90% at layer 5 to ~30% at layer 15).
2. **Anchoring Impact**: Q-Anchored methods generally outperform A-Anchored in most datasets, though performance varies by layer.
3. **Dataset Sensitivity**: NQ (Natural Questions) shows the highest peaks (up to 95%) but also the steepest drops.
4. **Model Version Differences**: v0.3 exhibits slightly more stable trends compared to v0.1, with less extreme dips.
---
### Interpretation
The charts suggest that anchoring methods (Q vs. A) and dataset types (e.g., NQ vs. PopQA) significantly influence model performance. Q-Anchored approaches consistently achieve higher accuracy peaks, but both methods show layer-specific instability. The v0.3 model appears more robust, with reduced volatility in accuracy. The sharp drops (e.g., TriviaQA Q-Anchored in v0.1) may indicate architectural or training-related bottlenecks in specific layers. These patterns highlight the importance of layer-specific optimization for QA tasks.
</details>
Figure 25: Comparisons of answer accuracy between pathways, probing attention activations of the token immediately preceding the exact answer tokens.
<details>
<summary>x63.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Transformer Layers for Llama-3.2 Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer layers for two versions of the Llama-3.2 model (1B and 3B parameters). Each graph shows multiple data series representing different question-answering (QA) datasets and anchoring methods (Q-Anchored vs. A-Anchored). The graphs use color-coded lines with shaded confidence intervals to visualize performance trends.
### Components/Axes
- **X-axis (Layer)**:
- Left chart: 0–15 (Llama-3.2-1B)
- Right chart: 0–25 (Llama-3.2-3B)
- **Y-axis (Answer Accuracy)**: 0–100% (both charts)
- **Legends**:
- **Left Chart (Llama-3.2-1B)**:
- Blue solid: Q-Anchored (PopQA)
- Green dotted: Q-Anchored (TriviaQA)
- Orange dashed: A-Anchored (PopQA)
- Red dotted: A-Anchored (TriviaQA)
- Purple dashed: Q-Anchored (HotpotQA)
- Pink dotted: Q-Anchored (NQ)
- Gray dashed: A-Anchored (HotpotQA)
- Black dashed: A-Anchored (NQ)
- **Right Chart (Llama-3.2-3B)**:
- Same datasets/methods as left chart but with extended layer range.
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
1. **Q-Anchored (PopQA)**:
- Blue solid line peaks at ~85% accuracy around layer 10, then declines to ~60% by layer 15.
- Confidence interval (shaded blue) narrows after layer 10.
2. **Q-Anchored (TriviaQA)**:
- Green dotted line peaks at ~75% around layer 5, drops to ~50% by layer 15.
3. **A-Anchored (PopQA)**:
- Orange dashed line peaks at ~65% around layer 10, drops to ~40% by layer 15.
4. **Q-Anchored (HotpotQA)**:
- Purple dashed line peaks at ~70% around layer 15, with high variability (60–80%).
5. **Q-Anchored (NQ)**:
- Pink dotted line remains stable at ~50–60% across all layers.
6. **A-Anchored (HotpotQA/NQ)**:
- Gray/black dashed lines show lower accuracy (~40–50%) than Q-Anchored counterparts.
#### Llama-3.2-3B (Right Chart)
1. **Q-Anchored (PopQA)**:
- Blue solid line peaks at ~90% around layer 10, drops to ~70% by layer 25.
2. **Q-Anchored (TriviaQA)**:
- Green dotted line peaks at ~80% around layer 5, declines to ~60% by layer 25.
3. **A-Anchored (PopQA)**:
- Orange dashed line peaks at ~70% around layer 20, drops to ~50% by layer 25.
4. **Q-Anchored (HotpotQA)**:
- Purple dashed line peaks at ~85% around layer 20, with sharp drops (60–90%).
5. **Q-Anchored (NQ)**:
- Pink dotted line remains stable at ~55–65% across all layers.
6. **A-Anchored (HotpotQA/NQ)**:
- Gray/black dashed lines show lower accuracy (~45–55%) than Q-Anchored.
### Key Observations
1. **Model Size Impact**:
- Llama-3.2-3B generally achieves higher peak accuracy than Llama-3.2-1B (e.g., PopQA Q-Anchored peaks at ~90% vs. ~85%), and its curve ends higher (~70% at layer 25 vs. ~60% at layer 15).
2. **Dataset Variability**:
- HotpotQA shows the highest variability in both models, which may reflect the difficulty of multi-hop reasoning.
3. **Anchoring Method**:
- Q-Anchored consistently outperforms A-Anchored across datasets (e.g., PopQA Q-Anchored peaks at 85–90% vs. A-Anchored at 65–70%).
4. **Layer-Specific Trends**:
- Accuracy often peaks in middle layers (5–20) before declining, indicating potential overfitting or context retention limits.
### Interpretation
The description indicates that **Q-Anchored methods** (using question context) generally reach higher accuracy than **A-Anchored methods** (using answer context) on the datasets and model sizes shown (e.g., PopQA peaks of ~85–90% vs. ~65–70%). Llama-3.2-3B reaches higher peaks than the 1B version, particularly on HotpotQA (~85% vs. ~70%). The sharp swings are reported for the Q-Anchored HotpotQA curve (60–90% in Llama-3.2-3B), whereas the A-Anchored HotpotQA and NQ curves stay in a lower, narrower band (~40–55%). The NQ Q-Anchored curve is stable in both models (~50–60% and ~55–65%). In both models, accuracy tends to peak in the middle layers (5–20) before declining.
</details>
<details>
<summary>x64.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Llama-3-8B and Llama-3-70B Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across neural network layers for two versions of the Llama-3 model (8B and 70B parameters). Each graph tracks performance across multiple anchoring methods (Q-Anchored and A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The graphs show significant variability in accuracy across layers, with distinct patterns emerging between model sizes.
### Components/Axes
- **X-axis (Layer)**:
- Llama-3-8B: 0–30 layers
- Llama-3-70B: 0–80 layers
- **Y-axis (Answer Accuracy)**: 0–100% scale
- **Legends**:
- **Line styles/colors**:
- Solid lines: Q-Anchored methods
- Dashed lines: A-Anchored methods
- Colors correspond to datasets:
- Blue: PopQA
- Green: TriviaQA
- Purple: HotpotQA
- Red: NQ
- Legend positioned at bottom center, spanning both charts
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **Q-Anchored (PopQA)**: Starts at ~80% accuracy, drops sharply to ~20% at layer 5, then fluctuates between 40–80% with peaks at layers 10, 15, and 25.
- **A-Anchored (PopQA)**: Begins at ~50%, dips to ~30% at layer 5, then stabilizes between 30–50% with minor oscillations.
- **Q-Anchored (TriviaQA)**: Peaks at ~90% at layer 10, crashes to ~10% at layer 15, then recovers to ~70% by layer 30.
- **A-Anchored (TriviaQA)**: Starts at ~60%, drops to ~40% at layer 10, then stabilizes between 40–60%.
- **Q-Anchored (HotpotQA)**: Sharp drop from ~70% to ~20% at layer 5, followed by erratic fluctuations between 30–70%.
- **A-Anchored (HotpotQA)**: Starts at ~50%, dips to ~30% at layer 5, then stabilizes between 30–50%.
- **Q-Anchored (NQ)**: Begins at ~60%, drops to ~20% at layer 5, then fluctuates between 40–60%.
- **A-Anchored (NQ)**: Starts at ~50%, dips to ~30% at layer 5, then stabilizes between 30–50%.
#### Llama-3-70B (Right Chart)
- **Q-Anchored (PopQA)**: Starts at ~90%, drops to ~30% at layer 10, then fluctuates between 50–90% with peaks at layers 20, 40, and 60.
- **A-Anchored (PopQA)**: Begins at ~60%, dips to ~40% at layer 10, then stabilizes between 40–60%.
- **Q-Anchored (TriviaQA)**: Peaks at ~95% at layer 20, crashes to ~15% at layer 30, then recovers to ~80% by layer 70.
- **A-Anchored (TriviaQA)**: Starts at ~70%, dips to ~50% at layer 20, then stabilizes between 50–70%.
- **Q-Anchored (HotpotQA)**: Sharp drop from ~80% to ~25% at layer 10, followed by erratic fluctuations between 40–80%.
- **A-Anchored (HotpotQA)**: Starts at ~60%, dips to ~40% at layer 10, then stabilizes between 40–60%.
- **Q-Anchored (NQ)**: Begins at ~70%, drops to ~20% at layer 10, then fluctuates between 50–70%.
- **A-Anchored (NQ)**: Starts at ~60%, dips to ~40% at layer 10, then stabilizes between 40–60%.
### Key Observations
1. **Model Size Impact**: The 70B model shows more pronounced fluctuations in accuracy across layers compared to the 8B model.
2. **Anchoring Method Differences**:
- Q-Anchored methods generally start with higher accuracy but experience sharper drops in early layers.
- A-Anchored methods exhibit more stability but lower baseline accuracy.
3. **Dataset Variability**:
- HotpotQA and NQ datasets show the most erratic patterns, particularly in the 70B model.
- TriviaQA demonstrates the highest peaks in accuracy for both models.
4. **Confidence Intervals**: Shaded regions (not explicitly labeled) suggest wider uncertainty in the 70B model’s predictions.
### Interpretation
The data suggests that larger models (70B) exhibit greater layer-to-layer variability in answer accuracy, potentially due to increased complexity or overfitting. Q-Anchored methods outperform A-Anchored methods in early layers but become less reliable in deeper layers for certain datasets. The TriviaQA dataset consistently shows the highest accuracy peaks, indicating it may be better aligned with the models’ training data. The sharp drops in accuracy for Q-Anchored methods at specific layers (e.g., layer 5 in 8B, layer 10 in 70B) could reflect architectural bottlenecks or dataset-specific challenges. The stability of A-Anchored methods across layers implies they may be more robust to model size changes but sacrifice peak performance. Further investigation into dataset-model alignment and anchoring strategy trade-offs is warranted.
</details>
<details>
<summary>x65.png Details</summary>

### Visual Description
## Line Chart: Answer Accuracy Across Layers for Mistral-7B Models
### Overview
The image contains two side-by-side line charts comparing answer accuracy across layers (0–30) for two versions of the Mistral-7B model (v0.1 and v0.3). Each chart includes multiple data series representing different anchoring strategies (Q-Anchored and A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis measures answer accuracy (0–100%), and the x-axis represents model layers.
---
### Components/Axes
- **Left Chart Title**: "Mistral-7B-v0.1"
- **Right Chart Title**: "Mistral-7B-v0.3"
- **Y-Axis**: "Answer Accuracy" (0–100%)
- **X-Axis**: "Layer" (0–30)
- **Legend**: Located at the bottom of both charts, with the following entries:
- **Solid Lines**:
- Blue: Q-Anchored (PopQA)
- Green: Q-Anchored (TriviaQA)
- Purple: Q-Anchored (HotpotQA)
- Pink: Q-Anchored (NQ)
- **Dashed Lines**:
- Orange: A-Anchored (PopQA)
- Red: A-Anchored (TriviaQA)
- Gray: A-Anchored (HotpotQA)
- Black: A-Anchored (NQ)
---
### Detailed Analysis
#### Mistral-7B-v0.1 (Left Chart)
- **Q-Anchored (PopQA)**: Starts at ~80% accuracy, dips sharply to ~40% at layer 5, then stabilizes near 80% by layer 30.
- **A-Anchored (PopQA)**: Peaks at ~60% at layer 10, drops to ~20% at layer 15, and fluctuates between 20–40% thereafter.
- **Q-Anchored (TriviaQA)**: Begins at ~70%, dips to ~50% at layer 10, then rises to ~80% by layer 30.
- **A-Anchored (TriviaQA)**: Starts at ~50%, drops to ~30% at layer 5, and stabilizes near 40% by layer 30.
- **Q-Anchored (HotpotQA)**: Peaks at ~90% at layer 10, drops to ~60% at layer 15, then recovers to ~80% by layer 30.
- **A-Anchored (HotpotQA)**: Starts at ~60%, dips to ~40% at layer 10, and fluctuates between 40–60% thereafter.
- **Q-Anchored (NQ)**: Starts at ~75%, dips to ~50% at layer 10, then rises to ~85% by layer 30.
- **A-Anchored (NQ)**: Begins at ~55%, drops to ~35% at layer 10, and stabilizes near 50% by layer 30.
#### Mistral-7B-v0.3 (Right Chart)
- **Q-Anchored (PopQA)**: Starts at ~85%, dips to ~60% at layer 10, then stabilizes near 90% by layer 30.
- **A-Anchored (PopQA)**: Peaks at ~65% at layer 10, drops to ~40% at layer 15, and fluctuates between 40–60% thereafter.
- **Q-Anchored (TriviaQA)**: Begins at ~75%, dips to ~55% at layer 10, then rises to ~85% by layer 30.
- **A-Anchored (TriviaQA)**: Starts at ~55%, drops to ~35% at layer 10, and stabilizes near 50% by layer 30.
- **Q-Anchored (HotpotQA)**: Peaks at ~95% at layer 10, drops to ~70% at layer 15, then recovers to ~90% by layer 30.
- **A-Anchored (HotpotQA)**: Starts at ~65%, dips to ~45% at layer 10, and fluctuates between 45–65% thereafter.
- **Q-Anchored (NQ)**: Starts at ~80%, dips to ~60% at layer 10, then rises to ~90% by layer 30.
- **A-Anchored (NQ)**: Begins at ~60%, drops to ~40% at layer 10, and stabilizes near 60% by layer 30.
---
### Key Observations
1. **Version Comparison**:
- Mistral-7B-v0.3 shows more stable and higher accuracy trends compared to v0.1, particularly for Q-Anchored models.
- A-Anchored models in v0.3 exhibit slightly improved stability but remain lower than Q-Anchored counterparts.
2. **Dataset Performance**:
- **PopQA**: Q-Anchored models consistently outperform A-Anchored across both versions.
- **HotpotQA**: Q-Anchored models achieve the highest accuracy (up to ~95% in v0.3), while A-Anchored models lag significantly.
- **NQ**: Q-Anchored models show the most pronounced improvement in v0.3, reaching ~90% accuracy.
3. **Layer Trends**:
- Accuracy often peaks around layer 10–15, followed by fluctuations.
- Sharp drops (e.g., layer 5–10) suggest potential instability in early layers for certain datasets.
---
### Interpretation
The description indicates that **Q-Anchored models outperform A-Anchored models** on all four datasets in both versions. Both pathways score higher in Mistral-7B-v0.3 than in v0.1, so the gap between them stays broadly similar rather than widening (e.g., PopQA: ~80% vs. 20–40% in v0.1, and ~90% vs. 40–60% in v0.3). The dips around layers 5–15 are also shallower in v0.3 (e.g., Q-Anchored PopQA falls to ~60% instead of ~40%), which may reflect differences between the two releases. Overall, Q-Anchored accuracy recovers to 80–90% by layer 30, while A-Anchored accuracy remains lower and keeps fluctuating.
</details>
Figure 26: Comparisons of answer accuracy between pathways, probing attention activations of the last exact answer token.
<details>
<summary>x66.png Details</summary>

### Visual Description
## Line Graphs: Answer Accuracy Across Layers in Llama-3.2 Models
### Overview
The image contains two line graphs comparing answer accuracy across layers for the Llama-3.2-1B and Llama-3.2-3B models. Each graph includes multiple data series representing different question (Q) and answer (A) anchored datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis measures answer accuracy (0–100%), and the x-axis represents model layers (0–15 for 1B, 0–25 for 3B).
### Components/Axes
- **X-axis (Layer)**:
- Left graph: 0–15 (Llama-3.2-1B)
- Right graph: 0–25 (Llama-3.2-3B)
- **Y-axis (Answer Accuracy)**: 0–100%
- **Legends**:
- **Left Graph**:
- Blue: Q-Anchored (PopQA)
- Orange: A-Anchored (PopQA)
- Green: Q-Anchored (TriviaQA)
- Red: A-Anchored (TriviaQA)
- Purple: Q-Anchored (HotpotQA)
- Pink: Q-Anchored (NQ)
- Gray: A-Anchored (HotpotQA)
- Brown: A-Anchored (NQ)
- **Right Graph**: Same legend as left graph.
### Detailed Analysis
#### Llama-3.2-1B (Left Graph)
- **Q-Anchored (PopQA)**: Blue line starts at ~80% accuracy, dips to ~40% at layer 5, then fluctuates between ~50–70% up to layer 15.
- **A-Anchored (PopQA)**: Orange line remains relatively stable, hovering between ~40–60% across all layers.
- **Q-Anchored (TriviaQA)**: Green line starts at ~60%, drops to ~30% at layer 5, then rises to ~70% by layer 15.
- **A-Anchored (TriviaQA)**: Red line fluctuates between ~40–60%, with a peak at ~70% near layer 10.
- **Q-Anchored (HotpotQA)**: Purple line starts at ~70%, dips to ~50% at layer 5, then rises to ~80% by layer 15.
- **Q-Anchored (NQ)**: Pink line starts at ~50%, drops to ~20% at layer 5, then rises to ~70% by layer 15.
- **A-Anchored (HotpotQA)**: Gray line fluctuates between ~40–60%, with a peak at ~70% near layer 10.
- **A-Anchored (NQ)**: Brown line starts at ~30%, drops to ~10% at layer 5, then rises to ~50% by layer 15.
#### Llama-3.2-3B (Right Graph)
- **Q-Anchored (PopQA)**: Blue line starts at ~80%, dips to ~50% at layer 10, then rises to ~90% by layer 25.
- **A-Anchored (PopQA)**: Orange line remains stable between ~40–60% across all layers.
- **Q-Anchored (TriviaQA)**: Green line starts at ~60%, drops to ~30% at layer 10, then rises to ~80% by layer 25.
- **A-Anchored (TriviaQA)**: Red line fluctuates between ~40–60%, with a peak at ~70% near layer 20.
- **Q-Anchored (HotpotQA)**: Purple line starts at ~70%, dips to ~50% at layer 10, then rises to ~90% by layer 25.
- **Q-Anchored (NQ)**: Pink line starts at ~50%, drops to ~20% at layer 10, then rises to ~80% by layer 25.
- **A-Anchored (HotpotQA)**: Gray line fluctuates between ~40–60%, with a peak at ~70% near layer 20.
- **A-Anchored (NQ)**: Brown line starts at ~30%, drops to ~10% at layer 10, then rises to ~60% by layer 25.
### Key Observations
1. **Q-Anchored vs. A-Anchored**: Q-Anchored methods generally show higher accuracy than A-Anchored across most datasets and layers.
2. **Layer-Specific Trends**:
- In Llama-3.2-1B, Q-Anchored (PopQA) and (HotpotQA) show significant dips at layer 5, while A-Anchored methods are more stable.
- In Llama-3.2-3B, Q-Anchored (NQ) and (TriviaQA) exhibit sharper drops at layer 10, followed by recovery.
3. **Model Size Impact**: The Llama-3.2-3B plot (right graph) spans a longer layer axis (0–25 vs. 0–15), but its trends mirror the 1B model, with the dip shifted from about layer 5 to about layer 10.
4. **Uncertainty**: Shaded areas around lines indicate variability, with larger spreads in Q-Anchored methods (e.g., PopQA in 1B).
### Interpretation
The data suggests that **Q-Anchored approaches** outperform A-Anchored methods in answer accuracy (e.g., on PopQA and HotpotQA), particularly in later layers. However, performance varies by dataset:
- **PopQA** and **HotpotQA** show robust Q-Anchored performance, while **NQ** and **TriviaQA** exhibit more volatility.
- The **3B model** (right graph) demonstrates similar trends to the 1B model but with extended layers, indicating scalability.
- **A-Anchored methods** (e.g., on PopQA and TriviaQA) are more consistent across layers but less accurate.
- The **NQ dataset** (Q-Anchored) shows the most dramatic fluctuations, possibly due to its complexity or training data differences.
This analysis highlights the importance of anchoring strategies in model performance, with Q-Anchored methods offering higher accuracy at the cost of variability. Further investigation into dataset-specific training or layer-wise optimization could improve consistency.
</details>
<details>
<summary>x67.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Llama-3-8B and Llama-3-70B Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across layers for two versions of the Llama-3 model (8B and 70B parameters). Each graph tracks performance across 30 layers (8B) or 80 layers (70B) for six distinct methods/dataset combinations. The y-axis represents answer accuracy (0-100%), while the x-axis represents layer depth. Multiple colored lines with distinct styles represent different Q-Anchored and A-Anchored methods applied to specific datasets.
### Components/Axes
- **X-Axis (Layer Depth)**:
- Llama-3-8B: 0–30 layers (discrete increments)
- Llama-3-70B: 0–80 layers (discrete increments)
- **Y-Axis (Answer Accuracy)**: 0–100% (continuous scale)
- **Legends**:
- **Llama-3-8B** (left chart):
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted orange: A-Anchored (PopQA)
- Dashed gray: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dotted pink: Q-Anchored (NQ)
- **Llama-3-70B** (right chart):
- Same legend entries as above, with matching colors/styles
### Detailed Analysis
#### Llama-3-8B (Left Chart)
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at ~90% accuracy, dips to ~70% at layer 10, then stabilizes near 80%.
2. **Q-Anchored (TriviaQA)** (dashed green):
- Peaks at ~95% at layer 5, drops to ~60% by layer 20, then fluctuates between 50–70%.
3. **A-Anchored (PopQA)** (dotted orange):
- Begins at ~50%, rises to ~70% at layer 15, then declines to ~40% by layer 30.
4. **A-Anchored (TriviaQA)** (dashed gray):
- Starts at ~40%, peaks at ~60% at layer 10, then drops to ~30% by layer 30.
5. **Q-Anchored (HotpotQA)** (solid purple):
- Starts at ~80%, dips to ~60% at layer 10, then stabilizes near 70%.
6. **Q-Anchored (NQ)** (dotted pink):
- Highly volatile: starts at ~50%, peaks at ~90% at layer 5, crashes to ~20% at layer 15, then fluctuates between 10–50%.
#### Llama-3-70B (Right Chart)
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at ~95%, dips to ~80% at layer 20, then stabilizes near 90%.
2. **Q-Anchored (TriviaQA)** (dashed green):
- Peaks at ~98% at layer 10, drops to ~70% by layer 40, then fluctuates between 60–80%.
3. **A-Anchored (PopQA)** (dotted orange):
- Begins at ~60%, rises to ~80% at layer 30, then declines to ~50% by layer 80.
4. **A-Anchored (TriviaQA)** (dashed gray):
- Starts at ~50%, peaks at ~70% at layer 20, then drops to ~40% by layer 80.
5. **Q-Anchored (HotpotQA)** (solid purple):
- Starts at ~85%, dips to ~70% at layer 40, then stabilizes near 80%.
6. **Q-Anchored (NQ)** (dotted pink):
- Less volatile than 8B: starts at ~60%, peaks at ~85% at layer 10, then fluctuates between 50–70%.
### Key Observations
1. **Model Size Impact**:
- Llama-3-70B consistently outperforms Llama-3-8B across most methods/datasets.
- Larger model shows smoother accuracy curves with fewer extreme fluctuations.
2. **Method Performance**:
- Q-Anchored methods generally outperform A-Anchored methods.
- NQ dataset shows the most instability, especially in the 8B model.
3. **Layer-Specific Trends**:
- Early layers (0–10) often show peak accuracy for Q-Anchored methods.
- Later layers (20–80) exhibit gradual declines or stabilization.
4. **Dataset Variability**:
- PopQA and HotpotQA (Q-Anchored) show moderate stability, dipping by roughly 15–20 points and then levelling off.
- TriviaQA and NQ exhibit higher volatility, particularly NQ in the 8B model.
### Interpretation
The data suggests that:
- **Model Scale Matters**: The 70B model achieves higher baseline accuracy and more stable performance across layers compared to the 8B model.
- **Q-Anchored Superiority**: Q-Anchored methods consistently outperform A-Anchored counterparts, likely due to better contextual grounding.
- **Dataset Sensitivity**: NQ (Natural Questions) introduces significant instability, possibly due to its open-ended nature or data complexity.
- **Layer Depth Dynamics**: Early layers (0–10) are critical for establishing accuracy, with later layers showing diminishing returns or degradation.
The graphs highlight trade-offs between model size, anchoring strategies, and dataset characteristics. The 70B model’s improved performance suggests that scaling enhances robustness, while Q-Anchored methods provide more reliable accuracy across diverse datasets.
</details>
<details>
<summary>x68.png Details</summary>

### Visual Description
## Line Graphs: Answer Accuracy Across Layers for Mistral-7B Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer model layers (0-30) for two versions of Mistral-7B (v0.1 and v0.3). Each graph shows multiple data series representing different anchoring methods (Q-Anchored/A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The graphs use color-coded lines with shaded confidence intervals.
### Components/Axes
- **Y-Axis**: Answer Accuracy (0-100%) with ticks at 0, 20, 40, 60, 80, 100
- **X-Axis**: Layer (0-30) with ticks at 0, 10, 20, 30
- **Legends**:
- **Left Graph (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted orange: A-Anchored (PopQA)
- Dash-dot red: A-Anchored (TriviaQA)
- Gray shaded area: Confidence intervals
- **Right Graph (v0.3)**:
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: Q-Anchored (NQ)
- Dotted gray: A-Anchored (HotpotQA)
- Dash-dot brown: A-Anchored (NQ)
- Gray shaded area: Confidence intervals
### Detailed Analysis
**Mistral-7B-v0.1 (Left Graph)**
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at ~95% accuracy (Layer 0), drops sharply to ~60% (Layer 5), then stabilizes between 70-85% with minor fluctuations
2. **Q-Anchored (TriviaQA)** (dashed green):
- Begins at ~80%, dips to ~50% (Layer 5), then rises to ~90% (Layer 20) before declining to ~75%
3. **A-Anchored (PopQA)** (dotted orange):
- Starts at ~40%, fluctuates between 30-50% until Layer 15, then stabilizes at ~45%
4. **A-Anchored (TriviaQA)** (dash-dot red):
- Begins at ~30%, rises to ~50% (Layer 10), drops to ~20% (Layer 15), then stabilizes at ~35%
**Mistral-7B-v0.3 (Right Graph)**
1. **Q-Anchored (HotpotQA)** (solid purple):
- Starts at ~70%, drops to ~50% (Layer 5), then rises to ~90% (Layer 15) before declining to ~80%
2. **Q-Anchored (NQ)** (dashed pink):
- Begins at ~60%, fluctuates between 50-70% until Layer 20, then stabilizes at ~65%
3. **A-Anchored (HotpotQA)** (dotted gray):
- Starts at ~35%, rises to ~55% (Layer 10), drops to ~40% (Layer 15), then stabilizes at ~45%
4. **A-Anchored (NQ)** (dash-dot brown):
- Begins at ~25%, rises to ~45% (Layer 10), drops to ~30% (Layer 15), then stabilizes at ~35%
### Key Observations
1. **Version Differences**: v0.3 shows more stable performance (smaller fluctuations) compared to v0.1
2. **Dataset Performance**:
- PopQA consistently shows highest accuracy in v0.1
- NQ shows the lowest accuracy among the series described for v0.3
3. **Anchoring Method Impact**:
- Q-Anchored methods generally outperform A-Anchored in both versions
- TriviaQA shows most dramatic fluctuations in v0.1
4. **Confidence Intervals**: Gray shaded areas indicate ~±5% uncertainty in all versions
### Interpretation
The data suggests that:
1. **Model Version Improvements**: v0.3 demonstrates more consistent layer performance, particularly for Q-Anchored methods
2. **Dataset-Specific Behavior**:
- PopQA's question-answer structure may better align with Q-Anchored methods
- NQ's complex reasoning requirements might challenge both anchoring approaches
3. **Layer Sensitivity**: Early layers (0-10) show highest variability, with performance stabilizing after Layer 15
4. **Anchoring Method Tradeoffs**: While Q-Anchored generally performs better, A-Anchored methods show more consistent mid-range performance
The graphs highlight the importance of anchoring strategy selection based on both model version and target dataset characteristics. The confidence intervals suggest measurement uncertainty, particularly in early layers where performance is most volatile.
</details>
Figure 27: Comparisons of answer accuracy between pathways, probing MLP activations of the final token.
<details>
<summary>x69.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Llama-3.2 Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer model layers for two Llama-3.2 variants (1B and 3B parameter sizes). Each graph shows multiple data series representing different question-answering datasets and anchoring methods. The graphs use color-coded lines with shaded confidence intervals to visualize performance trends.
### Components/Axes
- **X-axis (Layer)**:
- Left chart: 0–15 (Llama-3.2-1B)
- Right chart: 0–25 (Llama-3.2-3B)
- **Y-axis (Answer Accuracy)**: 0–100% scale
- **Legends**:
- Positioned at bottom of both charts
- Line styles/colors:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed gray: A-Anchored (HotpotQA)
- Solid pink: Q-Anchored (NQ)
- Dashed brown: A-Anchored (NQ)
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **Q-Anchored (PopQA)**: Starts at ~90% accuracy (layer 0), dips to ~60% by layer 5, then fluctuates between 50–70%.
- **A-Anchored (PopQA)**: Starts at ~60%, remains relatively stable (50–65%) with minor dips.
- **Q-Anchored (TriviaQA)**: Peaks at ~80% (layer 0), drops sharply to ~40% by layer 5, then stabilizes at 50–60%.
- **A-Anchored (TriviaQA)**: Starts at ~50%, declines to ~30% by layer 5, then fluctuates between 25–40%.
- **Q-Anchored (HotpotQA)**: Begins at ~70%, drops to ~50% by layer 5, then stabilizes at 40–60%.
- **A-Anchored (HotpotQA)**: Starts at ~55%, declines to ~40% by layer 5, then fluctuates between 30–50%.
- **Q-Anchored (NQ)**: Starts at ~65%, dips to ~50% by layer 5, then stabilizes at 45–60%.
- **A-Anchored (NQ)**: Starts at ~50%, declines to ~35% by layer 5, then fluctuates between 30–45%.
#### Llama-3.2-3B (Right Chart)
- **Q-Anchored (PopQA)**: Starts at ~85%, dips to ~60% by layer 5, then fluctuates between 50–75%.
- **A-Anchored (PopQA)**: Starts at ~65%, remains stable (50–70%) with minor dips.
- **Q-Anchored (TriviaQA)**: Peaks at ~80% (layer 0), drops to ~50% by layer 5, then stabilizes at 50–70%.
- **A-Anchored (TriviaQA)**: Starts at ~55%, declines to ~40% by layer 5, then fluctuates between 35–55%.
- **Q-Anchored (HotpotQA)**: Begins at ~75%, drops to ~60% by layer 5, then stabilizes at 50–70%.
- **A-Anchored (HotpotQA)**: Starts at ~60%, declines to ~45% by layer 5, then fluctuates between 40–60%.
- **Q-Anchored (NQ)**: Starts at ~70%, dips to ~60% by layer 5, then stabilizes at 55–70%.
- **A-Anchored (NQ)**: Starts at ~55%, declines to ~40% by layer 5, then fluctuates between 35–55%.
### Key Observations
1. **Model Size Impact**: The 3B model (right chart) shows more pronounced fluctuations in early layers (0–5) but stabilizes better in later layers (15–25) compared to the 1B model.
2. **Q vs. A Anchoring**: Q-Anchored methods outperform A-Anchored methods across the datasets and models shown, often by roughly 10–30 percentage points in the readings above.
3. **Dataset Variability**:
- PopQA shows the most stable performance (lowest variance).
- TriviaQA exhibits the sharpest early-layer drop (Q-Anchored: ~80% to ~40% by layer 5 in the 1B model).
4. **Confidence Intervals**: Shaded regions indicate ~10–15% variability in accuracy measurements across runs.
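The Q-versus-A comparison in observation 2 is a difference in percentage points. As a minimal sketch, the snippet below computes that gap from the approximate layer-0 readings listed under Detailed Analysis. The dictionary name `START_ACCURACY` and the helper `pathway_gaps` are illustrative, and the numbers are eyeballed chart readings rather than measured results.

```python
# Approximate layer-0 accuracies (percent) transcribed from the Detailed Analysis
# above, stored as (Q-Anchored, A-Anchored). These are rough chart readings,
# not measured results.
START_ACCURACY = {
    "Llama-3.2-1B": {
        "PopQA": (90, 60),
        "TriviaQA": (80, 50),
        "HotpotQA": (70, 55),
        "NQ": (65, 50),
    },
    "Llama-3.2-3B": {
        "PopQA": (85, 65),
        "TriviaQA": (80, 55),
        "HotpotQA": (75, 60),
        "NQ": (70, 55),
    },
}


def pathway_gaps(readings):
    """Return Q-Anchored minus A-Anchored accuracy, in percentage points, per dataset."""
    return {dataset: q - a for dataset, (q, a) in readings.items()}


for model, readings in START_ACCURACY.items():
    gaps = pathway_gaps(readings)
    mean_gap = sum(gaps.values()) / len(gaps)
    print(model, gaps, "mean gap:", mean_gap)
```

With these readings, the layer-0 gaps are 15–30 points for the 1B model and 15–25 points for the 3B model.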
### Interpretation
The data suggests that Q-Anchored methods (using question context) generally outperform A-Anchored methods (using answer context) across the datasets and model sizes shown. In both models, Q-Anchored accuracy is highest at layer 0, drops by layer 5, and then settles into a band; the bands are somewhat higher for the 3B model (e.g., NQ: 55–70% vs. 45–60%). The largest early-layer drop in the readings above is Q-Anchored TriviaQA in the 1B model (~80% to ~40%). The consistent Q-Anchored advantage suggests that question context matters more than answer context for answer accuracy in these plots.
</details>
<details>
<summary>x70.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Llama-3-8B and Llama-3-70B Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across layers of the Llama-3-8B and Llama-3-70B models. Each graph displays multiple data series representing different anchoring methods (Q-Anchored and A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis measures answer accuracy (%), while the x-axis represents model layers. The graphs highlight variability in performance across layers and datasets.
---
### Components/Axes
- **Y-Axis**: "Answer Accuracy (%)" with a scale from 0 to 100.
- **X-Axis**: "Layer" with scales:
- Left panel (Llama-3-8B): 0 to 30.
- Right panel (Llama-3-70B): 0 to 80.
- **Legend**: Located at the bottom, with eight entries:
1. **Q-Anchored (PopQA)**: Solid blue line.
2. **A-Anchored (PopQA)**: Dashed orange line.
3. **Q-Anchored (TriviaQA)**: Solid green line.
4. **A-Anchored (TriviaQA)**: Dashed gray line.
5. **Q-Anchored (HotpotQA)**: Solid purple line.
6. **A-Anchored (HotpotQA)**: Dashed red line.
7. **Q-Anchored (NQ)**: Solid pink line.
8. **A-Anchored (NQ)**: Dashed black line.
---
### Detailed Analysis
#### Llama-3-8B (Left Panel)
- **Q-Anchored (PopQA)**: Starts at ~85% accuracy, dips to ~60% by layer 10, then stabilizes around ~70%.
- **A-Anchored (PopQA)**: Begins at ~60%, fluctuates between ~40% and ~70%, ending near ~50%.
- **Q-Anchored (TriviaQA)**: Peaks at ~90% in early layers, drops to ~70% by layer 20, then stabilizes.
- **A-Anchored (TriviaQA)**: Starts at ~70%, declines to ~50% by layer 20, then fluctuates.
- **Q-Anchored (HotpotQA)**: Starts at ~80%, dips to ~60% by layer 10, then rises to ~75%.
- **A-Anchored (HotpotQA)**: Begins at ~60%, fluctuates between ~40% and ~70%, ending near ~55%.
- **Q-Anchored (NQ)**: Starts at ~90%, drops to ~70% by layer 10, then stabilizes.
- **A-Anchored (NQ)**: Begins at ~70%, declines to ~50% by layer 10, then fluctuates.
#### Llama-3-70B (Right Panel)
- **Q-Anchored (PopQA)**: Starts at ~90%, dips to ~70% by layer 20, then stabilizes around ~80%.
- **A-Anchored (PopQA)**: Begins at ~70%, fluctuates between ~50% and ~80%, ending near ~65%.
- **Q-Anchored (TriviaQA)**: Peaks at ~95% in early layers, drops to ~80% by layer 40, then stabilizes.
- **A-Anchored (TriviaQA)**: Starts at ~80%, declines to ~60% by layer 40, then fluctuates.
- **Q-Anchored (HotpotQA)**: Starts at ~85%, dips to ~70% by layer 40, then rises to ~85%.
- **A-Anchored (HotpotQA)**: Begins at ~70%, fluctuates between ~50% and ~80%, ending near ~65%.
- **Q-Anchored (NQ)**: Starts at ~95%, drops to ~80% by layer 40, then stabilizes.
- **A-Anchored (NQ)**: Begins at ~80%, declines to ~60% by layer 40, then fluctuates.
---
### Key Observations
1. **Model Size Impact**: Llama-3-70B generally shows higher baseline accuracy than Llama-3-8B, but both exhibit significant layer-to-layer variability.
2. **Anchoring Method**: Q-Anchored methods consistently outperform A-Anchored across datasets, though performance gaps narrow in later layers.
3. **Dataset Variability**:
- **PopQA** and **TriviaQA** show the most pronounced accuracy drops in early layers.
- **HotpotQA** and **NQ** exhibit more stable trends but still experience fluctuations.
4. **Layer-Specific Trends**:
- Early layers (0–10) often show higher accuracy, followed by a decline or stabilization.
- Later layers (20–30/80) display increased variability, suggesting potential overfitting or model complexity issues.
---
### Interpretation
The data suggests that **Q-Anchored methods** (e.g., on PopQA and TriviaQA) generally yield higher answer accuracy than A-Anchored methods, particularly in early layers. The **Llama-3-70B model** shows greater overall stability and higher baseline performance than the 8B variant, though its accuracy still fluctuates significantly across layers.
The **dataset-specific trends** indicate that some datasets (e.g., PopQA, TriviaQA) are more sensitive to layer depth, while others (e.g., HotpotQA, NQ) show more consistent performance. The **Q-Anchored approach** appears to mitigate accuracy drops in later layers, possibly due to better alignment with query structures.
Notably, the **A-Anchored methods** (e.g., A-Anchored PopQA) exhibit the most erratic trends, suggesting that anchoring to answers (A-Anchored) may introduce instability compared to query-based anchoring (Q-Anchored). This could imply that query-focused anchoring strategies are more robust for maintaining accuracy across model layers.
The **model size** (8B vs. 70B) does not guarantee consistent performance improvements, as the 70B model still shows layer-specific variability. This highlights the importance of architectural and methodological choices (e.g., anchoring) over sheer model size alone.
</details>
<details>
<summary>x71.png Details</summary>

### Visual Description
## Line Graphs: Answer Accuracy Across Layers for Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer model layers (0–30) for two versions of the Mistral-7B model (v0.1 and v0.3). Each graph includes multiple data series representing different anchoring methods (Q-Anchored and A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ); the legends described below list four series per graph. The graphs use color-coded lines with shaded confidence intervals.
---
### Components/Axes
- **Y-Axis**: Answer Accuracy (%)
- Range: 0–100%
- Label: "Answer Accuracy"
- **X-Axis**: Layer
- Range: 0–30
- Label: "Layer"
- **Legends**:
- **Left Graph (v0.1)**:
- Q-Anchored (PopQA): Solid blue
- A-Anchored (PopQA): Dashed orange
- Q-Anchored (TriviaQA): Solid green
- A-Anchored (TriviaQA): Dashed brown
- **Right Graph (v0.3)**:
- Q-Anchored (HotpotQA): Solid purple
- A-Anchored (HotpotQA): Dashed gray
- Q-Anchored (NQ): Solid pink
- A-Anchored (NQ): Dashed red
---
### Detailed Analysis
#### Left Graph (Mistral-7B-v0.1)
1. **Q-Anchored (PopQA)** (Solid Blue):
- Starts at ~80% accuracy at layer 0, drops sharply to ~20% by layer 5, then fluctuates between 30–70% with peaks at layers 10, 15, and 25.
2. **A-Anchored (PopQA)** (Dashed Orange):
- Starts at ~60%, dips to ~40% by layer 10, then stabilizes between 40–60% with minor oscillations.
3. **Q-Anchored (TriviaQA)** (Solid Green):
- Begins at ~70%, plunges to ~10% by layer 5, then oscillates between 20–60% with a peak at layer 20.
4. **A-Anchored (TriviaQA)** (Dashed Brown):
- Starts at ~50%, drops to ~30% by layer 10, then fluctuates between 30–50% with a peak at layer 25.
#### Right Graph (Mistral-7B-v0.3)
1. **Q-Anchored (HotpotQA)** (Solid Purple):
- Starts at ~70%, peaks at ~90% by layer 10, then declines to ~60% by layer 30 with minor fluctuations.
2. **A-Anchored (HotpotQA)** (Dashed Gray):
- Starts at ~50%, rises to ~70% by layer 15, then stabilizes between 60–70% with slight dips.
3. **Q-Anchored (NQ)** (Solid Pink):
- Begins at ~60%, drops to ~40% by layer 10, then fluctuates between 30–60% with a peak at layer 25.
4. **A-Anchored (NQ)** (Dashed Red):
- Starts at ~40%, rises to ~60% by layer 20, then declines to ~40% by layer 30 with oscillations.
---
### Key Observations
1. **Model Version Differences**:
- v0.3 shows smoother trends and higher overall accuracy compared to v0.1, which exhibits sharper fluctuations.
2. **Dataset-Specific Performance**:
- **HotpotQA** (v0.3) achieves the highest peak accuracy (~90%) among all datasets.
- **NQ** (v0.3) shows the most erratic behavior, with a sharp drop at layer 10.
3. **Anchoring Method Trends**:
- Q-Anchored methods generally outperform A-Anchored in v0.3 but underperform in v0.1 for PopQA and TriviaQA.
- A-Anchored methods in v0.1 (e.g., PopQA) exhibit more stability but lower peaks.
---
### Interpretation
The data suggests that model version v0.3 improves stability and accuracy across layers compared to v0.1. Q-Anchored methods perform better for HotpotQA and NQ in v0.3, while A-Anchored methods show resilience in v0.1 for PopQA and TriviaQA. The sharp dips in v0.1 (e.g., Q-Anchored TriviaQA at layer 5) may indicate architectural instability in early layers, whereas v0.3’s smoother curves suggest refined training or architecture. The dataset-specific performance highlights the importance of anchoring strategies tailored to question types (e.g., HotpotQA’s reliance on Q-Anchored methods).
</details>
Figure 28: Comparisons of answer accuracy between pathways, probing MLP activations of the token immediately preceding the exact answer tokens.
<details>
<summary>x72.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Llama-3.2 Models
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across transformer model layers for two Llama-3.2 variants (1B and 3B parameter sizes). Each graph shows multiple data series representing different question/answer anchoring methods across four datasets (PopQA, TriviaQA, HotpotQA, NQ). The graphs use color-coded lines with shaded confidence intervals to visualize performance trends.
### Components/Axes
- **X-axis (Layer)**:
- Left chart: 0–15 (Llama-3.2-1B)
- Right chart: 0–25 (Llama-3.2-3B)
- Discrete integer increments
- **Y-axis (Answer Accuracy)**:
- Range: 0–100% (linear scale)
- Labeled "Answer Accuracy"
- **Legends**:
- Positioned at bottom of each chart
- Color-coded line styles:
- Solid lines: Q-Anchored methods
- Dashed lines: A-Anchored methods
- Dataset labels:
- PopQA (blue/orange)
- TriviaQA (green/red)
- HotpotQA (purple/gray)
- NQ (pink/black)
### Detailed Analysis
#### Llama-3.2-1B (Left Chart)
- **Q-Anchored (PopQA)**:
- Blue solid line
- Starts at ~90% accuracy (Layer 0)
- Sharp drop to ~40% at Layer 5
- Recovers to ~80% by Layer 15
- Confidence interval: ±5% (shaded blue)
- **A-Anchored (PopQA)**:
- Orange dashed line
- Starts at ~50% accuracy
- Dips to ~30% at Layer 5
- Recovers to ~55% by Layer 15
- Confidence interval: ±4%
- **Q-Anchored (TriviaQA)**:
- Green solid line
- Starts at ~70%
- Drops to ~50% at Layer 5
- Peaks at ~85% by Layer 15
- Confidence interval: ±6%
- **A-Anchored (TriviaQA)**:
- Red dashed line
- Starts at ~40%
- Dips to ~25% at Layer 5
- Recovers to ~50% by Layer 15
- Confidence interval: ±5%
- **Q-Anchored (HotpotQA)**:
- Purple solid line
- Starts at ~60%
- Drops to ~40% at Layer 5
- Peaks at ~75% by Layer 15
- Confidence interval: ±7%
- **A-Anchored (HotpotQA)**:
- Gray dashed line
- Starts at ~35%
- Dips to ~20% at Layer 5
- Recovers to ~45% by Layer 15
- Confidence interval: ±6%
- **Q-Anchored (NQ)**:
- Pink solid line
- Starts at ~80%
- Drops to ~60% at Layer 5
- Peaks at ~90% by Layer 15
- Confidence interval: ±8%
- **A-Anchored (NQ)**:
- Black dashed line
- Starts at ~55%
- Dips to ~40% at Layer 5
- Recovers to ~65% by Layer 15
- Confidence interval: ±7%
#### Llama-3.2-3B (Right Chart)
- **Q-Anchored (PopQA)**:
- Blue solid line
- Starts at ~85%
- Drops to ~60% at Layer 10
- Peaks at ~95% by Layer 25
- Confidence interval: ±4%
- **A-Anchored (PopQA)**:
- Orange dashed line
- Starts at ~55%
- Dips to ~40% at Layer 10
- Recovers to ~65% by Layer 25
- Confidence interval: ±5%
- **Q-Anchored (TriviaQA)**:
- Green solid line
- Starts at ~75%
- Drops to ~55% at Layer 10
- Peaks at ~90% by Layer 25
- Confidence interval: ±5%
- **A-Anchored (TriviaQA)**:
- Red dashed line
- Starts at ~45%
- Dips to ~30% at Layer 10
- Recovers to ~60% by Layer 25
- Confidence interval: ±6%
- **Q-Anchored (HotpotQA)**:
- Purple solid line
- Starts at ~65%
- Drops to ~45% at Layer 10
- Peaks at ~85% by Layer 25
- Confidence interval: ±6%
- **A-Anchored (HotpotQA)**:
- Gray dashed line
- Starts at ~30%
- Dips to ~20% at Layer 10
- Recovers to ~50% by Layer 25
- Confidence interval: ±7%
- **Q-Anchored (NQ)**:
- Pink solid line
- Starts at ~85%
- Drops to ~65% at Layer 10
- Peaks at ~95% by Layer 25
- Confidence interval: ±7%
- **A-Anchored (NQ)**:
- Black dashed line
- Starts at ~60%
- Dips to ~45% at Layer 10
- Recovers to ~75% by Layer 25
- Confidence interval: ±8%
### Key Observations
1. **Model Size Impact**:
- 3B model shows more stable performance at deeper layers (Layer 25) compared to 1B model
- 1B model exhibits sharper accuracy drops in early layers (Layer 5)
2. **Q vs A Anchoring**:
- Q-Anchored methods consistently outperform A-Anchored across all datasets
- Performance gap widens in deeper layers (Layer 15–25)
3. **Dataset Variability**:
- NQ shows highest baseline accuracy (80–95%)
- NQ exhibits the largest confidence intervals (±7–8%), followed by HotpotQA (±6–7%)
4. **Layer Depth Trends**:
- All methods show U-shaped curves (initial drop, mid-layer recovery, final peak)
- 3B model maintains higher accuracy in later layers (Layer 20–25)
### Interpretation
The data suggests that Q-Anchored methods (question-focused anchoring) outperform A-Anchored methods (answer-focused anchoring) across all datasets and model sizes. The 3B model demonstrates better layer-wise stability, maintaining higher accuracy in deeper layers compared to the 1B variant. The U-shaped performance curves indicate that early layers struggle with context integration, while deeper layers achieve better reasoning capabilities. The largest performance gaps appear in complex datasets like HotpotQA, suggesting that question anchoring is particularly critical for multi-hop reasoning tasks. The confidence intervals highlight significant variability in model performance, especially in the 1B model, which may reflect limited capacity to handle diverse question types.
</details>
<details>
<summary>x73.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers for Llama-3-8B and Llama-3-70B Models
### Overview
The image compares answer accuracy across layers (0–30 for Llama-3-8B, 0–80 for Llama-3-70B) for two model sizes. It evaluates four datasets (PopQA, TriviaQA, HotpotQA, NQ) using two anchoring methods: Q-Anchored (question-focused) and A-Anchored (answer-focused). Accuracy is measured as a percentage (0–100%).
### Components/Axes
- **X-axis**: Layer (0–30 for Llama-3-8B, 0–80 for Llama-3-70B).
- **Y-axis**: Answer Accuracy (%) (0–100).
- **Legends**:
- **Llama-3-8B**:
- Blue: Q-Anchored (PopQA)
- Green: Q-Anchored (TriviaQA)
- Orange: A-Anchored (PopQA)
- Red: A-Anchored (TriviaQA)
- **Llama-3-70B**:
- Purple: Q-Anchored (HotpotQA)
- Pink: Q-Anchored (NQ)
- Gray: A-Anchored (HotpotQA)
- Brown: A-Anchored (NQ)
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **Q-Anchored (PopQA, Blue)**:
- Starts at ~80% (layer 0), dips to ~60% (layer 10), peaks at ~90% (layer 20), then stabilizes at ~85% (layer 30).
- **Q-Anchored (TriviaQA, Green)**:
- Begins at ~70%, rises to ~85% (layer 10), fluctuates between ~75–85% (layers 20–30).
- **A-Anchored (PopQA, Orange)**:
- Starts at ~40%, drops to ~20% (layer 10), recovers to ~50% (layer 30).
- **A-Anchored (TriviaQA, Red)**:
- Begins at ~50%, dips to ~30% (layer 10), rises to ~60% (layer 30).
#### Llama-3-70B (Right Chart)
- **Q-Anchored (HotpotQA, Purple)**:
- Starts at ~85%, fluctuates between ~70–90% (layers 0–40), stabilizes at ~80% (layers 60–80).
- **Q-Anchored (NQ, Pink)**:
- Begins at ~75%, dips to ~60% (layer 20), rises to ~85% (layer 60), then drops to ~70% (layer 80).
- **A-Anchored (HotpotQA, Gray)**:
- Starts at ~50%, fluctuates between ~40–60% (layers 0–60), stabilizes at ~55% (layers 60–80).
- **A-Anchored (NQ, Brown)**:
- Begins at ~40%, dips to ~30% (layer 20), rises to ~50% (layer 60), then drops to ~45% (layer 80).
### Key Observations
1. **Model Size Impact**:
- Llama-3-70B shows smoother trends and higher baseline accuracy (e.g., Q-Anchored HotpotQA peaks at ~90%) compared to Llama-3-8B.
2. **Anchoring Method**:
- Q-Anchored methods generally outperform A-Anchored (e.g., Q-Anchored PopQA in Llama-3-8B reaches ~90% vs. A-Anchored PopQA at ~50%).
3. **Dataset Variability**:
- NQ dataset exhibits the most erratic trends (e.g., Q-Anchored NQ in Llama-3-70B drops from ~85% to ~70% across layers).
4. **Layer Sensitivity**:
- Both models show accuracy dips in mid-layers (e.g., layer 10–20 for Llama-3-8B), suggesting potential architectural bottlenecks.
### Interpretation
The data suggests that larger models (Llama-3-70B) maintain higher and more stable accuracy across layers, particularly with Q-Anchored methods. Q-Anchored approaches consistently outperform A-Anchored, likely due to better alignment with question context. The NQ dataset’s volatility may reflect its complexity or noise. Mid-layer dips in both models hint at architectural trade-offs, where certain layers prioritize efficiency over accuracy. The 70B model’s smoother curves imply better generalization, while the 8B model’s sharper fluctuations suggest sensitivity to layer depth.
</details>
<details>
<summary>x74.png Details</summary>

### Visual Description
## Line Graph: Answer Accuracy Across Layers in Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two side-by-side line graphs comparing answer accuracy across 30 layers of the Mistral-7B model (versions v0.1 and v0.3). Each graph tracks six data series that combine the two anchoring methods (Q-Anchored and A-Anchored) with four datasets (PopQA, TriviaQA, HotpotQA, NQ). Accuracy is measured on a 0-100 scale, with shaded regions indicating confidence intervals.
### Components/Axes
- **X-axis**: Layers (0-30, integer increments)
- **Y-axis**: Answer Accuracy (0-100, percentage scale)
- **Legends**:
- **v0.1 Chart**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted purple: Q-Anchored (HotpotQA)
- Solid red: A-Anchored (PopQA)
- Dashed orange: A-Anchored (TriviaQA)
- Dotted gray: A-Anchored (NQ)
- **v0.3 Chart**:
- Same legend entries as v0.1, with updated line patterns/colors
### Detailed Analysis
#### v0.1 Chart Trends
1. **Q-Anchored Methods**:
- PopQA (solid blue): Peaks at ~95 (layer 0), drops to ~60 (layer 10), stabilizes at ~80 (layer 30)
- TriviaQA (dashed green): Starts at ~85, dips to ~50 (layer 10), recovers to ~75
- HotpotQA (dotted purple): Begins at ~70, fluctuates between 50-80, ends at ~70
2. **A-Anchored Methods**:
- PopQA (solid red): Starts at ~40, plummets to ~20 (layer 10), rises to ~30
- TriviaQA (dashed orange): Begins at ~35, drops to ~15 (layer 10), recovers to ~25
- NQ (dotted gray): Starts at ~25, dips to ~10 (layer 10), ends at ~20
#### v0.3 Chart Trends
1. **Q-Anchored Methods**:
- PopQA (solid blue): Starts at ~90, dips to ~70 (layer 10), stabilizes at ~95
- TriviaQA (dashed green): Begins at ~80, drops to ~60 (layer 10), recovers to ~85
- HotpotQA (dotted purple): Starts at ~65, fluctuates between 50-75, ends at ~70
2. **A-Anchored Methods**:
- PopQA (solid red): Starts at ~30, drops to ~10 (layer 10), rises to ~25
- TriviaQA (dashed orange): Begins at ~25, plummets to ~5 (layer 10), recovers to ~20
- NQ (dotted gray): Starts at ~15, dips to ~5 (layer 10), ends at ~10
### Key Observations
1. **Version Comparison**:
- v0.3 shows higher final-layer accuracy for Q-Anchored methods on PopQA and TriviaQA (e.g., PopQA rises from ~80 in v0.1 to ~95 in v0.3)
- A-Anchored methods in v0.3 reach lower troughs at layer 10 than in v0.1, with slightly stronger recovery for PopQA and TriviaQA
2. **Dataset Performance**:
- PopQA consistently outperforms other datasets in both versions
- NQ (Natural Questions) shows the most volatility across all methods
3. **Layer-Specific Patterns**:
- Layer 10 consistently marks a performance trough for all methods
- v0.3 demonstrates sharper recovery post-layer 10 compared to v0.1
### Interpretation
The data suggests:
1. **Q-Anchored Superiority**: Question-context anchoring (Q-Anchored) consistently outperforms answer-context anchoring (A-Anchored), by roughly 35–70 percentage points on the datasets where both are reported (PopQA and TriviaQA)
2. **Version Improvements**: v0.3's architectural changes likely enhanced layer-wise context retention, particularly for Q-Anchored methods
3. **Dataset Sensitivity**: HotpotQA (multi-hop QA) shows the most pronounced layer-dependent performance variations, indicating challenges with complex reasoning tasks
4. **Confidence Intervals**: The shaded regions reveal that v0.3 methods have tighter confidence bounds, suggesting more robust training
Notable anomalies include the extreme volatility of the NQ series (e.g., v0.1 A-Anchored NQ drops from 25→10→20) and the recovery pattern of the TriviaQA series, which may indicate dataset-specific optimization opportunities.
</details>
Figure 29: Comparisons of answer accuracy between pathways, probing MLP activations of the last exact answer token.
Appendix G I-Don’t-Know Rate
<details>
<summary>x75.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Llama Model Layers for Different QA Datasets
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate (%)" across transformer model layers for two Llama architectures (Llama-3.2-1B and Llama-3.2-3B). Each graph includes six data series representing different question-answering (QA) datasets and anchoring methods (Q-Anchored vs. A-Anchored). The graphs show significant variability in I-Don't-Know rates across layers, with overlapping confidence intervals (shaded regions) indicating uncertainty.
### Components/Axes
- **X-Axis (Layer)**:
- Llama-3.2-1B: Layers 0–15 (discrete increments).
- Llama-3.2-3B: Layers 0–25 (discrete increments).
- **Y-Axis (I-Don't-Know Rate)**:
- Scale: 0% to 100% (linear).
- **Legends**:
- **Llama-3.2-1B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted orange: A-Anchored (PopQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: Q-Anchored (NQ)
- **Llama-3.2-3B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted orange: A-Anchored (PopQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: Q-Anchored (NQ)
### Detailed Analysis
#### Llama-3.2-1B
- **Q-Anchored (PopQA)**:
- Starts at ~80% in Layer 0, drops sharply to ~20% by Layer 5, then fluctuates between ~30–60% until Layer 15.
- Confidence interval (shaded blue) widens significantly after Layer 5.
- **A-Anchored (PopQA)**:
- Relatively stable, hovering between ~50–70% with minimal variation.
- **Q-Anchored (TriviaQA)**:
- Peaks at ~90% in Layer 0, drops to ~40% by Layer 5, then oscillates between ~30–70%.
- **A-Anchored (TriviaQA)**:
- Starts at ~60%, dips to ~40% by Layer 5, then stabilizes around ~50–60%.
- **Q-Anchored (HotpotQA)**:
- Sharp decline from ~100% in Layer 0 to ~10% by Layer 5, followed by erratic fluctuations.
- **Q-Anchored (NQ)**:
- Begins at ~70%, drops to ~30% by Layer 5, then fluctuates between ~20–60%.
#### Llama-3.2-3B
- **Q-Anchored (PopQA)**:
- Starts at ~90%, drops to ~30% by Layer 5, then fluctuates between ~20–70% with increasing volatility.
- **A-Anchored (PopQA)**:
- Stable between ~60–80%, with slight upward trend after Layer 10.
- **Q-Anchored (TriviaQA)**:
- Peaks at ~85% in Layer 0, drops to ~20% by Layer 5, then rises to ~70% by Layer 25.
- **A-Anchored (TriviaQA)**:
- Starts at ~50%, dips to ~30% by Layer 5, then stabilizes around ~40–60%.
- **Q-Anchored (HotpotQA)**:
- Sharp decline from ~100% in Layer 0 to ~5% by Layer 5, followed by erratic spikes (e.g., ~40% at Layer 15).
- **Q-Anchored (NQ)**:
- Begins at ~80%, drops to ~10% by Layer 5, then fluctuates between ~10–60%.
### Key Observations
1. **Layer-Specific Variability**:
- Early layers (0–5) exhibit extreme I-Don't-Know rates (often >50%), while later layers show more moderate values.
- Q-Anchored series generally show sharper declines in early layers compared to A-Anchored series.
2. **Model Size Differences**:
- Llama-3.2-3B demonstrates greater variability in later layers (e.g., Layer 25) compared to Llama-3.2-1B.
3. **Dataset-Specific Trends**:
- HotpotQA consistently shows the highest initial I-Don't-Know rates, dropping sharply in early layers.
- NQ datasets exhibit the most erratic fluctuations across layers.
4. **Anchoring Method Impact**:
- A-Anchored series (PopQA, TriviaQA) display smoother trends, suggesting better layer-wise generalization.
### Interpretation
The data suggests that anchoring methods (Q vs. A) significantly influence the I-Don't-Know rates across transformer layers. Q-Anchored series exhibit higher variability and sharper declines in early layers, potentially indicating over-reliance on specific training patterns. A-Anchored series show more stable performance, implying better generalization. The Llama-3.2-3B model’s increased layer count correlates with heightened variability in later layers, possibly due to architectural complexity. Dataset-specific behaviors (e.g., HotpotQA’s extreme early-layer drops) highlight differences in training data complexity. These trends underscore the importance of anchoring strategies in mitigating model uncertainty during inference.
</details>
<details>
<summary>x76.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Llama-3 Models and Datasets
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate" (percentage of unanswered questions) across layers of two Llama-3 language models: Llama-3-8B (left) and Llama-3-70B (right). Each graph shows eight data series representing different question-answering datasets (PopQA, TriviaQA, HotpotQA, NQ) and anchoring methods (Q-Anchored vs. A-Anchored). The graphs reveal layer-specific performance variations and dataset-dependent trends.
### Components/Axes
- **X-axis (Layer)**:
- Llama-3-8B: 0–30 layers (discrete increments)
- Llama-3-70B: 0–80 layers (discrete increments)
- **Y-axis (I-Don't-Know Rate)**: 0–100% (continuous scale)
- **Legend**:
- **Q-Anchored (PopQA)**: Solid blue line
- **A-Anchored (PopQA)**: Dashed orange line
- **Q-Anchored (TriviaQA)**: Dotted green line
- **A-Anchored (TriviaQA)**: Dash-dot purple line
- **Q-Anchored (HotpotQA)**: Solid pink line
- **A-Anchored (HotpotQA)**: Dashed gray line
- **Q-Anchored (NQ)**: Dotted magenta line
- **A-Anchored (NQ)**: Dash-dot cyan line
### Detailed Analysis
#### Llama-3-8B (Left Graph)
- **Q-Anchored (PopQA)**: Starts at ~80% in Layer 0, sharply declines to ~20% by Layer 10, then fluctuates between 10–30%.
- **A-Anchored (PopQA)**: Begins at ~60%, stabilizes around 40–60% with minor oscillations.
- **Q-Anchored (TriviaQA)**: Peaks at ~70% in Layer 0, drops to ~30% by Layer 10, then stabilizes at 20–40%.
- **A-Anchored (TriviaQA)**: Remains relatively flat (~50–70%) with slight dips.
- **Q-Anchored (HotpotQA)**: Starts at ~90%, plunges to ~10% by Layer 10, then oscillates between 5–25%.
- **A-Anchored (HotpotQA)**: Starts at ~70%, declines to ~30% by Layer 10, then stabilizes at 20–40%.
- **Q-Anchored (NQ)**: Begins at ~50%, drops to ~10% by Layer 10, then fluctuates between 5–15%.
- **A-Anchored (NQ)**: Starts at ~30%, declines to ~10% by Layer 10, then stabilizes at 5–10%.
#### Llama-3-70B (Right Graph)
- **Q-Anchored (PopQA)**: Starts at ~70%, declines to ~30% by Layer 20, then fluctuates between 20–50%.
- **A-Anchored (PopQA)**: Begins at ~50%, stabilizes around 30–50% with minor oscillations.
- **Q-Anchored (TriviaQA)**: Peaks at ~80% in Layer 0, drops to ~40% by Layer 20, then stabilizes at 30–50%.
- **A-Anchored (TriviaQA)**: Remains flat (~60–80%) with slight dips.
- **Q-Anchored (HotpotQA)**: Starts at ~95%, plunges to ~20% by Layer 20, then oscillates between 10–30%.
- **A-Anchored (HotpotQA)**: Starts at ~60%, declines to ~20% by Layer 20, then stabilizes at 10–25%.
- **Q-Anchored (NQ)**: Begins at ~60%, drops to ~20% by Layer 20, then fluctuates between 10–25%.
- **A-Anchored (NQ)**: Starts at ~40%, declines to ~10% by Layer 20, then stabilizes at 5–10%.
### Key Observations
1. **Model Size Impact**: Llama-3-70B shows more pronounced fluctuations in I-Don't-Know rates compared to Llama-3-8B, especially in early layers.
2. **Anchoring Method**:
- Q-Anchored methods generally exhibit higher initial rates but sharper declines.
- A-Anchored methods maintain more stable rates across layers.
3. **Dataset Sensitivity**:
- HotpotQA (complex reasoning) shows the highest initial rates and steepest declines.
- NQ (simple QA) demonstrates the most consistent performance improvements.
4. **Layer Dynamics**:
- Early layers (0–10) show rapid rate reductions for Q-Anchored methods.
- Later layers (10–30/80) exhibit stabilization or minor oscillations.
### Interpretation
The data suggests that anchoring methods significantly influence model performance:
- **Q-Anchored** approaches (question-focused) may initially struggle with complex datasets (e.g., HotpotQA) but improve rapidly as layers progress.
- **A-Anchored** methods (answer-focused) maintain steadier performance, possibly due to better generalization across layers.
- Larger models (70B) exhibit greater sensitivity to dataset complexity, with more volatile rates in early layers. This could indicate architectural differences in handling nuanced reasoning tasks.
- The NQ dataset’s consistent low rates across both models suggest it aligns well with the models’ training objectives, while HotpotQA highlights challenges in multi-hop reasoning.
The trends imply that anchoring strategy and model scale interact to shape performance, with practical implications for optimizing QA systems in resource-constrained scenarios.
</details>
<details>
<summary>x77.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Mistral-7B Model Versions and Anchoring Methods
### Overview
The image contains two side-by-side line graphs comparing the "I-Don't-Know Rate" across 30 layers of the Mistral-7B model (versions v0.1 and v0.3). Each graph tracks multiple data series representing different anchoring methods (Q-Anchored and A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis measures the I-Don't-Know Rate (0-100%), while the x-axis represents model layers (0-30). Shaded regions around lines indicate variability/confidence intervals.
### Components/Axes
- **X-axis (Layer)**: 0 to 30 (integer increments)
- **Y-axis (I-Don't-Know Rate)**: 0% to 100% (linear scale)
- **Legends**:
- **Left Graph (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted orange: A-Anchored (PopQA)
- Dashed red: A-Anchored (TriviaQA)
- Gray shaded area: Overall variability
- **Right Graph (v0.3)**:
- Solid purple: Q-Anchored (HotpotQA)
- Dashed pink: Q-Anchored (NQ)
- Dotted orange: A-Anchored (HotpotQA)
- Gray shaded area: Overall variability
### Detailed Analysis
#### Left Graph (Mistral-7B-v0.1)
1. **Q-Anchored (PopQA)** (blue):
- Starts at ~80% at layer 0, drops sharply to ~20% by layer 10, then fluctuates between 30-60% with peaks at layers 15 (~50%) and 25 (~70%).
2. **Q-Anchored (TriviaQA)** (green):
- Begins at ~60%, dips to ~10% at layer 10, then oscillates between 20-50% with a peak at layer 20 (~60%).
3. **A-Anchored (PopQA)** (orange):
- Starts at ~50%, rises to ~90% at layer 5, then declines to ~30% by layer 30 with minor fluctuations.
4. **A-Anchored (TriviaQA)** (red):
- Begins at ~70%, drops to ~20% at layer 10, then rises to ~80% at layer 20 before stabilizing near 60%.
#### Right Graph (Mistral-7B-v0.3)
1. **Q-Anchored (HotpotQA)** (purple):
- Starts at ~90%, plunges to ~10% at layer 10, then fluctuates between 30-70% with a peak at layer 25 (~80%).
2. **Q-Anchored (NQ)** (pink):
- Begins at ~70%, drops to ~20% at layer 10, then rises to ~60% at layer 20 before declining to ~40%.
3. **A-Anchored (HotpotQA)** (orange):
- Starts at ~60%, peaks at ~95% at layer 5, then declines to ~40% by layer 30 with sharp dips at layers 15 (~20%) and 25 (~30%).
### Key Observations
1. **Version Differences**:
- v0.3 shows higher variability in Q-Anchored (HotpotQA) and A-Anchored (HotpotQA) compared to v0.1.
- v0.1's A-Anchored (PopQA) peaks at ~90% at layer 5, slightly below the ~95% peak of v0.3's A-Anchored (HotpotQA).
2. **Dataset Impact**:
- HotpotQA and NQ datasets exhibit sharper drops in I-Don't-Know rates at early layers (layers 5-10).
- TriviaQA and PopQA show more gradual declines.
3. **Anchoring Method Trends**:
- Q-Anchored methods generally show steeper initial declines than A-Anchored methods.
- A-Anchored methods (e.g., PopQA, TriviaQA) maintain higher rates in later layers (20-30).
### Interpretation
The data suggests that anchoring methods (Q vs. A) and datasets significantly influence the model's uncertainty distribution across layers. Q-Anchored methods (e.g., HotpotQA in v0.3) demonstrate more pronounced early-layer drops in I-Don't-Know rates, potentially indicating stronger initial confidence. A-Anchored methods (e.g., PopQA in v0.1) exhibit higher variability in later layers, suggesting persistent uncertainty. The shaded regions highlight model instability, with v0.3 showing broader confidence intervals than v0.1. These patterns may reflect architectural changes between versions or dataset-specific challenges in knowledge representation.
</details>
Figure 30: Comparisons of I-don’t-know rate between pathways, probing attention activations of the final token.
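As a rough illustration of the metric in these figures, the sketch below computes an I-don't-know rate per pathway group, assuming the rate is measured as the percentage of responses in which the model declines to answer. The refusal markers, the `is_refusal` and `idk_rate` helpers, and the toy responses are all hypothetical and chosen for illustration; they are not taken from the paper's evaluation setup.

```python
from collections import defaultdict

# Hypothetical refusal markers (assumption); the phrase list used in the
# paper's evaluation is not specified in this section.
REFUSAL_MARKERS = ("i don't know", "i do not know", "i'm not sure", "i am not sure")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    # Normalize curly apostrophes so both quote styles match the markers.
    text = response.lower().replace("’", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def idk_rate(samples):
    """Compute the percentage of refusals per group.

    `samples` is an iterable of (group, response) pairs, where `group`
    is a label such as "Q-Anchored" or "A-Anchored".
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for group, response in samples:
        totals[group] += 1
        if is_refusal(response):
            refusals[group] += 1
    return {group: 100.0 * refusals[group] / totals[group] for group in totals}


# Toy data for illustration only.
toy = [
    ("Q-Anchored", "Paris"),
    ("Q-Anchored", "I don't know."),
    ("A-Anchored", "I’m not sure about that."),
    ("A-Anchored", "I do not know the answer."),
]
rates = idk_rate(toy)
```

Simple substring matching is used here only to keep the sketch short; a real evaluation would likely need a more careful refusal detector.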
<details>
<summary>x78.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate Across Llama-3.2 Models and Layers
### Overview
The image contains two line charts comparing the "I-Don't-Know Rate" (percentage of instances where models failed to answer) across different layers of the Llama-3.2-1B and Llama-3.2-3B models. Each chart includes multiple data series representing Q-Anchored and A-Anchored models trained on various datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis ranges from 0% to 100%, and the x-axis represents layer numbers (0–15 for 1B, 0–25 for 3B).
### Components/Axes
- **Y-Axis**: "I-Don't-Know Rate" (%), labeled vertically with ticks at 0, 20, 40, 60, 80, 100.
- **X-Axis**: "Layer" (layer numbers), labeled horizontally with ticks at 0, 5, 10, 15 (for 1B) and 0, 5, 10, 15, 20, 25 (for 3B).
- **Legend**: Located at the bottom, with six data series:
- **Q-Anchored (PopQA)**: Solid blue line.
- **A-Anchored (PopQA)**: Dashed orange line.
- **Q-Anchored (TriviaQA)**: Solid green line.
- **A-Anchored (TriviaQA)**: Dashed brown line.
- **Q-Anchored (HotpotQA)**: Solid purple line.
- **A-Anchored (NQ)**: Dashed pink line.
### Detailed Analysis
#### Llama-3.2-1B Chart
- **Q-Anchored (PopQA)**: Starts at ~80% in layer 0, fluctuates sharply, peaking at ~90% in layer 5, then drops to ~40% by layer 15.
- **A-Anchored (PopQA)**: Starts at ~60%, remains relatively stable (~50–70%) with minor fluctuations.
- **Q-Anchored (TriviaQA)**: Begins at ~70%, dips to ~30% in layer 5, then rises to ~60% by layer 15.
- **A-Anchored (TriviaQA)**: Starts at ~50%, fluctuates between ~40–60% with a peak at ~70% in layer 10.
- **Q-Anchored (HotpotQA)**: Starts at ~85%, drops to ~30% in layer 5, then rises to ~70% by layer 15.
- **A-Anchored (NQ)**: Starts at ~50%, remains stable (~40–60%) with a sharp drop to ~20% in layer 15.
#### Llama-3.2-3B Chart
- **Q-Anchored (PopQA)**: Starts at ~90%, fluctuates wildly, peaking at ~100% in layer 5, then drops to ~30% by layer 25.
- **A-Anchored (PopQA)**: Starts at ~60%, remains stable (~50–70%) with minor fluctuations.
- **Q-Anchored (TriviaQA)**: Begins at ~80%, dips to ~20% in layer 5, then rises to ~70% by layer 25.
- **A-Anchored (TriviaQA)**: Starts at ~50%, fluctuates between ~40–60% with a peak at ~70% in layer 10.
- **Q-Anchored (HotpotQA)**: Starts at ~95%, drops to ~20% in layer 5, then rises to ~80% by layer 25.
- **A-Anchored (NQ)**: Starts at ~50%, remains stable (~40–60%) with a sharp drop to ~10% in layer 25.
### Key Observations
1. **Model Size Differences**:
- The 3B model exhibits more extreme fluctuations (e.g., Q-Anchored lines reach 100% in layer 5) compared to the 1B model.
- The 1B model shows smoother trends, with fewer extreme peaks/troughs.
2. **Dataset Performance**:
- **NQ (A-Anchored)**: Consistently lower I-Don't-Know rates across layers, suggesting better generalization.
- **HotpotQA (Q-Anchored)**: High variability in the 3B model, with sharp drops and spikes.
3. **Layer-Specific Trends**:
- Early layers (0–5) show higher I-Don't-Know rates for the Q-Anchored series, possibly reflecting the limited processing depth available at those layers.
- Later layers (15–25) for the 3B model exhibit recovery in performance, though with persistent volatility.
### Interpretation
The data suggests that the **Q-Anchored pathway** (which depends on question–answer information flow) is more sensitive to layer depth, showing higher variability and extreme I-Don't-Know rates in early layers. In contrast, the **A-Anchored pathway** (which draws evidence from the generated answer itself) shows greater stability, though its rates still vary with the dataset.
- **NQ (A-Anchored)** ends with the lowest I-Don't-Know rates (~20% at layer 15 for 1B, ~10% at layer 25 for 3B), which may indicate that the A-Anchored pathway copes better with general knowledge questions.
- The **3B model's volatility** (e.g., Q-Anchored lines reaching 100%) may point to instability at particular layers, possibly due to increased model complexity.
- The **1B model's smoother trends** suggest more uniform behavior across layers, though its lower capacity may limit performance on complex tasks.
This analysis points to a trade-off between model size and stability, with the A-Anchored series showing more consistent rates across layers.
</details>
<details>
<summary>x79.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Layers for Llama-3 Models
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate" across transformer model layers for two versions of the Llama-3 architecture: Llama-3-8B (left) and Llama-3-70B (right). The graphs visualize performance variability across four datasets (PopQA, TriviaQA, HotpotQA, NQ) using Q-Anchored and A-Anchored methods, giving eight data series per graph. Data is represented with colored lines and shaded confidence intervals.
### Components/Axes
- **X-Axis (Horizontal)**:
- Labeled "Layer"
- Llama-3-8B: 0–30 layers
- Llama-3-70B: 0–80 layers
- **Y-Axis (Vertical)**:
- Labeled "I-Don't-Know Rate" (0–100%)
- **Legends**:
- Positioned at the bottom of both graphs
- Colors and line styles correspond to:
- **Q-Anchored (PopQA)**: Solid blue
- **A-Anchored (PopQA)**: Dashed orange
- **Q-Anchored (TriviaQA)**: Solid green
- **A-Anchored (TriviaQA)**: Dashed gray
- **Q-Anchored (HotpotQA)**: Solid purple
- **A-Anchored (HotpotQA)**: Dashed brown
- **Q-Anchored (NQ)**: Solid pink
- **A-Anchored (NQ)**: Dashed black
### Detailed Analysis
#### Llama-3-8B (Left Graph)
- **Trends**:
- Q-Anchored (PopQA, blue) shows sharp peaks (e.g., ~90% at layer 5, ~70% at layer 15).
- A-Anchored (PopQA, orange) remains relatively stable (~50–60%).
- Q-Anchored (TriviaQA, green) exhibits volatility, dropping to ~20% at layer 25.
- Q-Anchored (HotpotQA, purple) has erratic fluctuations, peaking near 80% at layer 20.
- Q-Anchored (NQ, pink) shows gradual decline from ~70% to ~30%.
#### Llama-3-70B (Right Graph)
- **Trends**:
- Q-Anchored (PopQA, blue) has extreme volatility, reaching ~100% at layer 40.
- A-Anchored (PopQA, orange) stabilizes at ~60–70%.
- Q-Anchored (TriviaQA, green) fluctuates between ~40–80%, with a notable dip at layer 60.
- Q-Anchored (HotpotQA, purple) exhibits frequent spikes (e.g., ~90% at layer 70).
- Q-Anchored (NQ, pink) declines sharply from ~80% to ~20% by layer 80.
### Key Observations
1. **Model Size Impact**: Llama-3-70B shows greater layer-to-layer variability than Llama-3-8B.
2. **Dataset Sensitivity**:
- HotpotQA (purple) demonstrates the highest instability in both models.
- NQ (pink) shows the most consistent decline in Q-Anchored configurations.
3. **Anchoring Method**: A-Anchored methods generally exhibit smoother trends compared to Q-Anchored.
4. **Layer Correlation**: No clear monotonic relationship between layer depth and I-Don't-Know Rate across datasets.
### Interpretation
The data suggests that:
- **Model Scale ≠ Performance**: Larger models (70B) exhibit higher variability in I-Don't-Know rates, potentially due to increased complexity or dataset-specific challenges.
- **Anchoring Strategy**: A-Anchored methods may reduce volatility, though this depends on the dataset (e.g., PopQA vs. HotpotQA).
- **Dataset Difficulty**: HotpotQA consistently correlates with higher uncertainty, possibly reflecting its reliance on multi-hop reasoning.
- **Layer-Specific Failures**: Peaks in Q-Anchored lines (e.g., layer 5 in Llama-3-8B) may indicate architectural bottlenecks or dataset-model mismatches.
The graphs highlight the need for dataset-specific tuning and anchoring strategies to mitigate uncertainty in large language models.
</details>
<details>
<summary>x80.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Layers in Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate" across 30 layers of the Mistral-7B model (versions v0.1 and v0.3). Each graph tracks performance across four datasets (PopQA, TriviaQA, HotpotQA, NQ) using Q-Anchored and A-Anchored methods. The graphs show significant variability in performance, with overlapping trends and sharp fluctuations in certain layers.
---
### Components/Axes
- **X-Axis (Layer)**: Ranges from 0 to 30, labeled "Layer."
- **Y-Axis (I-Don't-Know Rate)**: Ranges from 0 to 100, labeled "I-Don't-Know Rate."
- **Legends**:
- **Left Graph (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed gray: A-Anchored (HotpotQA)
- Solid pink: Q-Anchored (NQ)
- Dashed black: A-Anchored (NQ)
- **Right Graph (v0.3)**:
- Same legend entries, colors, and line styles as v0.1.
---
### Detailed Analysis
#### Left Graph (Mistral-7B-v0.1)
- **Q-Anchored (PopQA)**: Solid blue line. Peaks at ~80% in layer 10, drops to ~20% by layer 30.
- **A-Anchored (PopQA)**: Dashed orange line. Stable around 40–60%, with minor fluctuations.
- **Q-Anchored (TriviaQA)**: Solid green line. Sharp spike to ~90% at layer 5, then declines.
- **A-Anchored (TriviaQA)**: Dashed red line. Gradual decline from ~70% to ~30%.
- **Q-Anchored (HotpotQA)**: Solid purple line. Peaks at ~70% in layer 15, then stabilizes.
- **A-Anchored (HotpotQA)**: Dashed gray line. Fluctuates between 50–70%.
- **Q-Anchored (NQ)**: Solid pink line. Sharp drop from ~90% at layer 5 to ~10% by layer 30.
- **A-Anchored (NQ)**: Dashed black line. Stable at ~40–50%.
#### Right Graph (Mistral-7B-v0.3)
- **Q-Anchored (PopQA)**: Solid blue line. Peaks at ~70% in layer 20, then declines.
- **A-Anchored (PopQA)**: Dashed orange line. Stable at ~50–60%.
- **Q-Anchored (TriviaQA)**: Solid green line. Peaks at ~80% in layer 10, then drops.
- **A-Anchored (TriviaQA)**: Dashed red line. Gradual decline from ~60% to ~20%.
- **Q-Anchored (HotpotQA)**: Solid purple line. Peaks at ~60% in layer 25, then stabilizes.
- **A-Anchored (HotpotQA)**: Dashed gray line. Fluctuates between 40–60%.
- **Q-Anchored (NQ)**: Solid pink line. Sharp drop from ~80% at layer 5 to ~15% by layer 30.
- **A-Anchored (NQ)**: Dashed black line. Stable at ~30–40%.
---
### Key Observations
1. **Layer-Specific Variability**: Both models show erratic I-Don't-Know rates in early layers (e.g., layer 5–10), suggesting instability in initial processing.
2. **Dataset Differences**:
- TriviaQA consistently exhibits higher I-Don't-Know rates than other datasets.
- NQ shows the most dramatic drops in Q-Anchored methods, indicating improved performance in later layers.
3. **Model Version Comparison**:
- v0.3 generally has lower I-Don't-Know rates than v0.1, especially in later layers (e.g., layer 20–30).
- A-Anchored methods (dashed lines) are more stable across layers compared to Q-Anchored (solid lines).
---
### Interpretation
The data suggests that anchoring methods (Q vs. A) and dataset type significantly impact the I-Don't-Know rate. Q-Anchored methods show higher variability and sharper declines in later layers, while A-Anchored methods maintain steadier performance. The reduction in I-Don't-Know rates in v0.3 compared to v0.1 may reflect differences between the two Mistral-7B versions. Notably, TriviaQA’s high rates across layers may indicate domain-specific challenges, while NQ’s steep declines suggest better generalization in later layers. These trends highlight the importance of anchoring strategies and model versioning in handling uncertainty.
</details>
Figure 31: Comparisons of i-don’t-know rate between pathways, probing attention activations of the token immediately preceding the exact answer tokens.
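The per-layer curves in these figures are rates computed separately for each pathway. A minimal sketch of that bookkeeping is given below, assuming each sample has already been assigned to a pathway at a given layer and carries a flag for whether the model abstained. The record layout, the function name `idk_rate_by_layer`, and the toy inputs are illustrative assumptions, not the authors' code or data.

```python
from collections import defaultdict


def idk_rate_by_layer(records):
    """Return {(layer, pathway): I-don't-know rate in percent}.

    `records` is an iterable of (layer, pathway, abstained) tuples.
    This layout is a hypothetical convention chosen for illustration.
    """
    totals = defaultdict(int)
    abstentions = defaultdict(int)
    for layer, pathway, abstained in records:
        key = (layer, pathway)
        totals[key] += 1
        abstentions[key] += int(bool(abstained))
    # Percentage of abstaining samples within each (layer, pathway) group.
    return {key: 100.0 * abstentions[key] / totals[key] for key in totals}


# Toy input for illustration only; these are not values from the paper.
records = [
    (0, "Q-Anchored", True),
    (0, "Q-Anchored", False),
    (0, "A-Anchored", True),
    (0, "A-Anchored", True),
    (1, "Q-Anchored", False),
    (1, "A-Anchored", True),
]
rates = idk_rate_by_layer(records)
```

Plotting `rates` against the layer index, one line per pathway and dataset, would give curves of the same form as those described in the figures.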
<details>
<summary>x81.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Llama-3.2 Model Layers
### Overview
The image contains two side-by-side line graphs comparing the "I-Don't-Know Rate" (percentage of unanswered questions) across neural network layers for two versions of the Llama-3.2 model (1B and 3B parameter sizes). Each graph tracks performance across 15-25 layers, with multiple data series representing different question-answering (QA) datasets and anchoring methods.
### Components/Axes
- **X-axis**: Layer number (1-15 for 1B model, 0-25 for 3B model)
- **Y-axis**: I-Don't-Know Rate (%) (0-100 scale)
- **Legend**: Located at bottom-left of both graphs, with 8 entries:
- Solid lines: Q-Anchored methods
- Dashed lines: A-Anchored methods
- Colors correspond to specific QA datasets:
- Blue: PopQA
- Green: TriviaQA
- Purple: HotpotQA
- Red: NQ (Natural Questions)
### Detailed Analysis
#### Llama-3.2-1B (Left Graph)
- **Q-Anchored (PopQA)**: Starts at ~85% in layer 1, drops to ~30% by layer 5, fluctuates between 20-40% through layer 15
- **A-Anchored (PopQA)**: Peaks at ~70% in layer 3, declines to ~40% by layer 10, stabilizes at ~50%
- **Q-Anchored (TriviaQA)**: Sharp drop from 90% to 20% between layers 1-5, then oscillates between 15-35%
- **A-Anchored (TriviaQA)**: Gradual decline from 80% to 30% across all layers
- **Q-Anchored (HotpotQA)**: Starts at 75%, drops to 10% by layer 5, then fluctuates between 5-25%
- **A-Anchored (HotpotQA)**: Starts at 65%, declines to 20% by layer 10, then rises to 40% at layer 15
- **Q-Anchored (NQ)**: Peaks at 95% in layer 1, drops to 15% by layer 5, then fluctuates between 10-30%
#### Llama-3.2-3B (Right Graph)
- **Q-Anchored (PopQA)**: Starts at 90%, drops to 20% by layer 5, then fluctuates between 10-40%
- **A-Anchored (PopQA)**: Peaks at 80% in layer 3, declines to 30% by layer 10, then rises to 60% at layer 25
- **Q-Anchored (TriviaQA)**: Sharp drop from 85% to 10% between layers 1-5, then oscillates between 5-30%
- **A-Anchored (TriviaQA)**: Starts at 70%, declines to 20% by layer 15, then rises to 50% at layer 25
- **Q-Anchored (HotpotQA)**: Peaks at 95% in layer 1, drops to 5% by layer 5, then fluctuates between 2-20%
- **A-Anchored (HotpotQA)**: Starts at 60%, declines to 10% by layer 10, then rises to 70% at layer 25
- **Q-Anchored (NQ)**: Peaks at 98% in layer 1, drops to 12% by layer 5, then fluctuates between 8-35%
### Key Observations
1. **Model Size Impact**: The 3B model shows more pronounced fluctuations and higher peak rates compared to the 1B model
2. **Anchoring Method Differences**:
- Q-Anchored methods generally show steeper initial declines
- A-Anchored methods exhibit more gradual changes and later-stage increases
3. **Dataset Variability**:
- HotpotQA consistently shows the highest initial rates
- NQ demonstrates the most dramatic early drops
4. **Layer-Specific Patterns**:
- Layer 5 marks the end of the sharp initial drop for the Q-Anchored series across datasets
- Layer 25 in the 3B model shows a late rise for A-Anchored (HotpotQA) to ~70%
### Interpretation
The data suggests that:
1. **Model Complexity**: Larger models (3B) exhibit greater layer-to-layer variability in uncertainty, potentially indicating more complex internal representations
2. **Anchoring Strategy**:
- Q-Anchored methods may prioritize early-layer confidence building
- A-Anchored methods might maintain higher uncertainty in deeper layers before resolving
3. **Dataset Characteristics**:
- HotpotQA's complexity correlates with higher initial uncertainty
- NQ's structured format may enable faster confidence resolution
4. **Layer Dynamics**: The consistent drop at layer 5 across all datasets suggests a critical transition point in the model's processing pipeline
The graphs reveal that anchoring method selection significantly impacts uncertainty distribution across layers, with potential implications for model interpretability and question routing strategies.
</details>
<details>
<summary>x82.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Layers for Llama-3-8B and Llama-3-70B Models
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate" (IDK Rate) across transformer model layers for two Llama-3 variants: 8B and 70B parameters. The graphs visualize performance across four datasets: PopQA, TriviaQA, HotpotQA, and NQ, differentiated by Q-Anchored (question-focused) and A-Anchored (answer-focused) configurations. Data is presented with shaded confidence intervals.
### Components/Axes
- **X-Axis (Layer)**:
- Llama-3-8B: 0–30 layers
- Llama-3-70B: 0–80 layers
- **Y-Axis (I-Don't-Know Rate)**: 0–100% scale
- **Legend**:
- Position: Bottom center
- Entries:
- **Q-Anchored (PopQA)**: Solid blue
- **A-Anchored (PopQA)**: Dashed orange
- **Q-Anchored (TriviaQA)**: Dotted green
- **A-Anchored (TriviaQA)**: Dash-dot purple
- **Q-Anchored (HotpotQA)**: Solid red
- **A-Anchored (HotpotQA)**: Dotted pink
- **Q-Anchored (NQ)**: Dashed gray
- **A-Anchored (NQ)**: Dotted cyan
### Detailed Analysis
#### Llama-3-8B (Left Graph)
- **Q-Anchored (PopQA)**:
- Starts at ~95% IDK in layer 0, drops sharply to ~10% by layer 10, then fluctuates between 10–30%.
- **A-Anchored (PopQA)**:
- Starts at ~60%, dips to ~20% by layer 15, then stabilizes near 30–40%.
- **Q-Anchored (TriviaQA)**:
- Peaks at ~80% in layer 5, drops to ~10% by layer 20, then oscillates between 10–40%.
- **A-Anchored (TriviaQA)**:
- Starts at ~50%, rises to ~70% by layer 10, then declines to ~30% by layer 30.
- **Q-Anchored (HotpotQA)**:
- Begins at ~70%, spikes to ~90% in layer 5, then stabilizes at 40–60%.
- **A-Anchored (HotpotQA)**:
- Starts at ~50%, rises to ~80% by layer 10, then declines to ~40%.
- **Q-Anchored (NQ)**:
- Starts at ~85%, drops to ~20% by layer 10, then fluctuates between 10–30%.
- **A-Anchored (NQ)**:
- Starts at ~60%, rises to ~80% by layer 15, then declines to ~40%.
#### Llama-3-70B (Right Graph)
- **Q-Anchored (PopQA)**:
- Starts at ~90%, drops to ~15% by layer 20, then fluctuates between 10–30%.
- **A-Anchored (PopQA)**:
- Starts at ~55%, rises to ~70% by layer 40, then declines to ~50%.
- **Q-Anchored (TriviaQA)**:
- Peaks at ~85% in layer 10, drops to ~20% by layer 40, then oscillates between 10–40%.
- **A-Anchored (TriviaQA)**:
- Starts at ~45%, rises to ~75% by layer 30, then declines to ~50%.
- **Q-Anchored (HotpotQA)**:
- Begins at ~65%, spikes to ~95% in layer 20, then stabilizes at 50–70%.
- **A-Anchored (HotpotQA)**:
- Starts at ~50%, rises to ~85% by layer 50, then declines to ~60%.
- **Q-Anchored (NQ)**:
- Starts at ~80%, drops to ~10% by layer 30, then fluctuates between 5–25%.
- **A-Anchored (NQ)**:
- Starts at ~55%, rises to ~80% by layer 60, then declines to ~50%.
### Key Observations
1. **Model Size Impact**: Llama-3-70B shows more pronounced fluctuations in IDK rates compared to Llama-3-8B, suggesting larger models may struggle more with certain datasets in specific layers.
2. **Dataset Variability**:
- **HotpotQA** consistently shows the highest IDK rates, especially in Q-Anchored configurations.
- **NQ** exhibits the most dramatic drops in IDK rates for Q-Anchored models.
3. **Anchoring Effects**:
- Q-Anchored models generally show steeper initial drops in IDK rates but higher volatility in later layers.
- A-Anchored models maintain higher IDK rates in mid-layers (e.g., layers 20–50 for Llama-3-70B).
4. **Confidence Intervals**: Shaded regions indicate uncertainty, with wider bands in Llama-3-70B, particularly for TriviaQA and HotpotQA.
### Interpretation
The data suggests that:
- **Q-Anchored models** (question-focused) may prioritize early-layer processing for certain datasets (e.g., PopQA, NQ), while **A-Anchored models** (answer-focused) show delayed but sustained IDK rates in mid-layers.
- The **HotpotQA dataset** poses the greatest challenge, with IDK rates exceeding 80% in multiple layers for both model sizes.
- Llama-3-70B’s increased layer count (80 vs. 30) correlates with more complex IDK patterns, potentially reflecting deeper contextual analysis but also greater uncertainty in specific layers.
- The **NQ dataset** demonstrates the most effective Q-Anchored performance, with IDK rates dropping below 20% in later layers for Llama-3-8B.
This analysis highlights trade-offs between model scale, anchoring strategies, and dataset-specific challenges in knowledge retrieval tasks.
</details>
<details>
<summary>x83.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate Across Layers in Mistral-7B Models
### Overview
The image contains two side-by-side line charts comparing the "I-Don't-Know Rate" (y-axis) across 30 layers (x-axis) for two versions of the Mistral-7B model (v0.1 and v0.3). Each chart includes six data series differentiated by line styles and colors, representing various anchoring methods and datasets (PopQA, TriviaQA, HotpotQA, NQ).
### Components/Axes
- **X-axis**: Layer (0–30, integer ticks)
- **Y-axis**: I-Don't-Know Rate (%) (0–100, integer ticks)
- **Legends**:
- **Left Chart (v0.1)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Dotted green: Q-Anchored (TriviaQA)
- Dash-dot red: A-Anchored (TriviaQA)
- Dash-dot-dot purple: Q-Anchored (HotpotQA)
- Dotted gray: A-Anchored (HotpotQA)
- **Right Chart (v0.3)**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Dotted green: Q-Anchored (TriviaQA)
- Dash-dot red: A-Anchored (TriviaQA)
- Dash-dot-dot purple: Q-Anchored (NQ)
- Dotted gray: A-Anchored (NQ)
### Detailed Analysis
#### Left Chart (Mistral-7B-v0.1)
- **Q-Anchored (PopQA)**: Starts at ~85%, dips to ~20% at layer 10, then fluctuates between 30–60%.
- **A-Anchored (PopQA)**: Peaks at ~90% at layer 0, stabilizes around 60–80% with minor oscillations.
- **Q-Anchored (TriviaQA)**: Begins at ~70%, drops to ~10% at layer 10, then rises to ~50% by layer 30.
- **A-Anchored (TriviaQA)**: Starts at ~60%, fluctuates between 40–80%.
- **Q-Anchored (HotpotQA)**: Peaks at ~95% at layer 0, drops to ~30% at layer 10, then stabilizes at ~50–70%.
- **A-Anchored (HotpotQA)**: Starts at ~70%, fluctuates between 50–90%.
#### Right Chart (Mistral-7B-v0.3)
- **Q-Anchored (PopQA)**: Starts at ~70%, dips to ~20% at layer 10, then stabilizes at ~40–60%.
- **A-Anchored (PopQA)**: Peaks at ~80% at layer 0, stabilizes around 60–80%.
- **Q-Anchored (TriviaQA)**: Begins at ~60%, drops to ~10% at layer 10, then rises to ~40% by layer 30.
- **A-Anchored (TriviaQA)**: Starts at ~50%, fluctuates between 30–70%.
- **Q-Anchored (NQ)**: Peaks at ~90% at layer 0, drops to ~20% at layer 10, then stabilizes at ~40–60%.
- **A-Anchored (NQ)**: Starts at ~60%, fluctuates between 40–80%.
### Key Observations
1. **Version Differences**:
- v0.3 shows generally lower I-Don't-Know rates than v0.1 for most series (e.g., Q-Anchored PopQA drops from ~85% to ~70% at layer 0).
- v0.3 exhibits smoother trends compared to v0.1’s sharper fluctuations.
2. **Anchoring Impact**:
- Beyond the earliest layers, Q-Anchored series generally show lower rates than their A-Anchored counterparts in both versions.
- Exception: at layer 0, Q-Anchored (HotpotQA) in v0.1 starts at ~95%, above A-Anchored (HotpotQA) at ~70%.
3. **Dataset Variability**:
- HotpotQA and NQ datasets exhibit the highest variability (e.g., Q-Anchored NQ in v0.3 peaks at ~90% at layer 0).
- PopQA and TriviaQA datasets show more stable trends.
### Interpretation
The data suggests that the anchoring pathway (Q vs. A) significantly influences the I-Don't-Know Rate, with Q-Anchored series generally showing lower rates after the earliest layers. Version v0.3 shows more stable curves across datasets, which may reflect differences between the two model versions. However, the HotpotQA and NQ series remain the most variable, indicating potential challenges in handling complex queries. The layer-specific fluctuations (e.g., sharp drops at layer 10) may reflect how information is processed at particular depths. Further investigation into dataset-specific model behavior is warranted.
</details>
Figure 32: Comparisons of i-don’t-know rate between pathways, probing attention activations of the last exact answer token.
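The figure descriptions repeatedly contrast "smoother" A-Anchored curves with more "volatile" Q-Anchored ones. One way to make that comparison concrete is the mean absolute change between consecutive layers, sketched below. The metric, the function name `mean_abs_layer_change`, and the two toy curves are assumptions chosen for illustration. They are not a measure or data reported in the paper.

```python
def mean_abs_layer_change(rates):
    """Mean absolute difference between consecutive per-layer rates.

    `rates` is a list of I-don't-know rates (in percent), ordered by layer.
    Larger values indicate a more volatile curve.
    """
    if len(rates) < 2:
        return 0.0
    diffs = [abs(later - earlier) for earlier, later in zip(rates, rates[1:])]
    return sum(diffs) / len(diffs)


# Toy curves for illustration only; these are not values read from the figures.
flat_curve = [50.0, 55.0, 50.0, 55.0]
steep_curve = [90.0, 20.0, 40.0, 10.0]

flat_score = mean_abs_layer_change(flat_curve)
steep_score = mean_abs_layer_change(steep_curve)
```

A lower score for one pathway than the other, computed on the same dataset and model, would support the qualitative "more stable" reading. The score ignores the shaded confidence bands, so it should be read alongside them.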
<details>
<summary>x84.png Details</summary>

### Visual Description
## Line Graphs: I-Don't-Know Rate Across Layers in LLaMA-3.2 Models
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate" (y-axis) across model layers (x-axis) for two versions of the LLaMA-3.2 architecture: **LLaMA-3.2-1B** (left) and **LLaMA-3.2-3B** (right). Each graph includes eight data series representing different question-answering (QA) anchoring methods and datasets. The graphs use shaded regions to indicate variability (confidence intervals) around the mean values.
---
### Components/Axes
- **X-Axis (Layer)**:
- Left graph: Layers 0–15 (LLaMA-3.2-1B).
- Right graph: Layers 0–25 (LLaMA-3.2-3B).
- Labels: "Layer" with tick marks at intervals of 5.
- **Y-Axis (I-Don't-Know Rate)**:
- Range: 0–100%.
- Labels: "I-Don't-Know Rate" with increments of 20.
- **Legend**:
- Located at the bottom of both graphs.
- Eight data series, differentiated by line style and color:
1. **Q-Anchored (PopQA)**: Solid blue.
2. **A-Anchored (PopQA)**: Dashed orange.
3. **Q-Anchored (TriviaQA)**: Solid green.
4. **A-Anchored (TriviaQA)**: Dashed brown.
5. **Q-Anchored (HotpotQA)**: Solid purple.
6. **Q-Anchored (NQ)**: Dashed pink.
7. **A-Anchored (HotpotQA)**: Dashed gray.
8. **A-Anchored (NQ)**: Dotted gray.
---
### Detailed Analysis
#### Left Graph (LLaMA-3.2-1B)
- **Q-Anchored (PopQA)**:
- Starts at ~90% in layer 0, drops sharply to ~10% by layer 5, then fluctuates between 10–30%.
- **A-Anchored (PopQA)**:
- Starts at ~60%, remains relatively stable (~50–70%) with minor peaks.
- **Q-Anchored (TriviaQA)**:
- Begins at ~80%, dips to ~20% by layer 5, then rises to ~60% by layer 15.
- **A-Anchored (TriviaQA)**:
- Starts at ~50%, fluctuates between 40–60%.
- **Q-Anchored (HotpotQA)**:
- Starts at ~70%, drops to ~30% by layer 5, then rises to ~50% by layer 15.
- **Q-Anchored (NQ)**:
- Starts at ~60%, dips to ~20% by layer 5, then rises to ~40% by layer 15.
- **A-Anchored (HotpotQA)**:
- Starts at ~50%, fluctuates between 40–60%.
- **A-Anchored (NQ)**:
- Starts at ~40%, fluctuates between 30–50%.
#### Right Graph (LLaMA-3.2-3B)
- **Q-Anchored (PopQA)**:
- Starts at ~80%, drops to ~20% by layer 5, then fluctuates between 10–40%.
- **A-Anchored (PopQA)**:
- Starts at ~60%, remains stable (~50–70%) with minor peaks.
- **Q-Anchored (TriviaQA)**:
- Begins at ~80%, dips to ~10% by layer 5, then rises to ~70% by layer 25.
- **A-Anchored (TriviaQA)**:
- Starts at ~50%, fluctuates between 40–60%.
- **Q-Anchored (HotpotQA)**:
- Starts at ~70%, drops to ~20% by layer 5, then rises to ~60% by layer 25.
- **Q-Anchored (NQ)**:
- Starts at ~60%, dips to ~10% by layer 5, then rises to ~50% by layer 25.
- **A-Anchored (HotpotQA)**:
- Starts at ~50%, fluctuates between 40–60%.
- **A-Anchored (NQ)**:
- Starts at ~40%, fluctuates between 30–50%.
---
### Key Observations
1. **Layer-Specific Variability**:
- Both models show significant fluctuations in I-Don't-Know rates, particularly in layers 5–15 (1B) and 10–20 (3B).
- The 3B model exhibits more pronounced volatility, especially in layers 20–25.
2. **Dataset-Specific Trends**:
- **PopQA** (solid blue/dashed orange lines): the Q-Anchored series drops sharply in early layers and then fluctuates at low levels, while the A-Anchored series stays near 50–70%.
- **TriviaQA** (solid green/dashed brown lines) has sharp drops in early layers, followed by recovery.
- **HotpotQA** (solid purple/dashed gray lines) exhibits the most dramatic early-layer drops.
- **NQ** (dashed pink/dotted gray lines) consistently shows lower rates but with occasional spikes.
3. **Model Size Impact**:
- The 3B model’s lines are more erratic, suggesting increased sensitivity to layer-specific factors.
---
### Interpretation
The data suggests that anchoring methods and datasets significantly influence the model’s uncertainty across layers. Early layers (0–5) show high I-Don't-Know rates for most methods, likely due to insufficient contextual understanding. Later layers demonstrate recovery, but the 3B model’s larger size introduces greater variability, possibly reflecting architectural complexity. Methods like **Q-Anchored (HotpotQA)** and **A-Anchored (NQ)** appear more stable, indicating robustness in handling uncertainty. The spikes in the 3B model (e.g., layer 20 for Q-Anchored HotpotQA) may highlight critical layers where the model struggles with specific datasets. This analysis underscores the importance of dataset choice and anchoring strategy in mitigating uncertainty in large language models.
</details>
<details>
<summary>x85.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Llama-3 Model Sizes and Anchoring Methods
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate (%)" across layers of two Llama-3 language models (8B and 70B parameters). The 8B graph shows six data series and the 70B graph eight, representing different question-answering datasets (PopQA, TriviaQA, and HotpotQA, plus NQ for 70B) and anchoring methods (Q-Anchored vs. A-Anchored). The graphs reveal layer-dependent performance variations and dataset-specific behaviors.
### Components/Axes
- **X-Axis (Layer)**:
- Llama-3-8B: 0–30 (discrete increments)
- Llama-3-70B: 0–80 (discrete increments)
- **Y-Axis (I-Don't-Know Rate)**: 0–100% (continuous scale)
- **Legends**:
- **Llama-3-8B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed gray: A-Anchored (HotpotQA)
- **Llama-3-70B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed orange: A-Anchored (PopQA)
- Solid green: Q-Anchored (TriviaQA)
- Dashed red: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dashed gray: A-Anchored (HotpotQA)
- Solid pink: Q-Anchored (NQ)
- Dashed gray: A-Anchored (NQ)
### Detailed Analysis
#### Llama-3-8B
- **Q-Anchored (PopQA)**: Starts at ~85% in Layer 0, drops sharply to ~20% by Layer 10, then fluctuates between 10–30%.
- **A-Anchored (PopQA)**: Begins at ~60%, rises to ~80% by Layer 5, then stabilizes around 60–70%.
- **Q-Anchored (TriviaQA)**: Peaks at ~70% in Layer 0, declines to ~30% by Layer 20, with oscillations.
- **A-Anchored (TriviaQA)**: Starts at ~50%, rises to ~75% by Layer 10, then declines to ~50%.
- **Q-Anchored (HotpotQA)**: Sharp drop from ~90% to ~10% by Layer 5, then stabilizes near 10%.
- **A-Anchored (HotpotQA)**: Begins at ~70%, fluctuates between 50–80% throughout.
#### Llama-3-70B
- **Q-Anchored (PopQA)**: Starts at ~70%, drops to ~30% by Layer 20, then stabilizes.
- **A-Anchored (PopQA)**: Begins at ~50%, rises to ~70% by Layer 40, then declines.
- **Q-Anchored (TriviaQA)**: Peaks at ~80% in Layer 0, declines to ~40% by Layer 60.
- **A-Anchored (TriviaQA)**: Starts at ~60%, rises to ~85% by Layer 40, then stabilizes.
- **Q-Anchored (HotpotQA)**: Sharp drop from ~95% to ~20% by Layer 10, then stabilizes.
- **A-Anchored (HotpotQA)**: Begins at ~60%, fluctuates between 40–70%.
- **Q-Anchored (NQ)**: Starts at ~60%, drops to ~20% by Layer 20, then stabilizes.
- **A-Anchored (NQ)**: Begins at ~40%, rises to ~60% by Layer 60, then declines.
### Key Observations
1. **Model Size Impact**: Llama-3-70B shows more pronounced oscillations and higher initial I-Don't-Know rates compared to the 8B model.
2. **Anchoring Method**:
- Q-Anchored methods generally show steeper initial declines but stabilize at lower rates.
- A-Anchored methods exhibit higher variability but maintain higher rates in later layers.
3. **Dataset Sensitivity**:
- HotpotQA (complex reasoning) shows the most dramatic drops for Q-Anchored methods.
- NQ (common knowledge) exhibits the least variability in the 70B model.
4. **Layer Dependency**: Both models show significant performance shifts in early layers (0–20), with stabilization in later layers.
### Interpretation
The data suggests that anchoring methods (Q vs. A) differentially affect model uncertainty across datasets and model sizes. Larger models (70B) exhibit greater sensitivity to anchoring choices, particularly in early layers. The steep declines in Q-Anchored methods for complex datasets (HotpotQA) imply that explicit question anchoring reduces uncertainty more effectively for challenging tasks. However, A-Anchored methods maintain higher uncertainty in later layers, potentially indicating over-reliance on answer patterns. The stabilization of rates in later layers across both models suggests that deeper layers achieve more consistent performance, though the 70B model's oscillations highlight trade-offs between scale and stability.
</details>
<details>
<summary>x86.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate Across Layers for Mistral-7B Models
### Overview
The image contains two side-by-side line charts comparing the "I-Don't-Know Rate" (IDK Rate) across 30 layers for two versions of the Mistral-7B model (v0.1 and v0.3). Each chart includes multiple data series for the Q-Anchored and A-Anchored pathways across four datasets: PopQA, TriviaQA, HotpotQA, and NQ. The charts use color-coded lines with solid (Q-Anchored) and dashed (A-Anchored) styles to distinguish between anchoring methods.
### Components/Axes
- **X-axis**: "Layer" (0 to 30), representing the depth of the model's layers.
- **Y-axis**: "I-Don't-Know Rate" (0 to 100), indicating the percentage of instances where the model responded with "I don't know."
- **Legend**:
- **Solid lines**: Q-Anchored models (e.g., Q-Anchored (PopQA), Q-Anchored (TriviaQA), etc.).
- **Dashed lines**: A-Anchored models (e.g., A-Anchored (PopQA), A-Anchored (TriviaQA), etc.).
- **Colors**:
- Blue: Q-Anchored (PopQA)
- Green: Q-Anchored (TriviaQA)
- Orange: Q-Anchored (HotpotQA)
- Red: Q-Anchored (NQ)
- Purple: A-Anchored (PopQA)
- Gray: A-Anchored (TriviaQA)
- Dark gray: A-Anchored (HotpotQA)
- Light gray: A-Anchored (NQ)
### Detailed Analysis
#### Mistral-7B-v0.1
- **Q-Anchored (PopQA)**: Starts at ~90% at layer 0, drops sharply to ~30% by layer 10, then fluctuates between 20-40%.
- **A-Anchored (PopQA)**: Starts at ~50%, remains relatively stable (~40-60%) across layers.
- **Q-Anchored (TriviaQA)**: Peaks at ~80% at layer 0, drops to ~20% by layer 10, then fluctuates between 10-30%.
- **A-Anchored (TriviaQA)**: Starts at ~60%, decreases to ~30% by layer 10, then stabilizes (~20-40%).
- **Q-Anchored (HotpotQA)**: Peaks at ~70% at layer 0, drops to ~20% by layer 10, then fluctuates between 10-30%.
- **A-Anchored (HotpotQA)**: Starts at ~50%, decreases to ~20% by layer 10, then stabilizes (~10-30%).
- **Q-Anchored (NQ)**: Peaks at ~60% at layer 0, drops to ~10% by layer 10, then fluctuates between 5-20%.
- **A-Anchored (NQ)**: Starts at ~40%, decreases to ~10% by layer 10, then stabilizes (~5-15%).
#### Mistral-7B-v0.3
- **Q-Anchored (PopQA)**: Starts at ~70%, drops to ~20% by layer 10, then fluctuates between 10-30%.
- **A-Anchored (PopQA)**: Starts at ~50%, remains stable (~40-60%) across layers.
- **Q-Anchored (TriviaQA)**: Peaks at ~60% at layer 0, drops to ~10% by layer 10, then fluctuates between 5-20%.
- **A-Anchored (TriviaQA)**: Starts at ~50%, decreases to ~20% by layer 10, then stabilizes (~10-30%).
- **Q-Anchored (HotpotQA)**: Peaks at ~60% at layer 0, drops to ~10% by layer 10, then fluctuates between 5-20%.
- **A-Anchored (HotpotQA)**: Starts at ~40%, decreases to ~10% by layer 10, then stabilizes (~5-15%).
- **Q-Anchored (NQ)**: Peaks at ~50% at layer 0, drops to ~5% by layer 10, then fluctuates between 2-10%.
- **A-Anchored (NQ)**: Starts at ~30%, decreases to ~5% by layer 10, then stabilizes (~2-10%).
### Key Observations
1. **Q-Anchored models** (solid lines) exhibit higher variability and sharper declines in IDK rates compared to A-Anchored models (dashed lines).
2. **A-Anchored models** show more stability, with gradual declines or consistent rates across layers.
3. **Dataset-specific trends**:
- **PopQA**: Q-Anchored models start with the highest IDK rates (up to 90% in v0.1) but decline sharply.
- **TriviaQA**: Q-Anchored models have the most dramatic drops (e.g., 80% to 20% in v0.1).
- **NQ**: Q-Anchored models show the steepest declines (e.g., 60% to 10% in v0.1).
4. **Version differences**: Mistral-7B-v0.3 generally has lower baseline IDK rates than v0.1, suggesting improved performance or reduced uncertainty in later layers.
### Interpretation
The data suggests that **Q-Anchored models** (which may prioritize question-specific context) are more sensitive to layer depth, leading to higher initial uncertainty that decreases rapidly. In contrast, **A-Anchored models** (which may rely on broader contextual anchoring) maintain more stable IDK rates, indicating robustness to layer-specific variations. The decline in IDK rates across layers for Q-Anchored models could reflect improved confidence as the model processes deeper layers. However, the variability in trends across datasets highlights that the anchoring method interacts differently with the complexity of each task. The lower baseline rates in v0.3 suggest architectural or training improvements in later versions, though the exact mechanisms remain unclear without additional context.
</details>
Figure 33: Comparison of I-don’t-know rates between pathways, probing MLP activations of the final token.
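The quantity plotted in these figures is the percentage of responses in which the model abstains. A minimal sketch of how such a rate could be computed per pathway is given below. The abstention pattern and the helper names `idk_rate` and `idk_rate_by_group` are assumptions made for illustration; the matching rule actually used in the paper is not specified in this section.

```python
import re

# Hypothetical abstention matcher (an assumption, not the paper's rule):
# matches "I don't know", "I dont know", and "I do not know", case-insensitively.
_IDK_PATTERN = re.compile(r"\bi\s+(?:do\s+not|don['’]?t)\s+know\b", re.IGNORECASE)


def idk_rate(responses):
    """Return the percentage (0-100) of responses containing an abstention phrase."""
    if not responses:
        raise ValueError("responses must be non-empty")
    hits = sum(1 for r in responses if _IDK_PATTERN.search(r))
    return 100.0 * hits / len(responses)


def idk_rate_by_group(records):
    """Group (pathway, response) pairs by pathway and compute one rate per group."""
    groups = {}
    for pathway, response in records:
        groups.setdefault(pathway, []).append(response)
    return {pathway: idk_rate(rs) for pathway, rs in groups.items()}


if __name__ == "__main__":
    toy = [
        ("Q-Anchored", "I don't know."),
        ("Q-Anchored", "Paris"),
        ("A-Anchored", "Berlin"),
    ]
    print(idk_rate_by_group(toy))
```

Repeating this computation for the samples assigned to each pathway at every probed layer would yield one curve per pathway and dataset, which is the shape of the plots described here.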
<details>
<summary>x87.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across LLaMA Model Layers
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate" (percentage of instances where models abstained from answering) across different layers of LLaMA-3.2-1B and LLaMA-3.2-3B models. Each graph includes eight data series, one for each combination of question-answering (QA) dataset and anchoring method (Q-Anchored vs. A-Anchored). The graphs show significant variability in I-Don't-Know rates across layers, with distinct patterns emerging for different datasets and anchoring strategies.
### Components/Axes
- **X-Axis (Layer)**:
- LLaMA-3.2-1B: 2.5 → 15.0 (discrete layer markers)
- LLaMA-3.2-3B: 0 → 25 (discrete layer markers)
- **Y-Axis (I-Don't-Know Rate)**: 0% → 100% (continuous scale)
- **Legend**:
- **Q-Anchored (PopQA)**: Solid blue line
- **Q-Anchored (TriviaQA)**: Dashed green line
- **Q-Anchored (HotpotQA)**: Dotted purple line
- **Q-Anchored (NQ)**: Dash-dot pink line
- **A-Anchored (PopQA)**: Solid orange line
- **A-Anchored (TriviaQA)**: Dashed brown line
- **A-Anchored (HotpotQA)**: Dotted gray line
- **A-Anchored (NQ)**: Dash-dot red line
- **Shading**: Confidence intervals (95% CI) around each line
### Detailed Analysis
#### LLaMA-3.2-1B Graph
1. **Q-Anchored (PopQA)**:
- Starts at ~60% (layer 2.5), peaks at ~80% (layer 5), then declines to ~40% (layer 15)
- Sharp drop between layers 5-7.5
2. **Q-Anchored (TriviaQA)**:
- Starts at ~50%, rises to ~70% (layer 7.5), then declines to ~30% (layer 15)
- High variability between layers 7.5-10
3. **Q-Anchored (HotpotQA)**:
- Starts at ~40%, peaks at ~90% (layer 5), then declines to ~20% (layer 15)
- Extreme volatility between layers 2.5-10
4. **Q-Anchored (NQ)**:
- Starts at ~55%, fluctuates between 40-65% (layers 2.5-12.5), then stabilizes at ~50%
5. **A-Anchored (PopQA)**:
- Starts at ~50%, rises to ~70% (layer 7.5), then declines to ~55% (layer 15)
- Smooth inverted-U-shaped curve
6. **A-Anchored (TriviaQA)**:
- Starts at ~45%, rises to ~65% (layer 5), then declines to ~40% (layer 15)
- Moderate volatility
7. **A-Anchored (HotpotQA)**:
- Starts at ~35%, peaks at ~85% (layer 5), then declines to ~30% (layer 15)
- Extreme volatility similar to Q-Anchored (HotpotQA)
8. **A-Anchored (NQ)**:
- Starts at ~50%, fluctuates between 40-60% (layers 2.5-15)
#### LLaMA-3.2-3B Graph
1. **Q-Anchored (PopQA)**:
- Starts at ~70%, peaks at ~95% (layer 5), then declines to ~50% (layer 25)
- Sharp drop between layers 5-10
2. **Q-Anchored (TriviaQA)**:
- Starts at ~60%, rises to ~80% (layer 10), then declines to ~40% (layer 25)
- High variability between layers 10-15
3. **Q-Anchored (HotpotQA)**:
- Starts at ~50%, peaks at ~100% (layer 5), then declines to ~20% (layer 25)
- Extreme volatility with multiple peaks
4. **Q-Anchored (NQ)**:
- Starts at ~65%, fluctuates between 50-80% (layers 0-20), then stabilizes at ~60%
5. **A-Anchored (PopQA)**:
- Starts at ~60%, rises to ~80% (layer 10), then declines to ~65% (layer 25)
- Smooth inverted-U-shaped curve
6. **A-Anchored (TriviaQA)**:
- Starts at ~55%, rises to ~75% (layer 15), then declines to ~50% (layer 25)
- Moderate volatility
7. **A-Anchored (HotpotQA)**:
- Starts at ~45%, peaks at ~95% (layer 5), then declines to ~40% (layer 25)
- Extreme volatility similar to Q-Anchored (HotpotQA)
8. **A-Anchored (NQ)**:
- Starts at ~55%, fluctuates between 45-70% (layers 0-25)
### Key Observations
1. **Anchoring Method Impact**:
- A-Anchored methods generally show more stable I-Don't-Know rates than Q-Anchored methods
- Q-Anchored (HotpotQA) exhibits the most extreme volatility in both models
2. **Model Size Differences**:
- LLaMA-3.2-3B shows higher baseline I-Don't-Know rates (60-80% vs 40-70% in 1B model)
- 3B model has more pronounced layer-specific variability
3. **Dataset-Specific Patterns**:
- HotpotQA consistently shows the highest I-Don't-Know rates across all anchoring methods
- NQ dataset demonstrates the most stable performance in both models
4. **Layer-Specific Trends**:
- Layers 5-10 consistently show peak I-Don't-Know rates
- Final layers (12.5-15 for 1B, 20-25 for 3B) show significant drops
### Interpretation
The data suggests that anchoring methods significantly influence model uncertainty patterns. A-Anchored methods demonstrate greater stability across layers, potentially indicating better generalization. Q-Anchored methods, particularly with HotpotQA, show extreme volatility suggesting sensitivity to layer depth. The 3B model's larger architecture correlates with higher baseline uncertainty but more pronounced layer-specific patterns. The consistent peaks at layers 5-10 across datasets may indicate critical processing stages where models are most likely to abstain from answering. The stability of NQ dataset results across anchoring methods suggests it may be less sensitive to model architecture variations. These patterns could inform model design choices for question-answering systems, particularly regarding layer selection and anchoring strategies.
</details>
<details>
<summary>x88.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Llama-3 Model Sizes and Anchoring Methods
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate" (percentage of unanswered questions) across layers of two Llama-3 language models: Llama-3-8B (left) and Llama-3-70B (right). Each graph shows eight data series, one for each combination of question dataset (PopQA, TriviaQA, HotpotQA, NQ) and anchoring method (Q-Anchored vs. A-Anchored). The graphs reveal layer-dependent performance variations, with notable fluctuations in higher layers for the 70B model.
### Components/Axes
- **X-axis**: Layer (0–30 for Llama-3-8B, 0–80 for Llama-3-70B)
- **Y-axis**: I-Don't-Know Rate (%) (0–100)
- **Legend**:
- Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- **Color coding**:
- Blue: PopQA
- Green: TriviaQA
- Purple: HotpotQA
- Red: NQ
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **Q-Anchored (PopQA)**: Starts at ~90% at layer 0, drops sharply to ~40% by layer 10, then fluctuates between 30–60%.
- **A-Anchored (PopQA)**: Begins at ~40%, rises to ~60% by layer 10, then stabilizes near 50–70%.
- **Q-Anchored (TriviaQA)**: Peaks at ~80% at layer 0, declines to ~30% by layer 20, with erratic mid-range fluctuations.
- **A-Anchored (TriviaQA)**: Starts at ~50%, dips to ~20% by layer 10, then rises to ~60% by layer 30.
- **Q-Anchored (HotpotQA)**: Begins at ~70%, drops to ~20% by layer 10, then oscillates between 10–50%.
- **A-Anchored (HotpotQA)**: Starts at ~30%, rises to ~50% by layer 10, then stabilizes near 40–60%.
- **Q-Anchored (NQ)**: Peaks at ~85% at layer 0, declines to ~30% by layer 20, with sharp mid-layer dips.
- **A-Anchored (NQ)**: Starts at ~40%, rises to ~70% by layer 10, then fluctuates between 50–80%.
#### Llama-3-70B (Right Chart)
- **Q-Anchored (PopQA)**: Starts at ~80%, drops to ~30% by layer 20, then fluctuates between 20–60%.
- **A-Anchored (PopQA)**: Begins at ~50%, rises to ~70% by layer 40, then stabilizes near 60–80%.
- **Q-Anchored (TriviaQA)**: Peaks at ~90% at layer 0, declines to ~20% by layer 60, with erratic mid-range fluctuations.
- **A-Anchored (TriviaQA)**: Starts at ~40%, dips to ~10% by layer 20, then rises to ~70% by layer 80.
- **Q-Anchored (HotpotQA)**: Begins at ~60%, drops to ~10% by layer 40, then oscillates between 5–50%.
- **A-Anchored (HotpotQA)**: Starts at ~20%, rises to ~50% by layer 40, then stabilizes near 40–60%.
- **Q-Anchored (NQ)**: Peaks at ~95% at layer 0, declines to ~20% by layer 80, with sharp mid-layer dips.
- **A-Anchored (NQ)**: Starts at ~30%, rises to ~80% by layer 60, then fluctuates between 60–90%.
### Key Observations
1. **Model Size Impact**: The 70B model exhibits more pronounced fluctuations in higher layers (e.g., layer 60–80) compared to the 8B model.
2. **Anchoring Method Differences**:
- Q-Anchored methods generally show higher initial I-Don't-Know rates but sharper declines.
- A-Anchored methods maintain more stable or increasing rates in later layers.
3. **Dataset Variability**:
- NQ (Natural Questions) consistently shows the highest initial I-Don't-Know rates.
- HotpotQA demonstrates the most erratic behavior in the 70B model.
4. **Layer-Specific Trends**:
- In Llama-3-8B, layer 10–20 shows critical performance shifts for most datasets.
- In Llama-3-70B, layer 40–60 exhibits significant divergence between anchoring methods.
### Interpretation
The data suggests that anchoring methods (Q vs. A) differentially affect model performance across layers and model sizes. Q-Anchored methods may prioritize early-layer accuracy at the cost of later-layer robustness, while A-Anchored methods appear more consistent in higher layers. The 70B model’s increased volatility in later layers could indicate greater sensitivity to architectural complexity or dataset-specific challenges. Notably, the NQ dataset’s extreme initial I-Don't-Know rates (up to 95%) highlight its role as a particularly challenging benchmark. These trends may reflect trade-offs between model capacity, question complexity, and anchoring strategy design.
</details>
<details>
<summary>x89.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate Across Layers for Mistral-7B Models (v0.1 and v0.3)
### Overview
The image contains two line charts comparing the "I-Don't-Know Rate" (y-axis) across 30 layers (x-axis) for different question-answering models and anchoring methods in two versions of Mistral-7B (v0.1 and v0.3). Each chart includes multiple data series with distinct line styles and colors, representing combinations of anchoring types (Q-Anchored/A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). Confidence intervals are visualized as shaded regions around the lines.
---
### Components/Axes
- **X-Axis**: Layer (0–30, integer increments)
- **Y-Axis**: I-Don't-Know Rate (0–100%, integer increments)
- **Legends**:
- **Left Chart (v0.1)**:
- Q-Anchored (PopQA): Solid blue
- A-Anchored (PopQA): Dashed orange
- Q-Anchored (TriviaQA): Dotted green
- A-Anchored (TriviaQA): Dash-dot red
- Q-Anchored (HotpotQA): Solid purple
- A-Anchored (HotpotQA): Dashed gray
- **Right Chart (v0.3)**:
- Q-Anchored (PopQA): Solid blue
- A-Anchored (PopQA): Dashed orange
- Q-Anchored (TriviaQA): Dotted green
- A-Anchored (TriviaQA): Dash-dot red
- Q-Anchored (HotpotQA): Solid purple
- A-Anchored (HotpotQA): Dashed gray
- Q-Anchored (NQ): Dotted pink
- A-Anchored (NQ): Dash-dot gray
---
### Detailed Analysis
#### Left Chart (Mistral-7B-v0.1)
- **Q-Anchored (PopQA)** (blue solid): Peaks at ~90% at layer 5, drops to ~40% at layer 15, then fluctuates between 50–70%.
- **A-Anchored (PopQA)** (orange dashed): Stable between 40–60%, with minor dips at layers 10 and 25.
- **Q-Anchored (TriviaQA)** (green dotted): Sharp spike to ~80% at layer 10, then declines to ~30% by layer 30.
- **A-Anchored (TriviaQA)** (red dash-dot): Gradual decline from ~70% to ~40%, with a plateau at layer 20.
- **Q-Anchored (HotpotQA)** (purple solid): Oscillates between 50–70%, with a peak at layer 25 (~80%).
- **A-Anchored (HotpotQA)** (gray dashed): Relatively flat (~50–60%), with a dip to ~40% at layer 15.
#### Right Chart (Mistral-7B-v0.3)
- **Q-Anchored (PopQA)** (blue solid): Peaks at ~80% at layer 10, then declines to ~50% by layer 30.
- **A-Anchored (PopQA)** (orange dashed): Stable between 50–70%, with a minor dip at layer 20.
- **Q-Anchored (TriviaQA)** (green dotted): Peaks at ~70% at layer 5, declines to ~40% by layer 30.
- **A-Anchored (TriviaQA)** (red dash-dot): Gradual decline from ~60% to ~30%, with a plateau at layer 15.
- **Q-Anchored (HotpotQA)** (purple solid): Peaks at ~75% at layer 20, then declines to ~50%.
- **A-Anchored (HotpotQA)** (gray dashed): Stable between 50–60%, with a dip to ~40% at layer 10.
- **Q-Anchored (NQ)** (pink dotted): Peaks at ~85% at layer 5, declines to ~40% by layer 30.
- **A-Anchored (NQ)** (gray dash-dot): Stable between 50–70%, with a peak at layer 25 (~80%).
---
### Key Observations
1. **Version Comparison**:
- v0.3 shows reduced variability in I-Don't-Know rates compared to v0.1 (narrower shaded confidence intervals).
- v0.1 exhibits sharper spikes (e.g., Q-Anchored TriviaQA at layer 10), while v0.3 trends are smoother.
2. **Anchoring Impact**:
- Q-Anchored models generally show higher I-Don't-Know rates than A-Anchored counterparts in both versions.
- Exceptions: A-Anchored (NQ) in v0.3 matches Q-Anchored (NQ) in variability.
3. **Dataset-Specific Trends**:
- **PopQA**: Q-Anchored models dominate in v0.1 but stabilize in v0.3.
- **TriviaQA**: Q-Anchored models exhibit extreme fluctuations in v0.1, mitigated in v0.3.
- **HotpotQA**: Q-Anchored models show late-layer spikes in v0.1 (layer 25) and v0.3 (layer 20).
4. **Outliers**:
- Q-Anchored (TriviaQA) in v0.1 has an anomalous spike at layer 10 (~80%), far exceeding other series.
- A-Anchored (NQ) in v0.3 peaks at layer 25 (~80%), matching Q-Anchored (NQ) in v0.3.
---
### Interpretation
The data suggests that anchoring methods (Q vs. A) and dataset types significantly influence model uncertainty. Q-Anchored models (e.g., PopQA, TriviaQA) exhibit higher I-Don't-Know rates, particularly in earlier layers, indicating potential over-reliance on specific training data. The reduction in variability in v0.3 implies architectural improvements or better generalization. The late-layer spikes in HotpotQA (v0.1/v0.3) may reflect domain-specific challenges. Notably, A-Anchored (NQ) in v0.3 performs comparably to Q-Anchored models, suggesting that anchoring strategy may be less critical for NQ datasets. These trends highlight trade-offs between specialization and robustness in model design.
</details>
Figure 34: Comparison of I-don’t-know rates between pathways, probing MLP activations of the token immediately preceding the exact answer tokens.
<details>
<summary>x90.png Details</summary>

### Visual Description
## Line Graph: I-Don't-Know Rate Across Layers in LLaMA-3.2 Models
### Overview
The image contains two line graphs comparing the "I-Don't-Know Rate" (IDK rate) across layers in two LLaMA-3.2 models: **LLaMA-3.2-1B** (left) and **LLaMA-3.2-3B** (right). Each graph shows six data series (lines) representing different anchoring methods (Q-Anchored/A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis measures IDK rate (%), and the x-axis represents model layers.
---
### Components/Axes
- **X-Axis (Layer)**:
- LLaMA-3.2-1B: 0–15 layers (discrete increments).
- LLaMA-3.2-3B: 0–25 layers (discrete increments).
- **Y-Axis (I-Don't-Know Rate)**: 0–100% (continuous scale).
- **Legends**:
- **LLaMA-3.2-1B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted red: A-Anchored (PopQA)
- Dashed gray: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dotted black: A-Anchored (HotpotQA)
- **LLaMA-3.2-3B**:
- Solid blue: Q-Anchored (PopQA)
- Dashed green: Q-Anchored (TriviaQA)
- Dotted red: A-Anchored (PopQA)
- Dashed gray: A-Anchored (TriviaQA)
- Solid purple: Q-Anchored (HotpotQA)
- Dotted black: A-Anchored (NQ)
---
### Detailed Analysis
#### LLaMA-3.2-1B (Left Graph)
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at ~80% at layer 0, drops sharply to ~20% by layer 5, then fluctuates between ~30–50%.
2. **Q-Anchored (TriviaQA)** (dashed green):
- Begins at ~60%, dips to ~10% at layer 5, then rises to ~40% by layer 15.
3. **A-Anchored (PopQA)** (dotted red):
- Starts at ~50%, peaks at ~70% at layer 5, then declines to ~40%.
4. **A-Anchored (TriviaQA)** (dashed gray):
- Starts at ~40%, drops to ~20% at layer 5, then stabilizes near ~30%.
5. **Q-Anchored (HotpotQA)** (solid purple):
- Begins at ~70%, plunges to ~10% at layer 5, then oscillates between ~20–40%.
6. **A-Anchored (HotpotQA)** (dotted black):
- Starts at ~60%, drops to ~30% at layer 5, then stabilizes near ~40%.
#### LLaMA-3.2-3B (Right Graph)
1. **Q-Anchored (PopQA)** (solid blue):
- Starts at ~90%, drops to ~30% at layer 5, then fluctuates between ~40–60%.
2. **Q-Anchored (TriviaQA)** (dashed green):
- Begins at ~70%, dips to ~10% at layer 5, then rises to ~50% by layer 25.
3. **A-Anchored (PopQA)** (dotted red):
- Starts at ~60%, peaks at ~80% at layer 5, then declines to ~50%.
4. **A-Anchored (TriviaQA)** (dashed gray):
- Starts at ~50%, drops to ~20% at layer 5, then stabilizes near ~35%.
5. **Q-Anchored (HotpotQA)** (solid purple):
- Begins at ~80%, plunges to ~10% at layer 5, then oscillates between ~20–50%.
6. **A-Anchored (NQ)** (dotted black):
- Starts at ~70%, drops to ~40% at layer 5, then stabilizes near ~50%.
---
### Key Observations
1. **General Trend**: IDK rates generally decrease as layers increase, but with significant fluctuations.
2. **Dataset Variability**:
- **HotpotQA** shows high initial IDK rates (~60–80%) followed by sharp declines by layer 5.
- **NQ** (only in 3.2-3B) exhibits moderate IDK rates (~40–70%) with gradual declines.
3. **Anchoring Method Differences**:
- **Q-Anchored** methods (PopQA, TriviaQA, HotpotQA) show steeper initial drops compared to **A-Anchored** methods.
- **A-Anchored (PopQA)** in 3.2-3B peaks at ~80% at layer 5, the highest IDK rate among the A-Anchored series.
4. **Outliers**:
- Q-Anchored (HotpotQA) in 3.2-3B has a sharp spike to ~50% at layer 20, deviating from its earlier trend.
---
### Interpretation
1. **Model Behavior**:
- The IDK rate reflects the model's uncertainty in answering questions. Lower rates suggest higher confidence.
- **Q-Anchored** methods (question-focused) show more pronounced declines, possibly due to better alignment with question semantics.
- **A-Anchored** methods (answer-focused) exhibit higher variability, suggesting sensitivity to answer-specific features.
2. **Dataset Complexity**:
- **HotpotQA** (multi-hop reasoning) likely drives higher initial uncertainty, as deeper layers may struggle with complex reasoning.
- **NQ** (factual QA) shows more stable IDK rates, indicating consistent performance across layers.
3. **Layer-Specific Insights**:
- Layer 5 consistently acts as a critical point where IDK rates drop sharply, possibly marking a transition from surface-level to deeper contextual processing.
- In 3.2-3B, the larger model size (25 layers) allows for more nuanced IDK rate modulation, especially in later layers (e.g., layer 20+).
---
### Conclusion
The graphs reveal that anchoring methods and dataset complexity significantly influence IDK rates. Q-Anchored methods generally reduce uncertainty more effectively, while larger models (3.2-3B) exhibit finer-grained layer-specific behavior. These trends highlight the importance of anchoring strategies in balancing model confidence and performance.
</details>
<details>
<summary>x91.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate Across Llama-3 Models and Datasets
### Overview
The image presents two line charts comparing the "I-Don't-Know Rate" (percentage of instances where a model responds with "I don't know") across different layers of the Llama-3-8B and Llama-3-70B models. The data is segmented by dataset (PopQA, TriviaQA, HotpotQA, NQ) and anchoring type (Q-Anchored vs. A-Anchored). The charts visualize how the I-Don't-Know rate varies with model layers and dataset-specific characteristics.
---
### Components/Axes
- **X-Axis (Layer)**:
- Llama-3-8B: 0 to 30 (in increments of 10)
- Llama-3-70B: 0 to 80 (in increments of 20)
- **Y-Axis (I-Don't-Know Rate)**: 0 to 100 (percentage)
- **Legend**:
- **Q-Anchored (PopQA)**: Blue solid line
- **Q-Anchored (TriviaQA)**: Green solid line
- **Q-Anchored (HotpotQA)**: Purple solid line
- **Q-Anchored (NQ)**: Pink solid line
- **A-Anchored (PopQA)**: Blue dashed line
- **A-Anchored (TriviaQA)**: Green dashed line
- **A-Anchored (HotpotQA)**: Purple dashed line
- **A-Anchored (NQ)**: Pink dashed line
- **Chart Titles**:
- Left: "Llama-3-8B"
- Right: "Llama-3-70B"
---
### Detailed Analysis
#### Llama-3-8B (Left Chart)
- **PopQA (Blue Solid)**:
- Starts at ~80% in layer 0, drops sharply to ~20% by layer 10, then fluctuates between 10–30%.
- **TriviaQA (Green Solid)**:
- Begins at ~60%, dips to ~10% by layer 10, then rises to ~40% by layer 30.
- **HotpotQA (Purple Solid)**:
- Peaks at ~90% in layer 0, drops to ~30% by layer 10, then stabilizes around 20–40%.
- **NQ (Pink Solid)**:
- Starts at ~70%, declines to ~20% by layer 10, then oscillates between 10–30%.
#### Llama-3-70B (Right Chart)
- **PopQA (Blue Solid)**:
- Begins at ~85%, drops to ~25% by layer 20, then fluctuates between 10–40%.
- **TriviaQA (Green Solid)**:
- Starts at ~65%, dips to ~15% by layer 20, then rises to ~50% by layer 60.
- **HotpotQA (Purple Solid)**:
- Peaks at ~95% in layer 0, drops to ~35% by layer 20, then stabilizes around 20–50%.
- **NQ (Pink Solid)**:
- Begins at ~75%, declines to ~25% by layer 20, then oscillates between 10–40%.
---
### Key Observations
1. **Model Size Impact**:
- Llama-3-70B shows more stable I-Don't-Know rates across layers compared to Llama-3-8B, suggesting larger models may handle uncertainty more consistently.
2. **Dataset Variability**:
- **HotpotQA** consistently exhibits the highest I-Don't-Know rates (up to 95% in layer 0), indicating it is the most challenging dataset.
- **NQ** shows the most erratic fluctuations, with sharp drops and rises across layers.
3. **Anchoring Type**:
- Q-Anchored (solid lines) and A-Anchored (dashed lines) trends are visually similar, but Q-Anchored lines often start higher in layer 0.
4. **Layer-Specific Trends**:
- Early layers (0–10) show the highest I-Don't-Know rates, with a general decline as layers increase, though some datasets (e.g., TriviaQA) exhibit late-layer spikes.
---
### Interpretation
The data suggests that:
- **Model Size**: Larger models (Llama-3-70B) demonstrate more stable I-Don't-Know rates, possibly due to better generalization or reduced uncertainty in deeper layers.
- **Dataset Complexity**: HotpotQA's high initial rates imply it tests the model's ability to handle complex, multi-step reasoning, while NQ's volatility may reflect its reliance on ambiguous or context-dependent queries.
- **Anchoring Methods**: The lack of significant divergence between Q-Anchored and A-Anchored lines suggests that anchoring type has minimal impact on the I-Don't-Know rate, though Q-Anchored lines may reflect initial confidence biases.
Notable anomalies include the sharp drop in HotpotQA's I-Don't-Know rate after layer 10, which could indicate a shift in model behavior or dataset-specific thresholds. The persistent fluctuations in NQ highlight its sensitivity to layer-specific model dynamics.
</details>
<details>
<summary>x92.png Details</summary>

### Visual Description
## Line Chart: I-Don't-Know Rate Across Model Layers (Mistral-7B-v0.1 and v0.3)
### Overview
The image contains two side-by-side line charts comparing the "I-Don't-Know Rate" across 30 layers of the Mistral-7B model (versions v0.1 and v0.3). Each chart displays eight data series, one for each combination of anchoring strategy (Q-Anchored and A-Anchored) and dataset (PopQA, TriviaQA, HotpotQA, and NQ, i.e., Natural Questions). The y-axis ranges from 0 to 100%, and the x-axis spans layers 0–30.
---
### Components/Axes
- **Y-Axis**: "I-Don't-Know Rate" (0–100%)
- **X-Axis**: "Layer" (0–30)
- **Legend**:
- Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
- Colors:
- Blue: Q-Anchored (PopQA)
- Green: Q-Anchored (TriviaQA)
- Purple: Q-Anchored (HotpotQA)
- Pink: Q-Anchored (NQ)
- Orange: A-Anchored (PopQA)
- Red: A-Anchored (TriviaQA)
- Gray: A-Anchored (HotpotQA)
- Dark Gray: A-Anchored (NQ)
---
### Detailed Analysis
#### Mistral-7B-v0.1
1. **Q-Anchored (PopQA)**:
- Starts at ~80% (layer 0), drops sharply to ~20% (layer 10), then rises to ~40% (layer 30).
- Sharp dip at layer 10 suggests a critical transition point.
2. **A-Anchored (PopQA)**:
- Starts at ~60%, fluctuates between ~40–60% (layers 0–30).
- Less volatility than Q-Anchored.
3. **Q-Anchored (TriviaQA)**:
- Begins at ~70%, dips to ~30% (layer 10), then rises to ~50% (layer 30).
4. **A-Anchored (TriviaQA)**:
- Starts at ~50%, stabilizes around ~40–50% (layers 0–30).
5. **Q-Anchored (HotpotQA)**:
- Peaks at ~90% (layer 0), drops to ~30% (layer 10), then rises to ~50% (layer 30).
6. **A-Anchored (HotpotQA)**:
- Starts at ~70%, fluctuates between ~50–70% (layers 0–30).
7. **Q-Anchored (NQ)**:
- Starts at ~60%, dips to ~20% (layer 10), then rises to ~40% (layer 30).
8. **A-Anchored (NQ)**:
- Starts at ~50%, stabilizes around ~40–50% (layers 0–30).
#### Mistral-7B-v0.3
1. **Q-Anchored (PopQA)**:
- Starts at ~70%, drops to ~25% (layer 10), then rises to ~45% (layer 30).
2. **A-Anchored (PopQA)**:
- Starts at ~55%, fluctuates between ~40–60% (layers 0–30).
3. **Q-Anchored (TriviaQA)**:
- Begins at ~65%, dips to ~35% (layer 10), then rises to ~55% (layer 30).
4. **A-Anchored (TriviaQA)**:
- Starts at ~50%, stabilizes around ~40–50% (layers 0–30).
5. **Q-Anchored (HotpotQA)**:
- Peaks at ~85% (layer 0), drops to ~35% (layer 10), then rises to ~55% (layer 30).
6. **A-Anchored (HotpotQA)**:
- Starts at ~65%, fluctuates between ~50–70% (layers 0–30).
7. **Q-Anchored (NQ)**:
- Starts at ~55%, dips to ~25% (layer 10), then rises to ~45% (layer 30).
8. **A-Anchored (NQ)**:
- Starts at ~45%, stabilizes around ~40–50% (layers 0–30).
---
### Key Observations
1. **Layer 10 as a Critical Transition**:
- All Q-Anchored models show sharp drops in I-Don't-Know rates at layer 10, followed by gradual increases.
- A-Anchored models exhibit smoother, more stable trends.
2. **Version Differences**:
- v0.3 shows slightly lower initial I-Don't-Know rates (e.g., Q-Anchored PopQA: 70% vs. 80% in v0.1).
- v0.3’s Q-Anchored models recover more gradually post-layer 10.
3. **Question Type Variability**:
- HotpotQA (complex reasoning) has the highest initial I-Don't-Know rates (~80–90%).
- NQ (Natural Questions) shows moderate rates (~50–60%).
4. **Anchoring Strategy Impact**:
- Q-Anchored models are more volatile, with sharper drops and recoveries.
- A-Anchored models maintain steadier performance across layers.
---
### Interpretation
The data suggests that anchoring strategies significantly influence the model’s uncertainty handling:
- **Q-Anchored Models**: Likely prioritize question-specific context, leading to higher initial uncertainty (layer 0) but rapid adaptation (layer 10). However, their volatility may indicate over-reliance on specific question patterns.
- **A-Anchored Models**: Demonstrate robustness, maintaining consistent performance across layers. This suggests better generalization but potentially less sensitivity to question-specific nuances.
- **Version v0.3 Improvements**: Reduced initial uncertainty and smoother recovery post-layer 10 may reflect architectural optimizations or training adjustments.
- **HotpotQA Sensitivity**: High initial uncertainty aligns with its complexity, highlighting challenges in reasoning tasks.
The charts underscore trade-offs between specialization (Q-Anchored) and generalization (A-Anchored), with implications for deployment in dynamic vs. static environments.
</details>
Figure 35: Comparison of I-don’t-know rates between pathways, probing MLP activations of the last exact answer token.
Appendix H Pathway-Aware Detection

| Method | LLama-3.2-1B | | | | LLama-3.2-3B | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | PopQA | TriviaQA | HotpotQA | NQ | PopQA | TriviaQA | HotpotQA | NQ |
| P(True) | 60.00 | 49.65 | 43.34 | 52.83 | 54.58 | 51.76 | 47.73 | 53.78 |
| Logits-mean | 74.89 | 60.24 | 60.18 | 49.92 | 73.47 | 63.46 | 60.35 | 54.89 |
| Logits-max | 58.56 | 52.37 | 52.29 | 46.19 | 56.03 | 54.33 | 48.65 | 48.88 |
| Logits-min | 78.66 | 62.37 | 67.14 | 51.20 | 80.92 | 69.60 | 71.11 | 58.24 |
| Scores-mean | 72.91 | 61.13 | 62.16 | 64.67 | 67.99 | 61.96 | 64.91 | 61.71 |
| Scores-max | 69.33 | 59.74 | 61.29 | 64.08 | 63.34 | 61.92 | 61.09 | 57.56 |
| Scores-min | 64.84 | 55.93 | 59.28 | 55.81 | 61.51 | 56.76 | 63.95 | 57.43 |
| Probing Baseline | 94.25 | 77.17 | 90.25 | 74.83 | 90.96 | 76.61 | 86.54 | 74.20 |
| MoP-RandomGate | 83.69 | 69.20 | 84.11 | 68.76 | 79.69 | 72.38 | 75.13 | 67.11 |
| MoP-VanillaExperts | 93.86 | 78.63 | 90.91 | 75.73 | 90.98 | 77.68 | 86.41 | 75.30 |
| MoP | 95.85 | 80.07 | 91.51 | 79.19 | 92.74 | 78.72 | 88.16 | 78.14 |
| PR | 96.18 | 84.22 | 92.80 | 86.45 | 95.70 | 80.66 | 90.66 | 81.91 |

Table 8: Comparison of hallucination detection performance (AUC) on LLama-3.2-1B and LLama-3.2-3B.
| Method | PopQA (8B) | TriviaQA (8B) | HotpotQA (8B) | NQ (8B) | PopQA (70B) | TriviaQA (70B) | HotpotQA (70B) | NQ (70B) |
|---|---|---|---|---|---|---|---|---|
| P(True) | 55.85 | 49.92 | 52.14 | 53.27 | 54.83 | 50.96 | 49.39 | 51.18 |
| Logits-mean | 74.52 | 60.39 | 51.94 | 52.63 | 67.81 | 52.40 | 50.45 | 48.28 |
| Logits-max | 58.08 | 52.20 | 46.40 | 47.89 | 56.21 | 48.16 | 43.42 | 45.33 |
| Logits-min | 85.36 | 70.89 | 61.28 | 56.50 | 79.96 | 61.53 | 62.63 | 52.16 |
| Scores-mean | 62.87 | 62.09 | 62.06 | 60.32 | 56.81 | 60.70 | 60.91 | 58.05 |
| Scores-max | 56.62 | 60.24 | 59.85 | 56.06 | 55.15 | 59.60 | 57.32 | 51.93 |
| Scores-min | 60.99 | 58.27 | 60.33 | 57.68 | 58.77 | 58.22 | 64.06 | 58.05 |
| Probing Baseline | 88.71 | 77.58 | 82.23 | 70.20 | 86.88 | 81.59 | 84.45 | 74.39 |
| MoP-RandomGate | 75.52 | 69.17 | 79.88 | 66.56 | 67.96 | 70.56 | 72.16 | 66.28 |
| MoP-VanillaExperts | 89.11 | 78.73 | 84.57 | 71.21 | 86.04 | 82.47 | 82.48 | 73.85 |
| MoP | 92.11 | 81.18 | 85.45 | 74.64 | 88.54 | 84.12 | 86.65 | 76.12 |
| PR | 94.01 | 83.13 | 87.81 | 79.10 | 90.08 | 84.21 | 87.69 | 78.24 |
Table 9: Comparison of hallucination detection performance (AUC) on LLama-3-8B and LLama-3-70B.
| Method | PopQA (v0.1) | TriviaQA (v0.1) | HotpotQA (v0.1) | NQ (v0.1) | PopQA (v0.3) | TriviaQA (v0.3) | HotpotQA (v0.3) | NQ (v0.3) |
|---|---|---|---|---|---|---|---|---|
| P(True) | 48.78 | 50.43 | 51.94 | 55.52 | 45.49 | 47.61 | 57.87 | 52.79 |
| Logits-mean | 69.09 | 64.95 | 54.47 | 59.41 | 69.52 | 66.76 | 55.45 | 57.88 |
| Logits-max | 54.37 | 54.76 | 46.74 | 56.45 | 54.34 | 55.24 | 48.39 | 54.37 |
| Logits-min | 86.02 | 76.56 | 68.06 | 53.73 | 87.05 | 77.33 | 68.08 | 54.40 |
| Scores-mean | 59.00 | 59.61 | 64.18 | 57.60 | 58.84 | 60.22 | 63.28 | 60.05 |
| Scores-max | 51.71 | 56.58 | 63.29 | 55.82 | 53.00 | 55.55 | 63.13 | 57.73 |
| Scores-min | 60.00 | 57.48 | 61.17 | 48.51 | 60.59 | 57.84 | 59.85 | 50.76 |
| Probing Baseline | 89.61 | 78.43 | 83.76 | 74.10 | 87.39 | 81.74 | 83.19 | 73.60 |
| MoP-RandomGate | 80.50 | 68.27 | 74.51 | 68.05 | 79.81 | 70.88 | 72.23 | 61.19 |
| MoP-VanillaExperts | 89.82 | 79.51 | 83.54 | 74.78 | 88.53 | 80.93 | 82.93 | 73.77 |
| MoP | 92.44 | 84.03 | 84.63 | 76.38 | 91.66 | 83.57 | 85.82 | 76.87 |
| PR | 94.72 | 84.66 | 89.04 | 80.92 | 93.09 | 84.36 | 89.03 | 79.09 |
Table 10: Comparison of hallucination detection performance (AUC) on Mistral-7B-v0.1 and Mistral-7B-v0.3.
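Tables 8, 9, and 10 report detection performance as AUC, scaled to percentages. As a reading aid, the sketch below shows one standard way to compute AUC from per-example detector scores: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. It is a minimal illustration, not the evaluation code behind the tables. The function name `roc_auc`, the toy labels and scores, and the convention that label 1 marks a hallucinated answer with higher scores meaning "more likely hallucinated" are all assumptions made for this example.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for the positive class, 0 for the negative class
            (assumed here: 1 = hallucinated answer).
    scores: detector outputs, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the paper.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc_percent = 100.0 * roc_auc(toy_labels, toy_scores)
print(f"AUC = {auc_percent:.2f}")
```

Under this definition, a detector that assigns the same score to every example obtains 50, so values near 50 in the tables indicate a signal that ranks examples roughly at chance level.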