2510.09033
# Large Language Models Do NOT Really Know What They Don't Know
## Abstract
Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may "know what they don't know". However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish factual from hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that LLMs don't really know what they don't know.
Chi Seng Cheang 1 Hou Pong Chan 2 Wenxuan Zhang 3 Yang Deng 1 1 Singapore Management University 2 DAMO Academy, Alibaba Group 3 Singapore University of Technology and Design cs.cheang.2025@phdcs.smu.edu.sg, houpong.chan@alibaba-inc.com wxzhang@sutd.edu.sg, ydeng@smu.edu.sg
## 1 Introduction
Large language models (LLMs) demonstrate remarkable proficiency in generating coherent and contextually relevant text, yet they remain plagued by hallucination Zhang et al. (2023b); Huang et al. (2025), a phenomenon where outputs appear plausible but are factually inaccurate or entirely fabricated, raising concerns about their reliability and trustworthiness. To this end, researchers suggest that the internal states of LLMs (e.g., hidden representations Azaria and Mitchell (2023); Gottesman and Geva (2024), attention weights Yüksekgönül et al. (2024), output token logits Orgad et al. (2025); Varshney et al. (2023), etc.) can be used to detect hallucinations, indicating that LLMs themselves may actually know what they don't know. These methods typically assume that when a model produces hallucinated outputs (e.g., "Barack Obama was born in the city of Tokyo" in Figure 1), its internal computations for the outputs ("Tokyo") are detached from the input information ("Barack Obama"), thereby differing from those used to generate factually correct outputs. Thus, the hidden states are expected to capture this difference and serve as indicators of hallucinations.
<details>
<summary>x1.png Details</summary>

The diagram flows left to right: factual queries about Barack Obama (e.g., "Barack Obama was born in the city of") enter an LLM block, the model's internal states are depicted as a scatter plot of colored dots, and the generated outputs are color-coded to match. Green dots mark Factual Associations (example output: "Chicago"), blue dots mark Associated Hallucinations (example output: "Chicago"), and red dots mark Unassociated Hallucinations (example output: "Tokyo"). The same query appears twice among the inputs, illustrating that an identical prompt can yield different output types. In the internal-state plot, the blue cluster partially overlaps the green one while the red cluster is clearly separated, conveying that the surface form of an output ("Chicago") is insufficient to judge its factuality, and that only subject-detached hallucinations occupy a distinct region of the representation space.

</details>
Figure 1: Illustration of three categories of knowledge. Associated hallucinations follow internal knowledge recall processes similar to those of factual associations, while unassociated hallucinations arise when the model's output is detached from the input.
However, other research (Lin et al., 2022b; Kang and Choi, 2023; Cheang et al., 2023) shows that models can also generate false information that is closely associated with the input information. In particular, models may adopt knowledge shortcuts, favoring tokens that frequently co-occur in the training corpus over factually correct answers Kang and Choi (2023). As shown in Figure 1, given the prompt "Barack Obama was born in the city of", an LLM may rely on the subject tokens' representations (i.e., "Barack Obama") to predict a hallucinated output (e.g., "Chicago"), which is statistically associated with the subject entity but in other contexts (e.g., "Barack Obama studied in the city of Chicago"). Therefore, we suspect that the internal computations may not exhibit distinguishable patterns between correct predictions and input-associated hallucinations, as LLMs rely on the input information to produce both. Only when the model produces hallucinations unassociated with the input do the hidden states exhibit distinct patterns that can be reliably identified.
To this end, we conduct a mechanistic analysis of how LLMs internally process factual queries. We first perform causal analysis to identify hidden states crucial for generating Factual Associations (FAs) â factually correct outputs grounded in subject knowledge. We then examine how these hidden states behave when the model produces two types of factual errors: Associated Hallucinations (AHs), which remain grounded in subject knowledge, and Unassociated Hallucinations (UHs), which are detached from it. Our analysis shows that when generating both FAs and AHs, LLMs propagate information encoded in subject representations to the final token during output generation, resulting in overlapping hidden-state geometries that cannot reliably distinguish AHs from FAs. In contrast, UHs exhibit distinct internal computational patterns, producing clearly separable hidden-state geometries from FAs.
Building on the analysis, we revisit several widely used hallucination detection approaches Gottesman and Geva (2024); Yüksekgönül et al. (2024); Orgad et al. (2025) that adopt internal state probing. The results show that these representations cannot reliably distinguish AHs from FAs due to their overlapping hidden-state geometries, though they can effectively separate UHs from FAs. Moreover, this geometry also limits the effectiveness of Refusal Tuning Zhang et al. (2024), which trains LLMs to refuse uncertain queries using a refusal-aware dataset. Because UH samples exhibit consistent and distinctive patterns, refusal tuning generalizes well to unseen UHs but fails to generalize to unseen AHs. We also find that AH hidden states are more diverse, and thus refusal tuning with AH samples prevents generalization across both AH and UH samples.
Together, these findings highlight a central limitation: LLMs do not encode truthfulness in their hidden states but only patterns of knowledge recall and utilization, showing that LLMs don't really know what they don't know.
## 2 Related Work
Existing hallucination detection methods can be broadly categorized into two types: representation-based and confidence-based. Representation-based methods assume that an LLM's internal hidden states can reflect the correctness of its generated responses. These approaches train a classifier (often a linear probe) using the hidden states from a set of labeled correct/incorrect responses to predict whether a new response is hallucinatory Li et al. (2023); Azaria and Mitchell (2023); Su et al. (2024); Ji et al. (2024); Chen et al. (2024); Ni et al. (2025); Xiao et al. (2025). Confidence-based methods, in contrast, assume that lower confidence during generation implies a higher probability of hallucination. These methods quantify uncertainty through various signals, including: (i) token-level output probabilities (Guerreiro et al., 2023; Varshney et al., 2023; Orgad et al., 2025); (ii) directly querying the LLM to verbalize its own confidence (Lin et al., 2022a; Tian et al., 2023; Xiong et al., 2024; Yang et al., 2024b; Ni et al., 2024; Zhao et al., 2024); or (iii) measuring the semantic consistency across multiple outputs sampled from the same prompt (Manakul et al., 2023; Kuhn et al., 2023; Zhang et al., 2023a; Ding et al., 2024). A response is typically flagged as a hallucination if its associated confidence metric falls below a predetermined threshold.
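As a concrete illustration of signal (i), such a detector can be sketched as a threshold on the mean token log-probability of a generation. This is a minimal sketch, not the implementation of any cited method; the function name and threshold value are illustrative assumptions.

```python
def flag_hallucination(token_logprobs, threshold=-2.5):
    """Flag a response as a likely hallucination when its mean token
    log-probability falls below a fixed threshold (illustrative value)."""
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return mean_lp < threshold

# A confidently generated response (per-token log-probs near 0) passes,
# while a low-confidence one is flagged.
confident = flag_hallucination([-0.1, -0.2, -0.3])   # False
uncertain = flag_hallucination([-3.0, -4.0, -2.5])   # True
```

In practice the threshold is tuned on held-out labeled data, which is exactly where the calibration problems discussed below become relevant.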
However, a growing body of work reveals a critical limitation: even state-of-the-art LLMs are poorly calibrated, meaning their expressed confidence often fails to align with the factual accuracy of their generations (Kapoor et al., 2024; Xiong et al., 2024; Tian et al., 2023). This miscalibration limits the effectiveness of confidence-based detectors and raises a fundamental question about the extent of LLMs' self-awareness of their knowledge boundary, i.e., whether they can "know what they don't know" Yin et al. (2023); Li et al. (2025). Despite recognizing this problem, prior work does not provide a mechanistic explanation for its occurrence. To this end, our work addresses this explanatory gap by employing mechanistic interpretability techniques to trace the internal computations underlying knowledge recall within LLMs.
## 3 Preliminary
Transformer Architecture
Given an input sequence of $T$ tokens $t_1,\dots,t_T$, an LLM is trained to model the conditional probability distribution of the next token, $p(t_{T+1} \mid t_1,\dots,t_T)$, conditioned on the preceding $T$ tokens. Each token is first mapped to a continuous vector by an embedding layer. The resulting sequence of hidden states is then processed by a stack of $L$ Transformer layers. At layer $\ell \in \{1,\dots,L\}$, each token representation is updated by a Multi-Head Self-Attention (MHSA) and a Feed-Forward Network (MLP) module:
$$
h^\ell = h^{\ell-1} + a^\ell + m^\ell, \tag{1}
$$
where $a^\ell$ and $m^\ell$ correspond to the MHSA and MLP outputs, respectively, at the $\ell$-th layer.
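Eq. (1) can be illustrated with a toy residual-stream update. The `mhsa` and `mlp` functions below are scalar stand-ins for the actual modules, chosen only to make the additive structure of the update visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def mhsa(h):       # stand-in for the attention module's output a^l
    return 0.1 * h

def mlp(h):        # stand-in for the MLP module's output m^l
    return 0.2 * h

def layer(h_prev):
    a = mhsa(h_prev)
    m = mlp(h_prev + a)    # the MLP typically reads the post-attention state
    return h_prev + a + m  # Eq. (1): h^l = h^{l-1} + a^l + m^l

h = rng.normal(size=d)
h_next = layer(h)
```

The key point mirrored here is that each layer only *adds* to the residual stream, so information written at any position (e.g., the subject tokens) persists and can be read by later layers.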
Internal Process of Knowledge Recall
Prior work investigates the internal activations of LLMs to study the mechanics of knowledge recall. For example, an LLM may encode many attributes that are associated with a subject (e.g., Barack Obama) (Geva et al., 2023). Given a prompt like "Barack Obama was born in the city of", if the model has correctly encoded the fact, the attribute "Honolulu" propagates through self-attention to the last token, yielding the correct answer. We hypothesize that non-factual predictions follow the same mechanism: spurious attributes such as "Chicago" are also encoded and propagated, leading the model to generate false outputs.
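The propagation step can be caricatured as attention-weighted pooling: the last token attends to the subject position and thereby copies the attribute vector stored there into its own state. Every quantity below (vectors, weights) is an invented toy for illustration, not measured model data.

```python
import numpy as np

d = 4
attr_subject = np.array([1.0, 0.0, 0.0, 0.0])  # attribute encoded at the subject token
hidden = np.stack([
    attr_subject,   # subject token ("Barack Obama")
    np.zeros(d),    # relation token(s)
    np.zeros(d),    # last token before the update
])
attn_weights = np.array([0.9, 0.05, 0.05])  # last token attends mostly to the subject

# Attention output at the last position: weighted sum over all positions.
last_update = attn_weights @ hidden
```

Whether the pooled attribute is "Honolulu" or a spurious "Chicago" depends only on what was written at the subject position; the copying mechanism itself is identical, which is the hypothesis the paper tests.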
Categorization of Knowledge
To investigate how LLMs internally process factual queries, we define three categories of knowledge, according to two criteria: 1) factual correctness, and 2) subject representation reliance.
- Factual Associations (FA) refer to factual knowledge that is reliably stored in the parameters or internal states of an LLM and can be recalled to produce correct, verifiable outputs.
- Associated Hallucinations (AH) refer to non-factual content produced when an LLM relies on input-triggered parametric associations.
- Unassociated Hallucinations (UH) refer to non-factual content produced without reliance on parametric associations to the input.
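The two criteria above (factual correctness and subject representation reliance) induce a simple decision rule, sketched below; the boolean inputs are hypothetical stand-ins for the actual labeling tests described in Appendix A.2.

```python
def categorize(is_correct: bool, relies_on_subject: bool) -> str:
    """Toy labeling rule for the three knowledge categories."""
    if is_correct:
        return "FA"  # Factual Association
    # Incorrect outputs are split by whether they rely on subject representations.
    return "AH" if relies_on_subject else "UH"
```

For example, a wrong answer produced by propagating subject information ("Chicago" for Obama's birthplace) is an AH, while a wrong answer detached from the subject ("Tokyo") is a UH.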
<details>
<summary>x2.png Details</summary>

Heatmap of average JS divergence (scale ~0.2 to ~0.6, darker blue = higher) across layers 0-30 for three intervention sites: Subj. (top), Attn. (middle), Last. (bottom). Subj. shows very high divergence (~0.6) through layers 0-16, dropping sharply to ~0.2 afterward. Attn. is low everywhere except a moderate peak (~0.3-0.35) around layers 11-17. Last. stays low until about layer 18, then rises gradually to ~0.4-0.45 by layers 28-30. The three sites are causally influential in largely non-overlapping layer ranges, consistent with a sequential recall pipeline: subject enrichment in early layers, attention-based propagation in middle layers, and output construction at the last token in late layers.

</details>
(a) Factual Associations
<details>
<summary>x3.png Details</summary>

Heatmap of average JS divergence with the same axes and scale as x2.png (layers 0-30 on the x-axis; Subj., Attn., Last. rows; ~0.2-0.6 color scale). Subj. stays near ~0.6 through layers 0-14 and declines more gradually, reaching ~0.2-0.25 by layers 24-30. Attn. shows a mild mid-layer band (~0.3-0.35) around layers 12-16. Last. rises steadily after layer 16 from ~0.2 to ~0.45-0.5 in the final layers. The complementary hand-off from Subj. to Last., with a transition zone around layers 14-18, closely mirrors the pattern for factual associations.

</details>
(b) Associated Hallucinations
<details>
<summary>x4.png Details</summary>

Heatmap of average JS divergence with the same axes and scale (layers 0-30; Subj., Attn., Last. rows; ~0.2-0.6 color scale). Subj. divergence is highest (~0.5-0.6) only in the very first layers (0-4) and fades to ~0.25-0.3 by layer 16. Attn. remains uniformly low (~0.2-0.25) across all layers, and Last. is similarly low with only a slight uptick (~0.25-0.3) at layers 28-30. In contrast to the other two panels, neither attention flow nor the last-token states carry substantial causal influence, indicating that subject information is not propagated to the output position.

</details>
(c) Unassociated Hallucinations
Figure 2: Effect of interventions across layers of LLaMA-3-8B. The heatmap shows JS divergence between the output distribution before and after intervention. Darker color indicates that the intervened hidden states are more causally influential on the model's predictions. Top row: patching representations of subject tokens. Middle row: blocking attention flow from subject to the last token. Bottom row: patching representations of the last token.
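The quantity plotted in Figure 2 is the standard Jensen-Shannon divergence between the model's next-token distributions before and after an intervention. A minimal NumPy sketch (not the authors' code) is:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two probability
    distributions p and q; eps guards against log(0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):  # KL divergence in bits
        return np.sum(a * np.log2((a + eps) / (b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give 0; fully disjoint ones give 1 bit,
# so a large value marks an intervention that changed the prediction.
```

Because JS divergence is bounded in [0, 1] (base 2), the per-layer heatmap values are directly comparable across intervention sites.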
Dataset Construction
| Category | LLaMA-3 | Mistral |
| --- | --- | --- |
| Factual Association | 3,506 | 3,354 |
| Associated Hallucination | 1,406 | 1,284 |
| Unassociated Hallucination | 7,381 | 7,655 |
| Total | 12,293 | 12,293 |
Table 1: Dataset statistics across categories.
Our study is conducted under a basic knowledge-based question answering setting. The model is given a prompt containing a subject and relation (e.g., "Barack Obama was born in the city of") and is expected to predict the corresponding object (e.g., "Honolulu"). To build the dataset, we collect knowledge triples $(subject, relation, object)$ from Wikidata. Each relation is paired with a handcrafted prompt template to convert triples into natural language queries. The details of relation selection and prompt templates are provided in Appendix A.1. We then apply the labeling scheme presented in Appendix A.2: correct predictions are labeled as FAs, while incorrect ones are classified as AHs or UHs depending on their reliance on subject representations. Table 1 summarizes the final data statistics.
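The triple-to-prompt conversion can be sketched as follows; the template strings and relation keys are illustrative placeholders, not the paper's actual templates from Appendix A.1.

```python
# Hypothetical relation -> template mapping (the real set is in Appendix A.1).
TEMPLATES = {
    "place_of_birth": "{subject} was born in the city of",
    "father": "The name of the father of {subject} is",
}

def triple_to_prompt(subject: str, relation: str, obj: str):
    """Convert a (subject, relation, object) triple into a natural
    language query; the object serves as the gold answer."""
    return TEMPLATES[relation].format(subject=subject), obj

prompt, gold = triple_to_prompt("Barack Obama", "place_of_birth", "Honolulu")
```

A model prediction matching `gold` is then labeled as an FA, and a mismatch is labeled AH or UH by the scheme in Appendix A.2.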
Models
We conduct experiments on two widely adopted open-source LLMs, LLaMA-3 Dubey et al. (2024) and Mistral-v0.3 Jiang et al. (2023). Due to space limits, implementation details are presented in Appendix A.3, and parallel experimental results on Mistral are summarized in Appendix B.
## 4 Analysis of Internal States in LLMs
To focus our analysis, we first conduct causal interventions to identify hidden states that are crucial for eliciting factual associations (FAs). We then compare their behavior across associated hallucinations (AHs) and unassociated hallucinations (UHs). Prior studies Azaria and Mitchell (2023); Gottesman and Geva (2024); Yüksekgönül et al. (2024); Orgad et al. (2025) suggest that hidden states can reveal when a model hallucinates. This assumes that the model's internal computations differ when producing correct versus incorrect outputs, causing their hidden states to occupy distinct subspaces. We revisit this claim by examining how hidden states update when recalling three categories of knowledge (i.e., FAs, AHs, and UHs). If hidden states primarily signal hallucination, AHs and UHs should behave similarly and diverge from FAs. Conversely, if hidden states reflect reliance on encoded knowledge, FAs and AHs should appear similar, and both should differ from UHs.
### 4.1 Causal Analysis of Information Flow
We identify hidden states that are crucial for factual prediction. For each knowledge tuple (subject, relation, object), the model is prompted with a factual query (e.g., "The name of the father of Joe Biden is"). Correct predictions indicate that the model successfully elicits parametric knowledge. Using causal mediation analysis Vig et al. (2020); Finlayson et al. (2021); Meng et al. (2022); Geva et al. (2023), we intervene on intermediate computations and measure the change in output distribution via JS divergence. A large divergence indicates that the intervened computation is critical for producing the fact. Specifically, to test whether token $i$'s hidden states in the MLP at layer $\ell$ are crucial for eliciting knowledge, we replace the computation with a corrupted version and observe how the output distribution changes. Similarly, following Geva et al. (2023), we mask the attention flow between tokens at layer $\ell$ using a window size of 5 layers. To streamline implementation, interventions target only subject tokens, attention flow, and the last token. Notable observations are as follows:
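As a concrete illustration, the scoring step of this procedure can be sketched in a few lines of NumPy. The distributions below are toy stand-ins; the actual patching of model internals is assumed to happen elsewhere and is not shown.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two output distributions (in nats)."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy next-token distributions before and after corrupting a hidden state;
# a large divergence marks that state as causally influential (the darker
# cells in the intervention heatmaps).
clean = np.array([0.7, 0.2, 0.1])
patched = np.array([0.1, 0.2, 0.7])
score = js_divergence(clean, patched)
```

JS divergence is symmetric and bounded by $\ln 2$, which makes it convenient for comparing intervention effects on a common scale across layers.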
Obs1: Hidden states crucial for eliciting factual associations.
The results in Figure 2(a) show that three components dominate factual predictions: (1) subject representations in early-layer MLPs, (2) mid-layer attention between subject tokens and the final token, and (3) final-token representations in later layers. These results trace a clear information flow: subject representation, attention flow from the subject to the last token, and last-token representation, consistent with Geva et al. (2023). Each of these three types of internal states is discussed in detail below (§ 4.2–4.4).
Obs2: Associated hallucinations follow the same information flow as factual associations.
When generating AHs, interventions on these same components also produce large distribution shifts (Figure 2(b)). This indicates that, although outputs are factually wrong, the model still relies on encoded subject information.
Obs3: Unassociated hallucinations present a different information flow.
In contrast, interventions during UH generation cause smaller distribution shifts (Figure 2(c)), showing weaker reliance on the subject. This suggests that UHs emerge from computations not anchored in the subject representation, different from both FAs and AHs.
### 4.2 Analysis of Subject Representations
The analysis in § 4.1 reveals that, in the early layers of LLMs, unassociated hallucinations (UHs) are processed differently from factual associations (FAs) and associated hallucinations (AHs), which share a similar pattern. We examine how these differences emerge in the subject representations and why early-layer modules behave this way.
#### 4.2.1 Norm of Subject Representations
Figure 3: Norm ratio curves of subject representations in LLaMA-3-8B, comparing AHs and UHs against FAs as the baseline.
To test whether subject representations differ across categories, we measure the average $L_2$ norm of subject-token hidden activations across layers. For subject tokens $t_{s_1},\dots,t_{s_n}$ at layer $\ell$, the average norm is computed by Equation (1):
$$
\|h_s^\ell\| = \frac{1}{n}\sum_{i=1}^{n}\|h_{s_i}^\ell\|_2. \tag{1}
$$
We compare the norm ratio between hallucination samples (AHs or UHs) and correct predictions (FAs), where a ratio near 1 indicates similar norms. Figure 3 shows that in LLaMA-3-8B, AH norms closely match those of correct samples (ratio $\approx$ 0.99), while UH norms are consistently smaller, starting at the first layer (ratio $\approx$ 0.96) and diverging further through the mid-layers.
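The norm comparison can be sketched as follows; this is a toy NumPy illustration in which the hidden-state arrays are hypothetical stand-ins for actual model activations.

```python
import numpy as np

def avg_subject_norm(h_subj: np.ndarray) -> float:
    """Average L2 norm over subject-token hidden states at one layer.
    h_subj has shape (n_subject_tokens, d_model)."""
    return float(np.linalg.norm(h_subj, axis=-1).mean())

def norm_ratio(h_hallu: np.ndarray, h_fact: np.ndarray) -> float:
    """Ratio of hallucination-sample norms to the factual baseline;
    a value near 1 means comparable activation strength."""
    return avg_subject_norm(h_hallu) / avg_subject_norm(h_fact)
```

Computing this ratio per layer and plotting it against layer index reproduces the kind of curves shown in Figure 3.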
Findings:
At early layers, UH subject representations exhibit weaker activations than FAs, whereas AHs exhibit norms similar to FAs.
#### 4.2.2 Relation to Parametric Knowledge
Figure 4: Comparison of subspace overlap ratios.
We next investigate why early layers encode subject representations differently across knowledge types by examining how inputs interact with the parametric knowledge stored in MLP modules. Following Kang et al. (2024), the output norm of an MLP layer depends on how well its input aligns with the subspace spanned by its weight matrix: poorly aligned inputs yield smaller output norms.
For each MLP layer $\ell$, we analyze the down-projection weight matrix $W_{\text{down}}^\ell$ and its input $x^\ell$. Given the input $x_s^\ell$ corresponding to the subject tokens, we compute its overlap ratio with the top singular subspace $V_{\text{top}}$ of $W_{\text{down}}^\ell$:
$$
r(x_s^\ell)=\frac{\left\lVert {x_s^\ell}^\top V_{\text{top}} V_{\text{top}}^\top \right\rVert^2}{\left\lVert x_s^\ell \right\rVert^2}. \tag{2}
$$
A higher overlap ratio $r(x_s^\ell)$ indicates stronger alignment with the subspace spanned by $W_{\text{down}}^\ell$, leading to larger output norms.
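Equation (2) can be computed directly from an SVD of the weight matrix. The sketch below is a simplified NumPy illustration; the subspace size $k$ and the weight convention (W acting as `W_down @ x`, shape `(d_out, d_in)`) are assumptions for this example, not the paper's exact setup.

```python
import numpy as np

def overlap_ratio(x: np.ndarray, W_down: np.ndarray, k: int) -> float:
    """Overlap of input x with the top-k right singular subspace of W_down
    (Equation (2)). Assumes W_down has shape (d_out, d_in) and acts on x
    from the left, so its right singular vectors span the input space."""
    _, _, Vt = np.linalg.svd(W_down)
    V_top = Vt[:k].T                      # (d_in, k): top input directions
    proj = x @ V_top @ V_top.T            # project x onto that subspace
    return float(np.sum(proj ** 2) / np.sum(x ** 2))
```

Inputs that lie in the span of the top right-singular directions are amplified most by the layer, which is why a higher ratio implies a larger output norm.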
To highlight relative deviations from the factual baseline (FA), we report the relative ratios AH/FA and UH/FA. Focusing on the layer with the largest UH norm shift, Figure 4 shows that UHs have a significantly lower $r(x_s^\ell)$ than AHs in both LLaMA and Mistral. This reveals that early-layer parametric weights are more aligned with FA and AH subject representations than with UH subjects, producing larger norms for the former. These results also suggest that the model sufficiently learned representations for FA and AH subjects during pretraining, but not for UH subjects.
Findings:
Similar to FAs, AH hidden activations align closely with the weight subspace, while UHs do not. This indicates that the model has sufficiently encoded subject representations into parametric knowledge for FAs and AHs but not for UHs.
#### 4.2.3 Correlation with Subject Popularity
Figure 5: Sample distribution across different subject popularity (low, mid, high) in LLaMA-3-8B, measured by monthly Wikipedia page views.
We further investigate why AH representations align with weight subspaces as strongly as FAs, while UHs do not. A natural hypothesis is that this difference arises from subject popularity in the training data. We use average monthly Wikipedia page views as a proxy for subject popularity during pre-training, bin subjects by popularity, and then measure the distribution of UHs, AHs, and FAs in each bin. Figure 5 shows a clear trend: UHs dominate among the least popular subjects (94% for LLaMA), while AHs are rare (1%). As subject popularity rises, UH frequency falls and both FAs and AHs become more common, with AHs rising to 14% among the most popular subjects. This indicates that subject representation norms reflect training frequency, not factual correctness.
Findings:
Popular subjects yield stronger early-layer activations. AHs arise mainly on popular subjects and are therefore indistinguishable from FAs by popularity-based heuristics, contradicting prior work Mallen et al. (2023a) that links popularity to hallucinations.
### 4.3 Analysis of Attention Flow
Having examined how the model forms subject representations, we next study how this information is propagated to the last token of the input, where the model generates the object of a knowledge tuple. To produce factually correct outputs, the model must propagate the subject representation through attention layers so that it can be read out at the last position Geva et al. (2023).
To quantify the contribution of subject tokens $(s_1,\dots,s_n)$ to the last token, we compute the attention contribution from the subject tokens to the last position:
$$
a^\ell_{\text{last}}=\sum_{k}\sum_{h} A^{\ell,h}_{\text{last},s_k}\left(h^{\ell-1}_{s_k} W^{\ell,h}_V\right) W^{\ell,h}_O, \tag{3}
$$
where $A^{\ell,h}_{i,j}$ denotes the attention weight assigned by the $h$-th head in layer $\ell$ from the last position $i$ to subject token $j$. Here, $a^\ell_{\text{last}}$ represents the subject-to-last attention contribution at layer $\ell$. Intuitively, if subject information is critical for the prediction, this contribution should have a large norm; otherwise, the norm should be small.
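Equation (3) can be sketched as follows. The tensor shapes and the head-wise split of the output projection are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def subject_attention_contribution(A, H_prev, W_V, W_O, subj_idx, last_idx):
    """Equation (3): summed per-head value-output contribution from subject
    tokens to the last position at one layer.
    A:      (n_heads, seq, seq)        attention weights
    H_prev: (seq, d_model)             previous-layer hidden states
    W_V:    (n_heads, d_model, d_head) value projections
    W_O:    (n_heads, d_head, d_model) per-head output projections
    """
    d_model = H_prev.shape[-1]
    a_last = np.zeros(d_model)
    for h in range(A.shape[0]):
        for k in subj_idx:
            v = H_prev[k] @ W_V[h]                     # value vector of s_k
            a_last += A[h, last_idx, k] * (v @ W_O[h]) # weighted OV output
    return a_last

# The norm ||a_last|| measures how strongly subject information flows
# into the last position at this layer.
```

With one head and identity projections, the result reduces to the attention-weighted sum of the subject tokens' hidden states, which matches the intuition behind the metric.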
Figure 6 shows that in LLaMA-3-8B, both AHs and FAs exhibit large attention-contribution norms in the mid-layers, indicating a strong information flow from subject tokens to the last token. In contrast, UHs show consistently lower norms, implying that their predictions rely far less on subject information. Yüksekgönül et al. (2024) previously argued that high attention flow from subject tokens signals factuality and proposed using attention-based hidden states to detect hallucinations. Our results challenge this view: the model propagates subject information just as strongly when generating AHs as when producing correct facts.
Findings:
Mid-layer attention flow from subject to last token is equally strong for AHs and FAs but weak for UHs. Attention-based heuristics can therefore separate UHs from FAs but cannot distinguish AHs from factual outputs, limiting their reliability for hallucination detection.
Figure 6: Subject-to-last attention contribution norms across layers in LLaMA-3-8B. Values show the norm of the attention contribution from subject tokens to the last token at each layer.
### 4.4 Analysis of Last Token Representations
Our earlier analysis showed strong subject-to-last token information transfer for both FAs and AHs, but minimal transfer for UHs. We now examine how this difference shapes the distribution of last-token representations. When subject information is weakly propagated (UHs), last-token states receive little subject-specific update. For UH samples sharing the same prompt template, these states should therefore cluster in the representation space. In contrast, strong subject-driven propagation in FAs and AHs produces diverse last-token states that disperse into distinct subspaces.
To test this, we compute cosine similarity among last-token representations $h_T^\ell$. As shown in Figure 7, similarity is high ($\approx$ 0.9) for all categories in the early layers, when little subject information has been transferred. From the mid-layers onward, FAs and AHs diverge sharply, dropping to $\approx$ 0.2 by layer 25. UHs remain moderately clustered, with similarity only declining to $\approx$ 0.5.
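The clustering measure can be sketched as mean pairwise cosine similarity over a batch of last-token states; this is a toy NumPy illustration with hypothetical inputs.

```python
import numpy as np

def mean_pairwise_cosine(H: np.ndarray) -> float:
    """Average cosine similarity over all pairs of last-token representations.
    H has shape (n_samples, d_model); high values mean tight clustering."""
    X = H / np.linalg.norm(H, axis=1, keepdims=True)   # unit-normalize rows
    S = X @ X.T                                        # cosine similarity matrix
    iu = np.triu_indices(len(H), k=1)                  # distinct pairs only
    return float(S[iu].mean())
```

A value near 1 (as for UHs here) indicates that the states collapse toward a shared, prompt-template-driven direction, while low values indicate subject-specific dispersion.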
Figure 8 shows the t-SNE visualization of the last token's representations at layer 25 of LLaMA-3-8B. The hidden representations of UHs are clearly separated from those of FAs, whereas AHs substantially overlap with FAs. These results indicate that the model processes UHs differently from FAs, while processing AHs in a manner similar to FAs. More visualizations can be found in Appendix C.
<details>
<summary>x9.png Details</summary>

Line chart of cosine similarity (y-axis, roughly 0.25 to 0.95) across layers 0 to 30 (x-axis) for three series: Factual Associations (green), Associated Hallucinations (blue), and Unassociated Hallucinations (red). All series start near 0.9 in early layers. From the mid-layers, FAs and AHs drop sharply together, reaching about 0.25 by layer 25, while UHs decline only to about 0.5; all series partially recover in the final layers.
</details>
Figure 7: Cosine similarity of target-token hidden states across layers in LLaMA-3-8B.
<details>
<summary>x10.png Details</summary>

2-D t-SNE scatter plot of last-token representations, colour-coded as Factual Asso. (green), Asso. Hallu. (blue), and Unasso. Hallu. (red). The green and blue points intermix in one region of the space, while the red points form a separate, dense cluster, with only a few outliers outside each group.
</details>
Figure 8: t-SNE visualization of the last token's representations at layer 25 of LLaMA-3-8B.
<details>
<summary>x11.png Details</summary>

Violin plots of answer token probability (y-axis, 0.0 to 1.0) for LLaMA-3-8B and Mistral-7B-v0.3. For both models, Factual Associations (green) and Associated Hallucinations (blue) show broad, similar distributions with medians around 0.35 to 0.40, while Unassociated Hallucinations (red) concentrate near zero with medians around 0.10 to 0.12.
</details>
Figure 9: Distribution of last token probabilities.
This separation also appears in the entropy of the output distribution (Figure 9). Strong subject-to-last propagation in FAs and AHs yields low-entropy predictions concentrated on the correct or associated entity. In contrast, weak propagation in UHs produces broad, high-entropy distributions, spreading probability mass across many plausible candidates (e.g., multiple possible names for "The name of the father of <subject> is").
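The entropy contrast can be measured directly from the output logits at the answer position; a minimal sketch (reading the first answer token and using natural-log entropy are our assumptions, as the paper does not specify these details):

```python
import numpy as np

def next_token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the model's next-token distribution.

    logits: unnormalised scores over the vocabulary for the answer's
    first token. Low entropy = mass concentrated on few candidates
    (FA/AH pattern); high entropy = mass spread over many plausible
    answers (UH pattern).
    """
    z = logits - logits.max()            # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()      # softmax
    return float(-(p * np.log(p + 1e-12)).sum())
```

A uniform distribution over a vocabulary of size $V$ gives the maximum entropy $\log V$, while a one-hot distribution gives entropy near zero.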
Finding:
From mid-layers onward, UHs retain clustered last-token representations and high-entropy outputs, while FAs and AHs diverge into subject-specific subspaces with low-entropy outputs. This provides a clear signal for separating UHs from FAs and AHs, but none for separating AHs from FAs.
## 5 Revisiting Hallucination Detection
The mechanistic analysis in § 4 reveals that the internal states of LLMs primarily capture how the model recalls and utilizes its parametric knowledge, not whether the output is truthful. As both factual associations (FAs) and associated hallucinations (AHs) rely on the same subject-driven knowledge recall, their internal states show no clear separation. We therefore hypothesize that internal or black-box signals cannot effectively distinguish AHs from FAs, even though they can be effective in distinguishing unassociated hallucinations (UHs), which do not rely on parametric knowledge, from FAs.
Experimental Setups
To verify this, we revisit the effectiveness of widely adopted white-box hallucination detection approaches that probe internal states, as well as black-box approaches that rely on scalar features. We evaluate in three settings: 1) AH Only (1,000 FAs and 1,000 AHs for training; 200 of each for testing), 2) UH Only (1,000 FAs and 1,000 UHs for training; 200 of each for testing), and 3) Full (1,000 FAs and 1,000 hallucination samples drawn from a mix of AHs and UHs for training; 200 of each for testing). For each setting, we use five random seeds to construct the training and testing datasets, and report the mean AUROC along with its standard deviation across seeds.
White-box methods: We extract and normalize internal features and then train a probe.
- Subject representations: last subject token hidden state from three consecutive layers Gottesman and Geva (2024).
- Attention flow: attention weights from the last token to subject tokens across all layers YĂŒksekgönĂŒl et al. (2024).
- Last-token representations: final token hidden state from the last layer Orgad et al. (2025).
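A white-box probe of this kind can be sketched as follows. The paper specifies normalized features, a trained probe, five seeds, and mean ± std AUROC; the choice of logistic regression as the probe and the exact split routine are our assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

def probe_auroc(feats: np.ndarray, labels: np.ndarray,
                seeds=(0, 1, 2, 3, 4), train_frac=0.83):
    """Train a linear probe on internal features; AUROC over seeds.

    feats: (n_samples, d) internal features (e.g. a last-token hidden
    state); labels: 1 = hallucination, 0 = factual association.
    train_frac ~ 0.83 mirrors 1,000+1,000 train vs. 200+200 test.
    """
    scores = []
    for seed in seeds:
        # Reshuffle the train/test split with each seed.
        idx = np.random.default_rng(seed).permutation(len(labels))
        cut = int(train_frac * len(labels))
        tr, te = idx[:cut], idx[cut:]
        scaler = StandardScaler().fit(feats[tr])       # normalise features
        clf = LogisticRegression(max_iter=1000).fit(
            scaler.transform(feats[tr]), labels[tr])
        prob = clf.predict_proba(scaler.transform(feats[te]))[:, 1]
        scores.append(roc_auc_score(labels[te], prob))
    return float(np.mean(scores)), float(np.std(scores))
```

The same routine applies to any of the three feature types above; only the feature extraction step changes.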
Black-box methods: We test two commonly used scalar features: answer token probability (Orgad et al., 2025) and subject popularity (average monthly Wikipedia page views) (Mallen et al., 2023a). As discussed in § 4.2.3 and § 4.4, these features also reflect whether the model relies on encoded knowledge to produce outputs, rather than truthfulness itself.
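For a scalar black-box feature, no probe is needed: AUROC can be computed on the raw feature, negated so that a lower answer-token probability scores as stronger evidence of hallucination. A toy sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy values for illustration only: hallucinated answers tend to carry
# lower answer-token probability, so negate the feature to score the
# positive (hallucination) class.
token_prob = np.array([0.35, 0.40, 0.12, 0.08, 0.33, 0.10])
is_hallucination = np.array([0, 0, 1, 1, 0, 1])
auroc = roc_auc_score(is_hallucination, -token_prob)
```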
Experimental Results
| Method | LLaMA-3-8B (AH Only) | LLaMA-3-8B (UH Only) | Mistral-7B-v0.3 (AH Only) | Mistral-7B-v0.3 (UH Only) |
| --- | --- | --- | --- | --- |
| Subject | $0.65 ± 0.02$ | $0.91 ± 0.01$ | $0.57 ± 0.02$ | $0.81 ± 0.02$ |
| Attention | $0.58 ± 0.04$ | $0.92 ± 0.02$ | $0.58 ± 0.07$ | $0.87 ± 0.01$ |
| Last Token | $0.69 ± 0.03$ | $0.93 ± 0.01$ | $0.63 ± 0.02$ | $0.92 ± 0.01$ |
| Probability | $0.49 ± 0.01$ | $0.86 ± 0.01$ | $0.46 ± 0.00$ | $0.89 ± 0.00$ |
| Subject Pop. | $0.48 ± 0.01$ | $0.87 ± 0.01$ | $0.52 ± 0.01$ | $0.84 ± 0.01$ |
Table 2: Hallucination detection performance on AH Only and UH Only settings.
<details>
<summary>x12.png Details</summary>

Grouped bar chart of AUROC (y-axis, 0.4 to 0.9) by representation type (Subject, Attention, Last Token). Unassociated Hallucination bars (red) sit at roughly 0.83 to 0.88, Associated Hallucination bars (blue) at roughly 0.56 to 0.60; the error bars of the two categories do not overlap for any representation type.
</details>
Figure 10: Hallucination detection performance on the Full setting (LLaMA-3-8B).
Table 2 shows that hallucination detection methods behave very differently in the AH Only and UH Only settings. For white-box probes, all approaches effectively distinguish UHs from FAs, with last-token hidden states reaching AUROC scores of about 0.93 for LLaMA and 0.92 for Mistral. In contrast, performance drops sharply in the AH Only setting, where the last-token probe falls to 0.69 for LLaMA and 0.63 for Mistral. Black-box methods follow the same pattern. Figure 10 further highlights this disparity under the Full setting: detection is consistently stronger on UH samples than on AH samples, and adding AHs to the training set significantly dilutes performance on UHs (AUROC $\approx 0.9$ on UH Only vs. $\approx 0.8$ on Full).
These results confirm that both internal probes and black-box methods capture whether a model draws on parametric knowledge, not whether its outputs are factually correct. Unassociated hallucinations are easier to detect because they bypass this knowledge, while associated hallucinations are produced through the same recall process as factual answers, leaving no internal cues to distinguish them. As a result, LLMs lack intrinsic awareness of their own truthfulness, and detection methods relying on these signals risk misclassifying associated hallucinations as correct, fostering harmful overconfidence in model outputs.
## 6 Challenges of Refusal Tuning
A common strategy to mitigate potential hallucination in the model's responses is to fine-tune LLMs to refuse answering when they cannot provide a factual response, e.g., Refusal Tuning Zhang et al. (2024). For such refusal capability to generalize, the training data must contain a shared feature pattern across hallucinated outputs, allowing the model to learn and apply it to unseen cases.
Our analysis in the previous sections shows that this prerequisite is not met. The structural mismatch between UHs and AHs suggests that refusal tuning on UHs may generalize to other UHs, because their hidden states occupy a common activation subspace, but will not transfer to AHs. Refusal tuning on AHs is even less effective, as their diverse representations prevent generalization to either unseen AHs or UHs.
Experimental Setups
To verify this hypothesis, we conduct refusal tuning on LLMs under two settings: 1) UH Only, where 1,000 UH samples are paired with 10 refusal templates, and 1,000 FA samples are preserved with their original answers; 2) AH Only, where 1,000 AH samples are paired with refusal templates, with 1,000 FA samples again left unchanged. We then evaluate both models on 200 samples each of FAs, UHs, and AHs. A response matching any refusal template is counted as a refusal, and we report the Refusal Ratio as the proportion of samples eliciting refusals. This measures not only whether the model refuses appropriately on UHs and AHs, but also whether it "over-refuses" on FA samples.
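The Refusal Ratio can be computed by matching each generated response against the refusal templates; a minimal sketch (case-insensitive substring matching is our assumption, as the paper does not specify the exact matching rule):

```python
def refusal_ratio(responses, refusal_templates):
    """Percentage of responses matching any refusal template.

    A response counts as a refusal if any template string appears in
    it; matching is case-insensitive substring containment here, which
    is an assumption rather than the paper's stated criterion.
    """
    def refused(response):
        return any(t.lower() in response.lower() for t in refusal_templates)
    n_refused = sum(refused(r) for r in responses)
    return 100.0 * n_refused / len(responses)
```

Running this separately over the 200-sample FA, AH, and UH test sets yields the three bars per training setting in Figure 11.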
Experimental Results
<details>
<summary>x13.png Details</summary>

Grouped bar chart of Refusal Ratio (%) by training set (UH Only, AH Only) and testing set (Factual Asso., Asso. Hallu., Unasso. Hallu.). With UH Only training, refusals are roughly 30% on FAs, 28% on AHs, and 82% on UHs; with AH Only training, refusals are roughly 22% on FAs, 33% on AHs, and 24% on UHs.
</details>
Figure 11: Refusal tuning performance across three types of samples (LLaMA-3-8B).
Figure 11 shows that training with UHs generalizes strongly to other UHs, with a refusal ratio of 82% for LLaMA. However, this effect does not transfer to AHs, where the refusal ratio falls to 28%. Moreover, some FA cases are mistakenly refused (29.5%). These results confirm that UHs share a common activation subspace, supporting generalization within the category, while AHs and FAs lie outside this space. By contrast, training with AHs produces poor generalization. On AH test samples, the refusal ratio is only 33%, validating that their subject-specific hidden states prevent consistent refusal learning. Generalization to UHs is also weak (23.5%), again reflecting the divergence between the AH and UH activation spaces.
Overall, these findings show that the generalizability of refusal tuning is fundamentally limited by the heterogeneous nature of hallucinations. UH representations are internally consistent enough to support refusal generalization, but AH representations are too diverse for either UH-based or AH-based training to yield a broadly applicable and reliable refusal capability.
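As a rough illustration of how such per-category refusal ratios could be tallied, the sketch below assumes hypothetical record fields (`category`, `response`) and a heuristic list of refusal phrases; neither is the paper's actual schema.

```python
# Sketch: computing per-category refusal ratios for a refusal-tuned model.
# Field names ("category", "response") and the marker list are illustrative.
from collections import defaultdict

REFUSAL_MARKERS = ("i don't know", "i am not sure", "cannot answer")  # assumed phrases

def is_refusal(response: str) -> bool:
    """Heuristically flag a response as a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_ratios(records):
    """records: iterable of dicts with 'category' (FA/AH/UH) and 'response'."""
    counts = defaultdict(lambda: [0, 0])  # category -> [refusals, total]
    for r in records:
        stats = counts[r["category"]]
        stats[0] += is_refusal(r["response"])
        stats[1] += 1
    return {cat: refused / total for cat, (refused, total) in counts.items()}

demo = [
    {"category": "UH", "response": "I don't know the answer."},
    {"category": "UH", "response": "Tokyo"},
    {"category": "AH", "response": "Honolulu"},
]
print(refusal_ratios(demo))  # {'UH': 0.5, 'AH': 0.0}
```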
## 7 Conclusions and Future Work
In this work, we revisit the widely accepted claim that hallucinations can be detected from a model's internal states. Our mechanistic analysis reveals that hidden states encode whether models rely on their parametric knowledge, rather than truthfulness. As a result, detection methods succeed only when outputs are detached from the input, but fail when hallucinations arise from the same knowledge-recall process as correct answers.
These findings lead to three key implications. First, future evaluations should report detection performance separately for Associated Hallucinations (AHs) and Unassociated Hallucinations (UHs), as they stem from fundamentally different internal processes and require distinct detection strategies. Second, relying solely on hidden states is insufficient for reliable hallucination detection. Future research should integrate LLMs with external feedback mechanisms, such as fact-checking modules or retrieval-based verifiers, to assess factuality more robustly. Third, future studies should prioritize improving AH detection. Because AHs occur more frequently in widely known or highly popular topics (§ 4.2.3), their undetected errors pose greater risks to user trust and the practical reliability of LLMs.
## Limitations
We identify several limitations of our work.
Focus on Factual Knowledge
While our analysis identifies failure cases of hallucination detection methods, our study is primarily limited to factual completion prompts. It does not extend to long-form or open-ended text generation tasks Wei et al. (2024); Min et al. (2023); Huang and Chen (2024). Future work should broaden this investigation to these tasks in order to draw more comprehensive conclusions.
Lack of Analysis on Prompt-based Hallucination Detection Approaches
Our analysis focuses on white-box hallucination detection methods based on internal states and two black-box approaches based on external features. We do not include verbalization-based strategies Lin et al. (2022a); Tian et al. (2023); Xiong et al. (2024); Yang et al. (2024b); Ni et al. (2024); Zhao et al. (2024), such as prompting the model to report or justify its confidence explicitly, which constitute a different line of approach. Exploring such approaches may offer complementary insights into how models internally represent and express uncertainty.
Applicability to Black-box LLMs or Large Reasoning Models
Our study is limited to open-source LLMs. Conducting mechanistic analyses on commercial black-box LLMs is not permitted due to access restrictions. Future work could explore alternative evaluation protocols or collaboration frameworks that enable partial interpretability analyses on such systems. In addition, recent studies Mei et al. (2025); Zhang et al. (2025) have begun examining the internal states of large reasoning models for hallucination detection, suggesting a promising direction for extending our methodology to models with multi-step reasoning capabilities.
## Ethical Considerations
This work analyzes the internal mechanisms of large language models using data constructed from Wikidata Vrandecic and Krötzsch (2014), which is released under the Creative Commons CC0 1.0 Universal license, allowing unrestricted use and redistribution of its data. All data are derived from publicly available resources, and no private or sensitive information about individuals is included. We employed LLM tools only to polish the writing.
## References
- Azaria and Mitchell (2023) Amos Azaria and Tom M. Mitchell. 2023. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976.
- Cheang et al. (2023) Chi Seng Cheang, Hou Pong Chan, Derek F. Wong, Xuebo Liu, Zhaocong Li, Yanming Sun, Shudong Liu, and Lidia S. Chao. 2023. Can lms generalize to future data? an empirical analysis on text summarization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 16205–16217. Association for Computational Linguistics.
- Chen et al. (2024) Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: llms' internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Daniel Han and team (2023) Daniel Han, Michael Han, and Unsloth team. 2023. Unsloth.
- Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Ding et al. (2024) Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. 2024. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. CoRR, abs/2402.10612.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, and 82 others. 2024. The llama 3 herd of models. CoRR, abs/2407.21783.
- Finlayson et al. (2021) Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart M. Shieber, Tal Linzen, and Yonatan Belinkov. 2021. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1828–1843. Association for Computational Linguistics.
- Gekhman et al. (2025) Zorik Gekhman, Eyal Ben-David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. 2025. Inside-out: Hidden factual knowledge in llms. CoRR, abs/2503.15299.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12216–12235. Association for Computational Linguistics.
- Gottesman and Geva (2024) Daniela Gottesman and Mor Geva. 2024. Estimating knowledge in large language models without generating a single token. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, pages 3994–4019.
- Guerreiro et al. (2023) Nuno Miguel Guerreiro, Elena Voita, and André F. T. Martins. 2023. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 1059–1075. Association for Computational Linguistics.
- Huang and Chen (2024) Chao-Wei Huang and Yun-Nung Chen. 2024. Factalign: Long-form factuality alignment of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16363–16375.
- Huang et al. (2025) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43(2):42:1–42:55.
- Ji et al. (2024) Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, and Pascale Fung. 2024. LLM internal states reveal hallucination risk faced with a query. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 88–104, Miami, Florida, US. Association for Computational Linguistics.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
- Kang and Choi (2023) Cheongwoong Kang and Jaesik Choi. 2023. Impact of co-occurrence on factual knowledge of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7721–7735.
- Kang et al. (2024) Katie Kang, Amrith Setlur, Claire J. Tomlin, and Sergey Levine. 2024. Deep neural networks tend to extrapolate predictably. In The Twelfth International Conference on Learning Representations, ICLR 2024.
- Kapoor et al. (2024) Sanyam Kapoor, Nate Gruver, Manley Roberts, Katie Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. 2024. Large language models must be taught to know what they don't know. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.
- Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Li et al. (2023) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36:41451–41530.
- Li et al. (2025) Moxin Li, Yong Zhao, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, Tat-Seng Chua, and Yang Deng. 2025. Knowledge boundary of large language models: A survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, pages 5131–5157.
- Lin et al. (2022a) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022a. Teaching models to express their uncertainty in words. Trans. Mach. Learn. Res., 2022.
- Lin et al. (2022b) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022b. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 3214–3252.
- Mallen et al. (2023a) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023a. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, pages 9802–9822.
- Mallen et al. (2023b) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9802–9822. Association for Computational Linguistics.
- Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9004–9017. Association for Computational Linguistics.
- Mei et al. (2025) Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, and Anirudha Majumdar. 2025. Reasoning about uncertainty: Do reasoning models know when they don't know? CoRR, abs/2506.18183.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, pages 12076–12100.
- Ni et al. (2024) Shiyu Ni, Keping Bi, Jiafeng Guo, and Xueqi Cheng. 2024. When do llms need retrieval augmentation? mitigating llms' overconfidence helps retrieval augmentation. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11375–11388. Association for Computational Linguistics.
- Ni et al. (2025) Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, and Xueqi Cheng. 2025. Towards fully exploiting LLM internal states to enhance knowledge boundary perception. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24315–24329. Association for Computational Linguistics.
- Orgad et al. (2025) Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. 2025. Llms know more than they show: On the intrinsic representation of LLM hallucinations. In The Thirteenth International Conference on Learning Representations, ICLR 2025.
- Sciavolino et al. (2021) Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. 2021. Simple entity-centric questions challenge dense retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 6138–6148. Association for Computational Linguistics.
- Su et al. (2024) Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 14379–14391. Association for Computational Linguistics.
- Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 5433–5442. Association for Computational Linguistics.
- Varshney et al. (2023) Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. CoRR, abs/2307.03987.
- Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388–12401.
- Vrandecic and Krötzsch (2014) Denny Vrandecic and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85.
- Wei et al. (2024) Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V. Le. 2024. Long-form factuality in large language models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
- Xiao et al. (2025) Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, and Yu Rong. 2025. Analyzing llms' knowledge boundary cognition across languages through the lens of internal representations. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24099–24115. Association for Computational Linguistics.
- Xiong et al. (2024) Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2024. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Yang et al. (2024a) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others. 2024a. Qwen2.5 technical report. CoRR, abs/2412.15115.
- Yang et al. (2024b) Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. 2024b. Alignment for honesty. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.
- Yin et al. (2023) Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do large language models know what they don't know? In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 8653–8665. Association for Computational Linguistics.
- Yona et al. (2024) Gal Yona, Roee Aharoni, and Mor Geva. 2024. Narrowing the knowledge evaluation gap: Open-domain question answering with multi-granularity answers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 6737–6751. Association for Computational Linguistics.
- Yüksekgönül et al. (2024) Mert Yüksekgönül, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, and Besmira Nushi. 2024. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. In The Twelfth International Conference on Learning Representations, ICLR 2024.
- Zhang et al. (2024) Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2024. R-tuning: Instructing large language models to say "I don't know". In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, pages 7113–7139.
- Zhang et al. (2023a) Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, and Kumar Sricharan. 2023a. SAC$^3$: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. CoRR, abs/2311.01740.
- Zhang et al. (2025) Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, and Han Qiu. 2025. On the self-awareness of large reasoning models' capability boundaries. Preprint, arXiv:2509.24711.
- Zhang et al. (2023b) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023b. Siren's song in the AI ocean: A survey on hallucination in large language models. CoRR, abs/2309.01219.
- Zhao et al. (2024) Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Chong Meng, Shuaiqiang Wang, Zhicong Cheng, Zhaochun Ren, and Dawei Yin. 2024. Knowing what llms DO NOT know: A simple yet effective self-detection method. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, pages 7051–7063.
## Appendix
## Appendix A Datasets and Implementations
### A.1 Selected Relations and Prompt Templates
We employed a set of criteria to select relations from Wikidata to construct our dataset. Our criteria largely follow the framework proposed by Gekhman et al. (2025). Specifically, we require that each factual query in the dataset be unambiguous: given a subject–relation pair, the object should be unique and easily verifiable. The criteria are as follows:
- Avoid granularity ambiguity. We exclude relations whose answers can vary in their level of detail. For example, in location queries, the response could be expressed as a city, state, or country, making it ill-defined Yona et al. (2024).
- Avoid surface-level guessing. We exclude relations whose answers can often be inferred from shallow patterns. For instance, country of citizenship can frequently be guessed from shallow lexical patterns rather than reflecting actual memorization Mallen et al. (2023b).
Following these criteria, Gekhman et al. (2025) narrowed the 24 relations introduced by Sciavolino et al. (2021) down to four. However, we observe that their filtering primarily addresses ambiguity at the relation and object levels, but does not consider ambiguity at the subject level. In practice, some relations involve subjects that are inherently ambiguous. For example, the relation record label can be problematic because many songs share identical names, leading to unclear subject–object mappings.
To mitigate such cases, we apply an additional subject-level filtering step and restrict our dataset to relations where the subject is a person, thereby reducing ambiguity. In addition, we manually include certain relations to strengthen the dataset. Concretely, we use the following four relations: P22 (father), P25 (mother), P26 (spouse), and P569 (date of birth). We show the list of the templates used to create our dataset in Table 3.
| Relation | Prompt Template |
| --- | --- |
| father | The name of the father of [subject] is |
| mother | The name of the mother of [subject] is |
| spouse | The name of the spouse of [subject] is |
| date of birth | The birth date of [subject] is |
Table 3: Relations and prompt templates for querying factual knowledge of models. [subject] is a placeholder replaced with subject entities.
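For illustration, the Table 3 templates can be instantiated programmatically; the sketch below reuses the Wikidata property IDs from the text as dictionary keys (the helper itself is ours, not released code).

```python
# Sketch: instantiating the Table 3 prompt templates with subject entities.
TEMPLATES = {
    "P22": "The name of the father of [subject] is",
    "P25": "The name of the mother of [subject] is",
    "P26": "The name of the spouse of [subject] is",
    "P569": "The birth date of [subject] is",
}

def build_query(relation_id: str, subject: str) -> str:
    """Replace the [subject] placeholder with a concrete subject entity."""
    return TEMPLATES[relation_id].replace("[subject]", subject)

print(build_query("P22", "Barack Obama"))
# The name of the father of Barack Obama is
```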
| I will give you a factual query (e.g., "The name of the father of <subj>"), a gold answer to the factual query, and a proposed answer generated by an LLM. You need to compare the proposed answer to the gold answer and assign it one of the possible grades using the steps below. |
| --- |
| Possible grades are: |
| A: CORRECT |
| B: INCORRECT |
| C: WRONG GOLD |
| D: ERROR |
| Spelling errors, synonyms, abbreviations, or hedging expressions (e.g., "it is possible that") should not alter the grade if the person referred to in the proposed answer matches the gold answer. |
| Steps: |
| Step 1: If the gold answer does not correspond to an answer for the question, output "C" and finish. Otherwise, proceed to Step 2. |
| Step 2: Extract all predicted entities from the proposed answer. Proceed to Step 3. |
| Step 3: If each predicted entity refers to the answer mentioned in the gold answer, output "A" and finish. Otherwise, proceed to Step 4. |
| Step 4: If the predicted entity does not refer to the gold answer, output "B" and finish. Otherwise, proceed to Step 5. |
| Step 5: Double-check whether the proposed answer refers to a different answer from the gold answer. If it does, output "B". Otherwise, output "D" and finish. |
| Input format: |
| Question: {question} |
| Gold answer: {gold_answer} |
| Proposed answer: {proposed_answer} |
| Instruction: Output your reasoning steps. After that, conclude your response with "Output:" followed by the letter (A, B, C, or D). Do not provide any further explanation. |
Figure 12: LLM Judge prompt used for evaluation.
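Since the judge is instructed to end its response with "Output:" followed by a grade letter, the verdict can be recovered with a small parser; the sketch below is our own illustration, not the paper's released code.

```python
# Sketch: extracting the grade letter from an LLM-judge response that
# follows the Figure 12 protocol ("Output:" followed by A, B, C, or D).
import re

GRADES = {"A": "CORRECT", "B": "INCORRECT", "C": "WRONG GOLD", "D": "ERROR"}

def parse_judge_grade(response: str):
    """Return the letter after the last 'Output:' marker, or None if absent."""
    matches = re.findall(r"Output:\s*([ABCD])\b", response)
    return matches[-1] if matches else None

reply = "Step 2: the predicted entity matches the gold answer.\nOutput: A"
print(GRADES[parse_judge_grade(reply)])  # CORRECT
```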
### A.2 Labeling Scheme
We follow the criteria in § 3 to label the data samples into different categories:
- Factual Correctness: We construct correctness labels through a two-stage process. First, we use the spaCy (https://spacy.io/) named entity recognizer to extract the target entity from the model's output. If it matches the ground truth, the answer is marked correct. Otherwise, or if extraction fails, we rely on Qwen2.5-14B-Instruct Yang et al. (2024a) as an automatic judge to compare the predicted answer with the ground truth. Following Gekhman et al. (2025), we design the evaluation prompt shown in Figure 12.
- Subject Representation Reliance: We assess whether a prediction relies on the subject's representation by blocking attention from subject tokens and measuring the resulting distribution shift. If the subject is crucial, masking disrupts information flow and yields a large shift; if not, the effect is minimal. Concretely, we compare the output distributions of the original prompt and the masked prompt (e.g., with "Barack Obama" masked), using Jensen–Shannon (JS) divergence to quantify the difference. A high JS divergence indicates strong reliance on the subject, while a low value suggests limited contribution. We then set a threshold based on the average JS divergence across all correct answers, assuming these inherently depend on subject representations.
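The subject-reliance test can be sketched as follows. Here `p_original` and `p_masked` stand in for next-token probability distributions from forward passes with and without attention to the subject tokens; these helpers and names are illustrative, not the paper's implementation, and the threshold would be set to the average divergence over correct answers as described above.

```python
# Sketch of the subject-reliance test: JS divergence between next-token
# distributions with and without attention to subject tokens.
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence (base 2) between two probability vectors."""
    p, q = p + eps, q + eps          # avoid log(0)
    p, q = p / p.sum(), q / q.sum()  # renormalize
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def relies_on_subject(p_original, p_masked, threshold) -> bool:
    """Label a prediction subject-reliant if masking shifts the output a lot."""
    return js_divergence(p_original, p_masked) > threshold

# Toy example: masking barely changes q1 but strongly changes q2.
p = np.array([0.7, 0.2, 0.1])
q1 = np.array([0.68, 0.22, 0.10])  # small shift -> low divergence
q2 = np.array([0.05, 0.05, 0.90])  # large shift -> high divergence
assert js_divergence(p, q1) < js_divergence(p, q2)
```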
<details>
<summary>x14.png Details</summary>

### Visual Description
## Heatmap: Average Jensen-Shannon Divergence Across Model Layers
### Overview
The image is a heatmap visualizing the average Jensen-Shannon (JS) Divergence across 31 layers (0-30) of a model for three distinct categories or components. The divergence is represented by a color gradient, with darker blues indicating higher divergence values. The chart is designed to compare how the divergence metric evolves across model depth for different aspects of the model's processing.
### Components/Axes
* **Y-Axis (Vertical):** Lists three categorical components. From top to bottom:
* `Subj.` (likely "Subject")
* `Attn.` (likely "Attention")
* `Last.` (likely "Last" or "Final" layer representation)
* **X-Axis (Horizontal):** Labeled "Layer". It displays discrete layer numbers from 0 to 30, with tick marks at every even number (0, 2, 4, ..., 30).
* **Color Bar (Legend):** Located on the right side of the chart.
* **Title:** "Avg JS Divergence"
* **Scale:** A continuous vertical gradient from light blue/white at the bottom to dark blue at the top.
* **Labeled Ticks:** 0.1, 0.2, 0.3, 0.4, 0.5, 0.6. The gradient suggests values can exist between these ticks.
### Detailed Analysis
The heatmap is a grid where each cell's color corresponds to the average JS Divergence for a specific component at a specific layer. The following analysis is based on visual estimation of color intensity against the provided scale.
**1. "Subj." Row (Top Row):**
Subj.: very high divergence (~0.55-0.6) at layers 0-10, declining to ~0.4 by layer 18, then dropping below ~0.1 after layer 22. Attn.: consistently low (at or below ~0.1), with a slight mid-layer rise to ~0.15-0.2. Last.: monotonic increase from ~0.1 in the early layers to ~0.35 at the final layers. The Subj. and Last. rows thus show opposite profiles across depth, while Attn. remains flat.
</details>
(a) Factual Associations
<details>
<summary>x15.png Details</summary>

### Visual Description
Heatmap of average JS divergence across layers 0-30 (x-axis: Layer; color scale: Avg JS Divergence, 0.1-0.6) for three rows: Subj., Attn., Last. Subj.: near-maximal divergence (~0.6) through layer ~12, fading to ~0.1 by layer 22. Attn.: low (~0.1), with a modest rise to ~0.2-0.3 over layers 12-20. Last.: gradual increase from ~0.1 to ~0.45-0.5 at the final layers, roughly the inverse of the Subj. trend.
</details>
(b) Associated Hallucinations
<details>
<summary>x16.png Details</summary>

### Visual Description
Heatmap of average JS divergence across layers 0-30 (x-axis: Layer; color scale: Avg JS Divergence, 0.1-0.6) for three rows: Subj., Attn., Last. Subj.: darkest (~0.6) at layers 0-4, decaying steadily to ~0.1-0.2 beyond layer 18. Attn.: uniformly low (~0.1-0.15) across all layers. Last.: rises from ~0.1 starting around layer 10 to a medium plateau (~0.3-0.4) in the later layers, the inverse of the Subj. pattern.
</details>
(c) Unassociated Hallucinations
Figure 13: Effect of interventions across layers of Mistral-7B-v0.3. The heatmap shows JS divergence between the output distribution before and after intervention. Darker color indicates that the intervened hidden states are more causally influential on the model's predictions. Top row: patching representations of subject tokens. Middle row: blocking attention flow from subject to the last token. Bottom row: patching representations of the last token.
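The per-layer divergence shown in these heatmaps can be reproduced with a short routine. The sketch below (function name ours) computes the Jensen-Shannon divergence, in bits, between two output probability distributions:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between probability vectors p and q.

    Returns a value in [0, 1]: 0 when the distributions are identical,
    approaching 1 when their supports are disjoint."""
    p = np.asarray(p, dtype=float) + eps  # eps guards against log(0)
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Averaging this quantity over samples, separately for each layer and intervention type, yields one cell of the heatmap.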
### A.3 Implementation Details
Checkpoints and GPU resources.
All the checkpoints used in our experiments are provided by the Hugging Face Transformers library Wolf et al. (2019). Specifically, we use the checkpoints "meta-llama/Meta-Llama-3-8B" (https://huggingface.co/meta-llama/Meta-Llama-3-8B) and "mistralai/Mistral-7B-v0.3" (https://huggingface.co/mistralai/Mistral-7B-v0.3) for the experiments on response generation (§ 3), hidden-state analysis (§ 4), and assessing the performance of hallucination detection methods (§ 5). For refusal tuning (§ 6), we use checkpoints provided by the Unsloth framework Daniel Han and team (2023), namely "unsloth/llama-3-8b" (https://huggingface.co/unsloth/llama-3-8b) and "unsloth/mistral-7b-v0.3" (https://huggingface.co/unsloth/mistral-7b-v0.3), which enable more efficient fine-tuning. All experiments are conducted on 4 NVIDIA L40S GPUs.
<details>
<summary>x17.png Details</summary>

### Visual Description
Line chart of layer-wise norm ratios (y-axis: Norm Ratio, ~0.96-1.01; x-axis: Layers, 0-31). Blue line with circle markers (Asso. Hallu./Factual Asso.): stays close to 1.00 throughout, with a shallow dip (~0.996) around layer 6. Red line with square markers (Unasso. Hallu./Factual Asso.): drops sharply to ~0.955-0.965 over layers 5-15, recovers toward 1.00 in the later layers, and ends with a spike (~1.01) at layer 31.
</details>
Figure 14: Norm ratio curves of subject representations in Mistral-7B-v0.3, comparing AHs and UHs against FAs as the baseline. At earlier layers, the norm of UH samples is significantly lower than that of AH samples.
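The norm-ratio curves can be computed along the following lines; this is an illustrative sketch (the array shapes and function name are our assumptions, not the paper's code):

```python
import numpy as np

def layerwise_norm_ratio(group_states, baseline_states):
    """Per-layer mean L2 norm of one group's subject representations,
    divided by the baseline (FA) group's mean norm at the same layer.

    Both inputs have shape (num_samples, num_layers, hidden_dim);
    the result has shape (num_layers,)."""
    group_norms = np.linalg.norm(group_states, axis=-1).mean(axis=0)
    baseline_norms = np.linalg.norm(baseline_states, axis=-1).mean(axis=0)
    return group_norms / baseline_norms
```

Calling this once with AH states and once with UH states, each against the FA states, produces the two curves in Figure 14.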
<details>
<summary>x18.png Details</summary>

### Visual Description
Grouped bar chart of sample percentages (y-axis: Percentage (%), 0-100) by subject popularity level (Low, Mid, High). Low: ~5% Factual Associations (green), ~2% Associated Hallucinations (blue), ~93% Unassociated Hallucinations (red). Mid: ~25% / ~6% / ~70%. High: ~48% / ~12% / ~40%. Factual Associations rise steadily with popularity while Unassociated Hallucinations fall; Associated Hallucinations rise modestly.
</details>
Figure 15: Sample distribution across different subject popularity (low, mid, high) in Mistral-7B-v0.3, measured by monthly Wikipedia page views.
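The popularity split can be obtained by binning monthly page views; below is a minimal sketch that uses tercile cutoffs (an assumption on our part, as the exact thresholds are not stated):

```python
import numpy as np

def popularity_bins(page_views, low_q=1/3, high_q=2/3):
    """Assign each subject a low/mid/high popularity label based on
    monthly Wikipedia page views, split at the given quantiles."""
    views = np.asarray(page_views, dtype=float)
    lo, hi = np.quantile(views, [low_q, high_q])
    return np.where(views <= lo, "low",
                    np.where(views <= hi, "mid", "high"))
```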
Decoding algorithm.
We employ greedy decoding ($temperature = 0$) for response generation, with models run in BF16 precision.
PEFT settings for refusal tuning.
For refusal tuning, we fine-tune both models using QLoRA Dettmers et al. (2023), implemented with the Unsloth framework Daniel Han and team (2023), with rank $r=8$ and scaling factor $\alpha=8$. QLoRA adapters are applied to all attention and MLP modules, and each model is fine-tuned for one epoch.
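For reference, these settings correspond to a LoRA configuration along the following lines (a sketch using the Hugging Face `peft` API; the target-module list is the standard set of attention and MLP projections in Llama/Mistral-style architectures and is our assumption, not quoted from the paper):

```python
from peft import LoraConfig

# QLoRA adapter settings for refusal tuning: rank 8, alpha 8,
# adapters on all attention and MLP projection modules.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    task_type="CAUSAL_LM",
)
```

With Unsloth, the same settings are passed through `FastLanguageModel.get_peft_model`, which wraps the base model in 4-bit precision before attaching the adapters.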
<details>
<summary>x19.png Details</summary>

### Visual Description
Line chart of attention contribution norms (y-axis: Norm, 0-5; x-axis: Layer, 0-31) for Factual Asso. (green triangles), Asso. Hallu. (blue circles), and Unasso. Hallu. (red squares). The green and blue curves track each other closely: a slow rise through layer ~15, a sharp synchronized spike to ~4.6-4.8 at layer 20, then a volatile decline with secondary peaks around layer 29. The red curve stays below ~1.3 throughout, with only a modest peak (~1.2) at layer 20.
</details>
Figure 16: Subject-to-last attention contribution norms across layers in Mistral-7B-v0.3. Values show the norm of the attention contribution from subject tokens to the last token at each layer.
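The quantity in Figure 16 can be sketched as follows for a single head at a single layer (a toy illustration; the variable names and the single-head simplification are ours):

```python
import numpy as np

def subject_to_last_contribution_norm(attn_weights, value_vectors, subject_positions):
    """L2 norm of the attention contribution from subject tokens to the last token.

    attn_weights: (seq_len, seq_len) post-softmax attention matrix for one head.
    value_vectors: (seq_len, hidden_dim) value vectors (after output projection).
    subject_positions: indices of the subject tokens in the sequence."""
    last_row = attn_weights[-1]  # attention from the last token to every position
    contribution = sum(last_row[j] * value_vectors[j] for j in subject_positions)
    return np.linalg.norm(contribution)
```

Summing contributions over heads before taking the norm, and repeating per layer, would give one curve point per layer.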
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Chart: Cosine Similarity Across Model Layers for Different Association Types
### Overview
This image is a line chart plotting "Cosine Similarity" on the y-axis against "Layers" (of a neural network model) on the x-axis. It compares the similarity trends for three distinct categories: Factual Associations, Associated Hallucinations, and Unassociated Hallucinations. The chart shows how the representational similarity of these concepts changes across the depth of the model.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Layers"
* **Scale:** Linear, ranging from 0 to approximately 31.
* **Major Tick Marks:** 0, 5, 10, 15, 20, 25, 30.
* **Y-Axis:**
* **Label:** "Cosine Similarity"
* **Scale:** Linear, ranging from 0.3 to approximately 0.95.
* **Major Tick Marks:** 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9.
* **Legend:**
* **Position:** Bottom-left corner of the plot area.
* **Entries:**
1. **Factual Associations:** Green line with upward-pointing triangle markers (âČ).
2. **Associated Hallucinations:** Blue line with circle markers (â).
3. **Unassociated Hallucinations:** Red (salmon) line with square markers (â ).
### Detailed Analysis
**1. Factual Associations (Green Line, âČ):**
* **Trend:** Starts very high, exhibits a gradual, slightly noisy decline through the early and middle layers, followed by a steep drop in the later layers, reaching a minimum before a final upward turn.
* **Data Points (Approximate):**
* Layer 0: ~0.92
* Layer 5: ~0.88
* Layer 10: ~0.84
* Layer 15: ~0.83
* Layer 17: ~0.82 (Last point before steep decline)
* Layer 20: ~0.55
* Layer 25: ~0.27 (Minimum point)
* Layer 30: ~0.40
* Layer 31: ~0.47
**2. Associated Hallucinations (Blue Line, ●):**
* **Trend:** Follows a path very closely aligned with the Factual Associations line for most of the chart. It starts high, declines gradually, then drops steeply in tandem with the green line, reaching a similar minimum before a final rise.
* **Data Points (Approximate):**
* Layer 0: ~0.91
* Layer 5: ~0.87
* Layer 10: ~0.82
* Layer 15: ~0.81
* Layer 17: ~0.80
* Layer 20: ~0.54
* Layer 25: ~0.28 (Minimum point)
* Layer 30: ~0.41
* Layer 31: ~0.48
**3. Unassociated Hallucinations (Red Line, ■):**
* **Trend:** Maintains a consistently higher cosine similarity than the other two series throughout the entire chart. It shows a much more gradual decline, a less severe dip in the later layers, and a stronger recovery at the end.
* **Data Points (Approximate):**
* Layer 0: ~0.93
* Layer 5: ~0.90
* Layer 10: ~0.86
* Layer 15: ~0.85
* Layer 17: ~0.85
* Layer 20: ~0.72
* Layer 25: ~0.60 (Minimum point)
* Layer 30: ~0.66
* Layer 31: ~0.72
### Key Observations
1. **Divergence Point:** All three lines are tightly clustered and slowly declining until approximately Layer 17. After this point, they diverge sharply.
2. **The "Valley":** The Factual Associations and Associated Hallucinations lines plummet to a deep minimum (cosine similarity ~0.27-0.28) around Layer 25. The Unassociated Hallucinations line dips much less severely, bottoming out around 0.60.
3. **Final Recovery:** All three lines show an upward trend in the final layers (25-31). The recovery is strongest for Unassociated Hallucinations, moderate for the other two.
4. **Tight Coupling:** The green (Factual) and blue (Associated Hallucinations) lines are nearly superimposed for the first half of the model and follow an almost identical trajectory thereafter, suggesting their representations are processed very similarly.
### Interpretation
This chart likely visualizes the internal representational dynamics of a large language model. The cosine similarity measures how closely the model's internal activations for different concepts align.
* **Early Layers (0-17):** All concepts (factual knowledge, hallucinations related to that knowledge, and unrelated hallucinations) start with high similarity. This suggests the model's early processing is general, dealing with broad semantic or syntactic features common to all text.
* **Middle-to-Late Layers (17-25):** This is the critical processing stage. The sharp drop for Factual Associations and Associated Hallucinations indicates the model is performing specialized, fine-grained differentiation. It is actively separating the neural pathways for factual recall from those generating associated errors. The fact that they drop together suggests "Associated Hallucinations" are closely tied to the factual representation, perhaps being distortions of it.
* **Unassociated Hallucinations** remain more similar to the early, general representations, implying they are less integrated with the model's core factual knowledge structures.
* **Final Layers (25-31):** The recovery in similarity might represent a final integration stage before output, where diverse representations are mapped back into a coherent output space. The stronger recovery for Unassociated Hallucinations could indicate they are less constrained by the factual knowledge framework at this final stage.
**In essence, the data suggests the model has a dedicated "processing valley" in its middle-to-late layers where it rigorously distinguishes factual knowledge from potential errors, with hallucinations closely tied to facts being processed in a very similar manner to the facts themselves.**
</details>
Figure 17: Cosine similarity of target-token hidden states across layers in Mistral-7B-v0.3. From mid-layers onward, FAs and AHs diverge sharply as subject information propagates, while UHs remain more clustered, confirming weaker subject-dependent updates.
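As a hedged illustration of the metric plotted in Figure 17, the sketch below computes per-layer cosine similarity between two hidden-state trajectories. The array shapes and function name are assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def layerwise_cosine(hidden_a, hidden_b):
    """Cosine similarity between two hidden-state trajectories, per layer.

    hidden_a, hidden_b: arrays of shape (num_layers, hidden_dim), e.g. the
    target-token hidden state at every layer for two different samples.
    Returns an array of shape (num_layers,).
    """
    dot = np.sum(hidden_a * hidden_b, axis=-1)
    norms = np.linalg.norm(hidden_a, axis=-1) * np.linalg.norm(hidden_b, axis=-1)
    return dot / norms

# Sanity check: a trajectory compared with itself has similarity 1 at every layer.
h = np.random.default_rng(0).normal(size=(32, 4096))
sims = layerwise_cosine(h, h)
print(sims.shape)  # (32,)
```

In practice, the per-category curves in Figure 17 would average such pairwise similarities within each sample group (FAs, AHs, UHs) at every layer.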
<details>
<summary>x21.png Details</summary>

### Visual Description
## Scatter Plot of Categorical Data Points
### Overview
The image is a 2D scatter plot displaying data points categorized into three distinct groups, differentiated by color. The plot visualizes the spatial distribution and clustering of these categories across a Cartesian coordinate system. No specific axis titles are provided, indicating the data is likely projected into an abstract or reduced-dimensional space (e.g., from a technique like PCA or t-SNE).
### Components/Axes
* **Legend:** Located in the top-right corner of the plot area. It contains three entries:
* **Green Circle:** Labeled "Factual Asso."
* **Blue Circle:** Labeled "Asso. Hallu."
* **Red Circle:** Labeled "Unasso. Hallu."
* **X-Axis:** A horizontal numerical axis. Major tick marks and labels are present at intervals of 10, ranging from **-20** to **30**. The visible labels are: -20, -10, 0, 10, 20, 30.
* **Y-Axis:** A vertical numerical axis. Major tick marks and labels are present at intervals of 10, ranging from **-20** to **30**. The visible labels are: -20, -10, 0, 10, 20, 30.
* **Plot Area:** A white background containing all data points. The axes form a bounding box around the data.
### Detailed Analysis
The data points are distributed with clear spatial segregation based on category:
1. **Unasso. Hallu. (Red Points):**
* **Spatial Grounding:** Primarily clustered in the **top-left quadrant** of the plot.
* **Trend & Distribution:** Forms a dense, roughly elliptical cluster. The center of mass appears to be around coordinates **(-10, 15)**. The cluster spans approximately from X = -25 to X = 0 and Y = 0 to Y = 30. This group shows the most distinct separation from the others.
2. **Asso. Hallu. (Blue Points):**
* **Spatial Grounding:** Predominantly located in the **bottom half and right side** of the plot.
* **Trend & Distribution:** Exhibits a more dispersed, elongated distribution. Points are scattered from approximately X = -25 to X = 30 and Y = -20 to Y = 15. There is a noticeable concentration in the **bottom-right quadrant** (X > 0, Y < 0). This group significantly overlaps with the green points.
3. **Factual Asso. (Green Points):**
* **Spatial Grounding:** Intermixed with the blue points, primarily in the **bottom-center and bottom-right** regions.
* **Trend & Distribution:** Also shows a dispersed distribution, similar in range to the blue points (X: -20 to 30, Y: -20 to 10). The highest density appears in the region around **X = 0 to 20, Y = -15 to 0**. There is substantial spatial overlap with the "Asso. Hallu." (blue) category.
### Key Observations
* **Clear Cluster Separation:** The "Unasso. Hallu." (red) category forms a tight, isolated cluster in the upper-left, distinct from the other two groups.
* **Significant Overlap:** The "Asso. Hallu." (blue) and "Factual Asso." (green) categories are heavily intermingled across the lower and right portions of the plot, suggesting they share similar characteristics in this projected space.
* **Density Variation:** The red cluster appears denser than the more scattered blue and green point distributions.
* **Absence of Outliers:** There are no extreme outliers far removed from their respective group's general distribution.
### Interpretation
This scatter plot likely visualizes the output of a dimensionality reduction technique applied to data related to language model outputs or knowledge associations, given the labels ("Asso." for Association, "Hallu." for Hallucination).
* **What the data suggests:** The spatial separation implies that the underlying features used for projection can effectively distinguish "Unassociated Hallucinations" (red) from the other two categories. The significant overlap between "Associated Hallucinations" (blue) and "Factual Associations" (green) indicates that, in this feature space, these two phenomena are not easily separable. This could mean they share similar statistical or semantic properties, or that the model's internal representations for factual knowledge and associated hallucinations are closely aligned.
* **How elements relate:** The plot demonstrates a potential hierarchy or relationship. "Unassociated Hallucinations" appear as a distinct outlier class. The core challenge highlighted is the ambiguity between "Associated Hallucinations" and correct "Factual Associations," as they occupy a similar region in the latent space. This visualizes the difficulty a model might have in distinguishing between a fact it knows and a plausible-sounding but incorrect association it generates.
* **Notable Anomalies:** The primary anomaly is the stark isolation of the red cluster. The lack of axis titles is a critical limitation, as the meaning of the dimensions (e.g., "semantic similarity," "confidence score," "embedding dimension 1") is essential for a full technical interpretation. The data suggests that "Unassociated Hallucinations" may arise from a fundamentally different process or represent a different type of error compared to "Associated Hallucinations."
</details>
Figure 18: t-SNE visualization of last token's representations at layer 25 of Mistral-7B-v0.3.
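A minimal sketch of how a 2D t-SNE projection like Figure 18 can be produced from per-sample hidden states, assuming scikit-learn is available. The synthetic cluster means are illustrative stand-ins for the three sample types, not real model activations.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for last-token hidden states at one layer: 60 samples, 128 dims.
rng = np.random.default_rng(0)
states = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 128)),  # stand-in: factual associations
    rng.normal(0.5, 1.0, size=(20, 128)),  # stand-in: associated hallucinations
    rng.normal(5.0, 1.0, size=(20, 128)),  # stand-in: unassociated hallucinations
])

# Project to 2D; perplexity must be smaller than the number of samples.
proj = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(states)
print(proj.shape)  # (60, 2)
```

With a mean shift this large, the third group separates cleanly in the projection, mirroring the isolated UH cluster seen in the figure.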
<details>
<summary>x22.png Details</summary>

### Visual Description
## Bar Chart: AUROC Comparison by Hallucination Type and Representation
### Overview
The image is a grouped bar chart with error bars, comparing the Area Under the Receiver Operating Characteristic curve (AUROC) for two types of hallucinations across three different representation types. The chart visually demonstrates a consistent performance gap between the two hallucination categories.
### Components/Axes
* **Y-Axis:** Labeled **"AUROC"**. The scale ranges from 0.4 to 0.9, with major grid lines at 0.1 intervals (0.4, 0.5, 0.6, 0.7, 0.8, 0.9).
* **X-Axis:** Labeled **"Representation Type"**. It contains three categorical groups:
1. **Subject**
2. **Attention**
3. **Last Token**
* **Legend:** Positioned at the bottom center of the chart.
* **Red Bar:** Labeled **"Unassociated Hallucination"**.
* **Blue Bar:** Labeled **"Associated Hallucination"**.
* **Error Bars:** Black vertical lines extending from the top of each bar, indicating variability or confidence intervals around the mean AUROC value.
### Detailed Analysis
The chart presents the following approximate AUROC values (estimated from the grid lines) and trends for each representation type:
**1. Subject Representation:**
* **Unassociated Hallucination (Red):** The bar is the tallest in the chart, reaching approximately **0.89**. The error bar spans roughly from 0.88 to 0.90.
* **Associated Hallucination (Blue):** The bar is significantly shorter, at approximately **0.59**. The error bar spans roughly from 0.56 to 0.62.
* **Trend:** This category shows the largest performance gap between the two hallucination types.
**2. Attention Representation:**
* **Unassociated Hallucination (Red):** The bar reaches approximately **0.78**. The error bar spans roughly from 0.76 to 0.80.
* **Associated Hallucination (Blue):** The bar reaches approximately **0.56**. The error bar spans roughly from 0.53 to 0.60.
* **Trend:** Both values are lower than their counterparts in the "Subject" category, but the gap remains substantial.
**3. Last Token Representation:**
* **Unassociated Hallucination (Red):** The bar reaches approximately **0.84**. The error bar spans roughly from 0.82 to 0.86.
* **Associated Hallucination (Blue):** The bar reaches approximately **0.56**. The error bar spans roughly from 0.54 to 0.58.
* **Trend:** The Unassociated value is high (second only to "Subject"), while the Associated value is similar to that of the "Attention" category.
### Key Observations
1. **Consistent Performance Gap:** Across all three representation types (Subject, Attention, Last Token), the AUROC for **Unassociated Hallucination** is markedly higher than for **Associated Hallucination**.
2. **Highest and Lowest Points:** The highest measured AUROC is for Unassociated Hallucination using the **Subject** representation (~0.89). The lowest measured AUROC is for Associated Hallucination using the **Attention** representation (~0.56).
3. **Stability of Associated Hallucination Scores:** The AUROC values for Associated Hallucination are relatively stable and low across all representation types, clustering between approximately 0.56 and 0.59.
4. **Variability:** The error bars suggest moderate variability in the measurements, with the largest apparent spread (uncertainty) seen in the Associated Hallucination score for the "Subject" representation.
### Interpretation
This chart likely comes from a study evaluating methods for detecting hallucinations in AI models (e.g., large language models). The data suggests a fundamental difference in the detectability of the two hallucination types:
* **Unassociated Hallucinations** (likely errors where the model generates information not associated with the input context) appear to be **significantly easier to detect**, as indicated by the high AUROC scores (approaching 0.9). The "Subject" representation is the most effective signal for this detection.
* **Associated Hallucinations** (likely errors where the model generates information that is associated with but incorrect or distorted from the input context) are **much harder to detect**, with AUROC scores hovering just above 0.5. An AUROC of 0.5 represents random guessing, so these scores indicate only a slight detectability advantage over chance.
The implication is that current representation-based detection methods are relatively successful at flagging completely fabricated, context-free information but struggle significantly with more subtle errors that are contextually linked. This highlights a key challenge in AI safety and reliability: catching the more insidious, associated mistakes. The choice of representation ("Subject," "Attention," "Last Token") has a notable impact on detecting unassociated hallucinations but a minimal impact on detecting associated ones.
</details>
Figure 19: Hallucination detection performance on the Full setting (Mistral-7B-v0.3).
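The AUROC-style probing evaluation behind Figure 19 can be approximated by the following sketch: a logistic-regression probe trained on stand-in hidden-state features, scored with AUROC on held-out data. This is a simplification under assumed names and synthetic data, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in features: hidden states for factual (label 0) vs hallucinated (label 1).
X = np.vstack([rng.normal(0, 1, (200, 64)), rng.normal(1, 1, (200, 64))])
y = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
```

An AUROC near 0.5 means the probe performs at chance, which is roughly what Figure 19 reports for associated hallucinations; easily separable features (as in this toy data) push it toward 1.0, as seen for unassociated hallucinations.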
<details>
<summary>x23.png Details</summary>

### Visual Description
## Grouped Bar Chart: Refusal Ratio by Training Set and Testing Set
### Overview
The image is a grouped bar chart comparing the "Refusal Ratio (%)" of a system (likely an AI model) when tested on three different types of data, after being trained on one of two specific training sets. The chart visually demonstrates how the training data composition affects the model's tendency to refuse responses during testing.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:** Labeled **"Refusal Ratio (%)"**. The scale runs from 0 to 100 in increments of 20 (0, 20, 40, 60, 80, 100).
* **X-Axis:** Labeled **"Training Set"**. It contains two categorical groups:
1. **"UH Only"** (left group)
2. **"AH Only"** (right group)
* **Legend:** Located in the **top-right corner** of the chart area, titled **"Testing set"**. It defines three data series by color:
* **Green square:** "Factual Asso." (Factual Association)
* **Blue square:** "Asso. Hallu." (Associated Hallucination)
* **Red square:** "Unasso. Halluc." (Unassociated Hallucination)
### Detailed Analysis
The chart presents data for two training conditions, each tested on three data types. Values are approximate visual estimates.
**1. Training Set: "UH Only"**
* **Factual Asso. (Green Bar):** The bar height indicates a refusal ratio of approximately **10%**.
* **Asso. Hallu. (Blue Bar):** The bar height indicates a refusal ratio of approximately **15%**.
* **Unasso. Halluc. (Red Bar):** This is the tallest bar in the group, indicating a very high refusal ratio of approximately **85%**.
**2. Training Set: "AH Only"**
* **Factual Asso. (Green Bar):** The bar height indicates a refusal ratio of approximately **18%**.
* **Asso. Hallu. (Blue Bar):** The bar height indicates a refusal ratio of approximately **22%**.
* **Unasso. Halluc. (Red Bar):** The bar height indicates a refusal ratio of approximately **52%**.
**Trend Verification:**
* For the **"UH Only"** training set, the refusal ratio shows a steep, positive trend from "Factual Asso." to "Asso. Hallu." to "Unasso. Halluc.".
* For the **"AH Only"** training set, the refusal ratio also shows a positive trend across the same sequence, but the slope is less steep, and the absolute values are more moderate.
### Key Observations
1. **Dominant Effect of "Unasso. Halluc.":** Across both training sets, the "Unasso. Halluc." testing set (red bars) consistently elicits the highest refusal ratio.
2. **Training Set Impact:** The "UH Only" training set leads to a dramatically higher refusal ratio for "Unasso. Halluc." (~85%) compared to the "AH Only" training set (~52%).
3. **Factual Baseline:** The refusal ratio for "Factual Asso." is the lowest in both groups, serving as a baseline. It is slightly higher in the "AH Only" condition (~18%) than in the "UH Only" condition (~10%).
4. **Associated Hallucination Response:** The refusal ratio for "Asso. Hallu." is intermediate in both groups, sitting between the values for factual data and unassociated hallucinations.
### Interpretation
This chart likely illustrates the results of an experiment on AI model safety or alignment, specifically measuring a model's propensity to "refuse" to answer certain prompts. The data suggests a strong correlation between the type of data a model is trained on and its subsequent refusal behavior.
* **"UH Only" Training:** Models trained exclusively on data related to **Unassociated Hallucinations** become extremely sensitive to that specific type of prompt during testing, refusing them at a very high rate (85%). However, this specialization comes at a cost: their refusal rate for factual associations is the lowest, suggesting they may be less cautious or discerning with factual information.
* **"AH Only" Training:** Models trained on **Associated Hallucinations** show a more balanced, though still elevated, refusal profile. They are less hyper-sensitive to unassociated hallucinations than the UH-only model but maintain a higher baseline refusal rate across all categories, including factual associations. This could indicate a more generalized, but potentially over-cautious, safety behavior.
* **Underlying Pattern:** The consistent ordering of refusal rates (Factual < Associated Hallucination < Unassociated Hallucination) across both training regimes indicates a fundamental hierarchy in how the model categorizes and responds to these prompt types. Unassociated hallucinations are treated as the most "dangerous" or requiring the strongest refusal response. The training set primarily modulates the *intensity* of this response, especially for the most extreme category.
</details>
Figure 20: Refusal tuning performance across three types of samples (Mistral-7B-v0.3).
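For the refusal-ratio metric reported in Figure 20, here is a hedged sketch of how such a ratio might be computed from model responses, using simple substring matching as an illustrative refusal judgment; the actual criterion used in the paper may differ.

```python
def refusal_ratio(responses, refusal_markers=("i don't know", "i cannot")):
    """Percentage of responses flagged as refusals.

    A crude stand-in for the refusal judgment: a response counts as a
    refusal if it contains any marker phrase (case-insensitive). The
    marker list is illustrative, not the paper's criterion.
    """
    def is_refusal(text):
        t = text.lower()
        return any(marker in t for marker in refusal_markers)

    flags = [is_refusal(r) for r in responses]
    return 100.0 * sum(flags) / len(flags)

# Two refusals out of four responses -> 50.0
ratio = refusal_ratio(["I don't know.", "Paris", "Tokyo", "I cannot answer that."])
print(ratio)  # 50.0
```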
## Appendix B Parallel Experiments on Mistral
This section documents parallel experiments conducted on the Mistral-7B-v0.3 model under the same settings as described in the main text (Figures 13 to 20).
The results from Mistral exhibit similar patterns to those observed in LLaMA, as described before. Specifically, we find consistent patterns in the modelâs internal computations, hidden-state behaviors, and the performance of hallucination detection and refusal tuning experiments.
<details>
<summary>x24.png Details</summary>

### Visual Description
## Scatter Plot: Distribution of Association and Hallucination Categories
### Overview
The image is a scatter plot displaying data points across a two-dimensional Cartesian coordinate system. The plot visualizes the distribution and clustering of three distinct categories of data, differentiated by color. The overall distribution shows significant overlap between categories in some regions and distinct clustering in others.
### Components/Axes
* **Chart Type:** Scatter Plot
* **X-Axis:** Linear scale ranging from approximately -25 to +25. Major tick marks are labeled at intervals of 10: -20, -10, 0, 10, 20.
* **Y-Axis:** Linear scale ranging from approximately -25 to +25. Major tick marks are labeled at intervals of 10: -20, -10, 0, 10, 20.
* **Legend:** Located in the bottom-left quadrant of the plot area. It contains three entries:
* **Green Circle:** Labeled "Factual Asso."
* **Blue Circle:** Labeled "Asso. Hallu."
* **Red Circle:** Labeled "Unasso. Hallu."
* **Data Points:** Hundreds of individual circular markers plotted according to their (x, y) coordinates.
### Detailed Analysis
**Spatial Distribution and Trends by Category:**
1. **Factual Asso. (Green):**
* **Trend/Cluster:** Primarily clustered in the upper-left quadrant (negative X, positive Y). The density is highest roughly between X = -20 to 0 and Y = 0 to 20.
* **Spread:** Shows a moderate spread, with some points extending into the upper-right quadrant and a few scattered in the lower half of the plot.
* **Visual Check:** The green points form a broad, diffuse cloud centered in the upper-left region.
2. **Asso. Hallu. (Blue):**
* **Trend/Cluster:** Heavily overlaps with the "Factual Asso." (green) cluster in the upper-left quadrant. The blue points are densely intermingled with the green points in the region of X = -20 to 0, Y = 0 to 20.
* **Spread:** Also shows a significant spread, with a notable tail of points extending diagonally down towards the bottom-right quadrant. Some blue points are found in all four quadrants.
* **Visual Check:** The blue points share the primary upper-left cluster with green but exhibit a more pronounced diagonal dispersion towards the lower-right.
3. **Unasso. Hallu. (Red):**
* **Trend/Cluster:** Forms a distinct, dense cluster primarily in the bottom-right quadrant (positive X, negative Y). The core of this cluster is centered approximately around X = 15, Y = -10.
* **Spread:** This category is the most spatially concentrated. While a few red points are scattered elsewhere (e.g., near the top center), the vast majority are tightly grouped in the lower-right region.
* **Visual Check:** The red cluster is visually separate from the main green/blue overlap, creating a clear separation in the plot's lower-right area.
**Cross-Reference Verification:**
* The legend's green circle corresponds to the points labeled "Factual Asso." located mainly in the upper-left.
* The legend's blue circle corresponds to the points labeled "Asso. Hallu." which overlap with green in the upper-left and spread diagonally.
* The legend's red circle corresponds to the points labeled "Unasso. Hallu." forming the distinct cluster in the bottom-right.
### Key Observations
1. **Primary Clustering:** There are two major clusters: a mixed green/blue cluster in the upper-left and a distinct red cluster in the bottom-right.
2. **Category Separation:** "Unasso. Hallu." (red) is largely separated from the other two categories, suggesting a fundamental difference in its underlying data characteristics.
3. **Category Overlap:** "Factual Asso." (green) and "Asso. Hallu." (blue) show substantial overlap, indicating these categories share similar characteristics in this 2D projection, though the blue series shows greater dispersion.
4. **Outliers:** A few red points appear near the top of the plot (Y ≈ 20), and a few blue points are found deep within the red cluster. These could be outliers or misclassifications.
### Interpretation
This scatter plot likely represents the output of a dimensionality reduction technique (like t-SNE or PCA) applied to high-dimensional data, projecting it into 2D for visualization. The data appears to be related to the analysis of associations and hallucinations, possibly in the context of language models or cognitive science.
* **What the data suggests:** The clear separation of "Unasso. Hallu." (Unassociated Hallucinations) implies that these instances have a distinct signature or feature set compared to associated hallucinations and factual associations. The significant overlap between "Factual Asso." and "Asso. Hallu." suggests that associated hallucinations may be generated through processes or possess features that are very similar to those involved in retrieving factual associations, making them harder to distinguish in this feature space.
* **Relationship between elements:** The spatial proximity in such plots typically indicates similarity. Therefore, the model or system being analyzed treats many "Asso. Hallu." cases similarly to "Factual Asso." cases, while "Unasso. Hallu." cases are treated as a separate class.
* **Notable implications:** The diagonal spread of the blue "Asso. Hallu." points from the main cluster towards the red cluster could represent a continuum or transition zone. This might indicate varying degrees of "association strength" or "hallucination confidence" within the associated hallucination category. The plot visually argues that unassociated hallucinations are a distinct phenomenon, while associated hallucinations are closely tied to factual retrieval processes.
</details>
Figure 21: t-SNE visualization of subject tokens' representations at layer 11 of LLaMA-3-8B.
<details>
<summary>x25.png Details</summary>

### Visual Description
## Scatter Plot: Association and Hallucination Categories
### Overview
The image is a scatter plot displaying data points categorized into three groups, differentiated by color. The plot visualizes the distribution and clustering of these categories across a two-dimensional space defined by numerical x and y axes. No explicit axis titles are provided, suggesting the axes represent abstract or derived dimensions (e.g., principal components, latent space coordinates).
### Components/Axes
* **Legend:** Located in the top-right corner. It defines three categories:
* **Green Circle:** "Factual Asso." (Factual Association)
* **Blue Circle:** "Asso. Hallu." (Associated Hallucination)
* **Red Circle:** "Unasso. Hallu." (Unassociated Hallucination)
* **X-Axis:** A horizontal numerical axis with major tick marks labeled at -20, -10, 0, 10, and 20. The visible range extends slightly beyond these marks, approximately from -25 to +25.
* **Y-Axis:** A vertical numerical axis with major tick marks labeled at -20, -10, 0, 10, 20, and 30. The visible range extends from approximately -25 to +35.
* **Data Points:** Hundreds of colored circles (green, blue, red) are scattered across the plot area. The points are semi-transparent, allowing some visibility of overlap.
### Detailed Analysis
**Spatial Distribution and Trends:**
1. **Factual Asso. (Green):**
* **Trend:** These points are widely dispersed but show a concentration in the lower half of the plot (negative y-values). They are spread across the entire x-axis range.
* **Key Regions:** A dense cluster exists in the bottom-center and bottom-right quadrants (x: 0 to 20, y: -20 to 0). Another notable grouping is in the bottom-left (x: -15 to -5, y: -20 to -10). Some green points are interspersed within the blue and red clusters.
2. **Asso. Hallu. (Blue):**
* **Trend:** These points form the most widespread and centrally located group. They appear as a broad, diffuse cloud covering much of the central and right portions of the plot.
* **Key Regions:** The highest density is in the central region (x: -5 to 15, y: -10 to 15). There is a significant spread towards the right side (positive x). Blue points are heavily intermingled with green points in the lower half and with red points in the upper-left.
3. **Unasso. Hallu. (Red):**
* **Trend:** This group shows the most distinct clustering. The points are predominantly concentrated in the upper-left quadrant.
* **Key Regions:** A very dense, tight cluster is located approximately between x: -15 to 0 and y: 5 to 20. This cluster has the highest y-values on average. A secondary, less dense scattering of red points extends towards the center and right, often overlapping with blue points.
**Data Point Approximation (Representative Examples):**
* **Extreme High Y (Red):** A red point is near (x ≈ -8, y ≈ 32).
* **Extreme Low Y (Green):** A green point is near (x ≈ -2, y ≈ -24).
* **Extreme Left (Blue):** A blue point is near (x ≈ -23, y ≈ 0).
* **Extreme Right (Green/Blue):** Points from both categories are near (x ≈ 24, y ≈ -5).
### Key Observations
1. **Clear Separation of "Unasso. Hallu.":** The red points form a distinct, dense cluster in the upper-left, suggesting this category occupies a specific region of this feature space.
2. **Overlap Between "Factual Asso." and "Asso. Hallu.":** Green and blue points are heavily intermixed, particularly in the lower half of the plot (y < 0). This indicates these categories are not well-separated by these two dimensions.
3. **Gradient in Y-Value:** There is a rough vertical stratification: red points dominate the top (high y), blue points the middle, and green points the bottom (low y), though with significant overlap.
4. **X-Axis Spread:** All categories span the full x-axis range, but the red cluster is skewed towards negative x-values.
### Interpretation
This scatter plot likely visualizes the output of a dimensionality reduction technique (like t-SNE or PCA) applied to data related to language model outputs or knowledge associations. The axes represent latent dimensions that capture variance in the data.
The spatial arrangement suggests:
* **"Unassociated Hallucinations" (Red)** are a distinct phenomenon, characterized by high values in the latent dimension represented by the y-axis. Their clustering implies consistency in whatever underlying feature causes this separation.
* **"Factual Associations" (Green) and "Associated Hallucinations" (Blue)** are more similar to each other in this latent space, as evidenced by their significant overlap. This could mean that, based on the analyzed features, it is difficult to distinguish between a correct association and a hallucinated one that is still contextually associated.
* The overall distribution implies that the model's "factual" and "associated hallucination" states are part of a continuum, while "unassociated hallucinations" represent a more extreme or outlier state. The plot serves as a diagnostic tool to understand the separability and characteristics of different model error types.
</details>
Figure 22: t-SNE visualization of subject tokens' representations at layer 11 of Mistral-7B-v0.3.
## Appendix C More Visualization on Hidden States
In this section, we provide t-SNE visualizations of subject tokens' hidden states in Figure 21 and Figure 22.
Compared to the last-token representations, the t-SNE visualization of subject-token hidden states shows that unassociated hallucinations (UHs) are moderately separated from factual and associated samples, but the separation is less distinct than in the last-token representations. This observation aligns with the results in § 5, where the hallucination detection performance using last-token hidden states outperforms that based on subject-token representations.