# Large Language Models Do NOT Really Know What They Don’t Know
Abstract
Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may “know what they don’t know”. However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish factual from hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that LLMs don’t really know what they don’t know.
Chi Seng Cheang 1 Hou Pong Chan 2 Wenxuan Zhang 3 Yang Deng 1 1 Singapore Management University 2 DAMO Academy, Alibaba Group 3 Singapore University of Technology and Design cs.cheang.2025@phdcs.smu.edu.sg, houpong.chan@alibaba-inc.com wxzhang@sutd.edu.sg, ydeng@smu.edu.sg
1 Introduction
Large language models (LLMs) demonstrate remarkable proficiency in generating coherent and contextually relevant text, yet they remain plagued by hallucination Zhang et al. (2023b); Huang et al. (2025), a phenomenon where outputs appear plausible but are factually inaccurate or entirely fabricated, raising concerns about their reliability and trustworthiness. To this end, researchers suggest that the internal states of LLMs (e.g., hidden representations Azaria and Mitchell (2023); Gottesman and Geva (2024), attention weights Yüksekgönül et al. (2024), output token logits Orgad et al. (2025); Varshney et al. (2023), etc.) can be used to detect hallucinations, indicating that LLMs themselves may actually know what they don’t know. These methods typically assume that when a model produces hallucinated outputs (e.g., “Barack Obama was born in the city of Tokyo” in Figure 1), its internal computations for the outputs (“Tokyo”) are detached from the input information (“Barack Obama”), thereby differing from those used to generate factually correct outputs. Thus, the hidden states are expected to capture this difference and serve as indicators of hallucinations.
Figure 1: Illustration of three categories of knowledge. Associated hallucinations follow internal knowledge recall processes similar to those of factual associations, while unassociated hallucinations arise when the model’s output is detached from the input.
However, other research (Lin et al., 2022b; Kang and Choi, 2023; Cheang et al., 2023) shows that models can also generate false information that is closely associated with the input. In particular, models may adopt knowledge shortcuts, favoring tokens that frequently co-occur in the training corpus over factually correct answers Kang and Choi (2023). As shown in Figure 1, given the prompt “Barack Obama was born in the city of”, an LLM may rely on the subject tokens’ representations (i.e., “Barack Obama”) to predict a hallucinated output (e.g., “Chicago”) that is statistically associated with the subject entity, albeit in other contexts (e.g., “Barack Obama studied in the city of Chicago”). We therefore suspect that internal computations may not exhibit distinguishable patterns between correct predictions and input-associated hallucinations, as LLMs rely on the input information to produce both. Only when the model produces hallucinations unassociated with the input do the hidden states exhibit distinct patterns that can be reliably identified.
To this end, we conduct a mechanistic analysis of how LLMs internally process factual queries. We first perform causal analysis to identify hidden states crucial for generating Factual Associations (FAs) — factually correct outputs grounded in subject knowledge. We then examine how these hidden states behave when the model produces two types of factual errors: Associated Hallucinations (AHs), which remain grounded in subject knowledge, and Unassociated Hallucinations (UHs), which are detached from it. Our analysis shows that when generating both FAs and AHs, LLMs propagate information encoded in subject representations to the final token during output generation, resulting in overlapping hidden-state geometries that cannot reliably distinguish AHs from FAs. In contrast, UHs exhibit distinct internal computational patterns, producing clearly separable hidden-state geometries from FAs.
Building on this analysis, we revisit several widely used hallucination detection approaches Gottesman and Geva (2024); Yüksekgönül et al. (2024); Orgad et al. (2025) that probe internal states. The results show that these representations cannot reliably distinguish AHs from FAs due to their overlapping hidden-state geometries, though they can effectively separate UHs from FAs. Moreover, this geometry also limits the effectiveness of Refusal Tuning Zhang et al. (2024), which trains LLMs to refuse uncertain queries using a refusal-aware dataset. Because UH samples exhibit consistent and distinctive patterns, refusal tuning generalizes well to unseen UHs but fails to generalize to unseen AHs. We also find that AH hidden states are more diverse, and thus refusal tuning on AH samples fails to generalize to either unseen AH or UH samples.
Together, these findings highlight a central limitation: LLMs do not encode truthfulness in their hidden states but only patterns of knowledge recall and utilization, showing that LLMs don’t really know what they don’t know.
2 Related Work
Existing hallucination detection methods can be broadly categorized into two types: representation-based and confidence-based. Representation-based methods assume that an LLM’s internal hidden states can reflect the correctness of its generated responses. These approaches train a classifier (often a linear probe) on the hidden states from a set of labeled correct/incorrect responses to predict whether a new response is hallucinatory Li et al. (2023); Azaria and Mitchell (2023); Su et al. (2024); Ji et al. (2024); Chen et al. (2024); Ni et al. (2025); Xiao et al. (2025). Confidence-based methods, in contrast, assume that lower confidence during generation indicates a higher probability of hallucination. These methods quantify uncertainty through various signals, including: (i) token-level output probabilities (Guerreiro et al., 2023; Varshney et al., 2023; Orgad et al., 2025); (ii) directly querying the LLM to verbalize its own confidence (Lin et al., 2022a; Tian et al., 2023; Xiong et al., 2024; Yang et al., 2024b; Ni et al., 2024; Zhao et al., 2024); or (iii) measuring the semantic consistency across multiple outputs sampled from the same prompt (Manakul et al., 2023; Kuhn et al., 2023; Zhang et al., 2023a; Ding et al., 2024). A response is typically flagged as a hallucination if its associated confidence metric falls below a predetermined threshold.
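As a minimal illustration of signal (i), a confidence-based detector can threshold the token-level probabilities of the generated answer. The aggregation (taking the minimum probability) and the threshold value below are illustrative assumptions, not a specific method from the cited works:

```python
import numpy as np

def flag_low_confidence(token_probs, threshold=0.5):
    """Confidence-based detection, variant (i): flag a response as a likely
    hallucination when the lowest next-token probability along the generated
    answer falls below a threshold.

    token_probs : probabilities the model assigned to each generated token.
    NOTE (assumption): min-aggregation and threshold=0.5 are illustrative;
    detectors in the literature tune both on held-out data.
    """
    return float(np.min(token_probs)) < threshold
```

For example, an answer generated with token probabilities `[0.9, 0.8]` would pass, while one containing a low-probability token such as `[0.9, 0.2]` would be flagged.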
However, a growing body of work reveals a critical limitation: even state-of-the-art LLMs are poorly calibrated, meaning their expressed confidence often fails to align with the factual accuracy of their generations (Kapoor et al., 2024; Xiong et al., 2024; Tian et al., 2023). This miscalibration limits the effectiveness of confidence-based detectors and raises a fundamental question about the extent of LLMs’ self-awareness of their knowledge boundary, i.e., whether they can “know what they don’t know” Yin et al. (2023); Li et al. (2025). Despite recognizing this problem, prior work does not provide a mechanistic explanation for its occurrence. To this end, our work addresses this explanatory gap by employing mechanistic interpretability techniques to trace the internal computations underlying knowledge recall within LLMs.
3 Preliminary
Transformer Architecture
Given an input sequence of $T$ tokens $t_{1},\dots,t_{T}$, an LLM is trained to model the conditional probability distribution of the next token, $p(t_{T+1}|t_{1},\dots,t_{T})$. Each token is first mapped to a continuous vector by an embedding layer. The resulting sequence of hidden states is then processed by a stack of $L$ Transformer layers. At layer $\ell \in \{1,\dots,L\}$, each token representation is updated by a Multi-Head Self-Attention (MHSA) module and a Feed-Forward Network (MLP) module:
$$
\mathbf{h}^{\ell}=\mathbf{h}^{\ell-1}+\mathbf{a}^{\ell}+\mathbf{m}^{\ell}, \tag{1}
$$
where $\mathbf{a}^{\ell}$ and $\mathbf{m}^{\ell}$ denote the MHSA and MLP outputs, respectively, at the $\ell$-th layer.
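The per-layer residual update in Equation (1) can be sketched as follows. The `attn_fn`/`mlp_fn` callables and the choice of feeding the post-attention stream into the MLP are illustrative assumptions; the exact wiring varies across architectures:

```python
import numpy as np

def transformer_layer(h_prev, attn_fn, mlp_fn):
    """One Transformer layer following Eq. (1): h^l = h^{l-1} + a^l + m^l.

    h_prev : (seq_len, d) residual stream entering layer l.
    attn_fn, mlp_fn : stand-ins for the MHSA and MLP modules.
    NOTE (assumption): the MLP here reads the post-attention stream,
    as in LLaMA-style blocks; some architectures wire this differently.
    """
    a = attn_fn(h_prev)        # MHSA output a^l
    m = mlp_fn(h_prev + a)     # MLP output m^l
    return h_prev + a + m      # updated residual stream h^l
```

Because the update is additive, $\mathbf{a}^{\ell}$ and $\mathbf{m}^{\ell}$ can be analyzed as separate writes into the residual stream, which is what the causal interventions in § 4.1 exploit.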
Internal Process of Knowledge Recall
Prior work investigates the internal activations of LLMs to study the mechanics of knowledge recall. For example, an LLM may encode many attributes that are associated with a subject (e.g., Barack Obama) (Geva et al., 2023). Given a prompt like “Barack Obama was born in the city of”, if the model has correctly encoded the fact, the attribute “Honolulu” propagates through self-attention to the last token, yielding the correct answer. We hypothesize that non-factual predictions follow the same mechanism: spurious attributes such as “Chicago” are also encoded and propagated, leading the model to generate false outputs.
Categorization of Knowledge
To investigate how LLMs internally process factual queries, we define three categories of knowledge, according to two criteria: 1) factual correctness, and 2) subject representation reliance.
- Factual Associations (FA) refer to factual knowledge that is reliably stored in the parameters or internal states of an LLM and can be recalled to produce correct, verifiable outputs.
- Associated Hallucinations (AH) refer to non-factual content produced when an LLM relies on input-triggered parametric associations.
- Unassociated Hallucinations (UH) refer to non-factual content produced without reliance on parametric associations to the input.
(a) Factual Associations
(b) Associated Hallucinations
(c) Unassociated Hallucinations
Figure 2: Effect of interventions across layers of LLaMA-3-8B. The heatmap shows JS divergence between the output distribution before and after intervention. Darker color indicates that the intervened hidden states are more causally influential on the model’s predictions. Top row: patching representations of subject tokens. Middle row: blocking attention flow from subject to the last token. Bottom row: patching representations of the last token.
Dataset Construction
| Category | LLaMA-3-8B | Mistral-7B-v0.3 |
| --- | --- | --- |
| Factual Association | 3,506 | 3,354 |
| Associated Hallucination | 1,406 | 1,284 |
| Unassociated Hallucination | 7,381 | 7,655 |
| Total | 12,293 | 12,293 |
Table 1: Dataset statistics across categories.
Our study is conducted under a basic knowledge-based question answering setting. The model is given a prompt containing a subject and relation (e.g., “Barack Obama was born in the city of”) and is expected to predict the corresponding object (e.g., “Honolulu”). To build the dataset, we collect knowledge triples $(\text{subject},\text{relation},\text{object})$ from Wikidata. Each relation is paired with a handcrafted prompt template to convert triples into natural language queries. The details of relation selection and prompt templates are provided in Appendix A.1. We then apply the labeling scheme presented in Appendix A.2: correct predictions are labeled as FAs, while incorrect ones are classified as AHs or UHs depending on their subject representation reliance. Table 1 summarizes the final data statistics.
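The triple-to-query conversion can be illustrated with a minimal sketch. The relation names and template wordings below are hypothetical examples, not the handcrafted set from Appendix A.1, and the FA/AH/UH labeling step (Appendix A.2) is omitted:

```python
# Illustrative relation templates; the paper's actual set is handcrafted
# (Appendix A.1) and these keys/wordings are assumptions.
TEMPLATES = {
    "place_of_birth": "{subject} was born in the city of",
    "father":         "The name of the father of {subject} is",
}

def triple_to_query(subject, relation, obj):
    """Convert a (subject, relation, object) triple into a
    natural-language prompt and its gold answer."""
    prompt = TEMPLATES[relation].format(subject=subject)
    return prompt, obj
```

The model's completion of `prompt` is then compared against `obj` to decide whether the prediction is correct.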
Models
We conduct the experiments on two widely-adopted open-source LLMs, LLaMA-3 Dubey et al. (2024) and Mistral-v0.3 Jiang et al. (2023). Due to the space limit, details are presented in Appendix A.3, and parallel experimental results on Mistral are summarized in Appendix B.
4 Analysis of Internal States in LLMs
To focus our analysis, we first conduct causal interventions to identify hidden states that are crucial for eliciting factual associations (FAs). We then compare their behavior across associated hallucinations (AHs) and unassociated hallucinations (UHs). Prior studies Azaria and Mitchell (2023); Gottesman and Geva (2024); Yüksekgönül et al. (2024); Orgad et al. (2025) suggest that hidden states can reveal when a model hallucinates. This assumes that the model’s internal computations differ when producing correct versus incorrect outputs, causing their hidden states to occupy distinct subspaces. We revisit this claim by examining how hidden states update when recalling three categories of knowledge (i.e., FAs, AHs, and UHs). If hidden states primarily signal hallucination, AHs and UHs should behave similarly and diverge from FAs. Conversely, if hidden states reflect reliance on encoded knowledge, FAs and AHs should appear similar, and both should differ from UHs.
4.1 Causal Analysis of Information Flow
We identify hidden states that are crucial for factual prediction. For each knowledge tuple (subject, relation, object), the model is prompted with a factual query (e.g., “The name of the father of Joe Biden is”). Correct predictions indicate that the model successfully elicits parametric knowledge. Using causal mediation analysis Vig et al. (2020); Finlayson et al. (2021); Meng et al. (2022); Geva et al. (2023), we intervene on intermediate computations and measure the change in output distribution via JS divergence. A large divergence indicates that the intervened computation is critical for producing the fact. Specifically, to test whether token $i$’s hidden states in the MLP at layer $\ell$ are crucial for eliciting knowledge, we replace the computation with a corrupted version and observe how the output distribution changes. Similarly, following Geva et al. (2023), we mask the attention flow between tokens at layer $\ell$ using a window size of 5 layers. To streamline implementation, interventions target only subject tokens, attention flow, and the last token. Notable observations are as follows:
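The scoring step of this analysis, measuring JS divergence between the pre- and post-intervention next-token distributions, can be sketched as follows; the interventions themselves (patching corrupted subject states, masking attention) are omitted here:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two next-token distributions.

    p, q : probability vectors over the vocabulary (sum to 1).
    Uses the natural log, so values lie in [0, ln 2]; eps guards log(0).
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give a divergence of 0, while disjoint distributions approach the maximum of ln 2, so darker cells in Figure 2 mark interventions that substantially reshape the output distribution.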
Obs1: Hidden states crucial for eliciting factual associations.
The results in Figure 2(a) show that three components dominate factual predictions: (1) subject representations in early-layer MLPs, (2) mid-layer attention between subject tokens and the final token, and (3) the final token representations in later layers. These results trace a clear information flow: subject representation, attention flow from the subject to the last token, and last-token representation, consistent with Geva et al. (2023). These three types of internal states are discussed in detail respectively (§ 4.2 - 4.4).
Obs2: Associated hallucinations follow the same information flow as factual associations.
When generating AHs, interventions on these same components also produce large distribution shifts (Figure 2(b)). This indicates that, although outputs are factually wrong, the model still relies on encoded subject information.
Obs3: Unassociated hallucinations present a different information flow.
In contrast, interventions during UH generation cause smaller distribution shifts (Figure 2(c)), showing weaker reliance on the subject. This suggests that UHs emerge from computations not anchored in the subject representation, different from both FAs and AHs.
4.2 Analysis of Subject Representations
The analysis in § 4.1 reveals that unassociated hallucinations (UHs) are processed differently from factual associations (FAs) and associated hallucinations (AHs) in the early layers of LLMs, which share a similar pattern. We examine how these differences emerge in the subject representations and why early-layer modules behave this way.
4.2.1 Norm of Subject Representations
Figure 3: Norm ratio curves of subject representations in LLaMA-3-8B, comparing AHs and UHs against FAs as the baseline.
To test whether subject representations differ across categories, we measure the average $L_{2}$ norm of subject-token hidden activations across layers. For subject tokens $t_{s_{1}},\dots,t_{s_{n}}$ at layer $\ell$, the average norm is $\|\mathbf{h}_{s}^{\ell}\| = \tfrac{1}{n}\sum_{i=1}^{n}\|\mathbf{h}_{s_{i}}^{\ell}\|_{2}$, where $\mathbf{h}^{\ell}$ is given by Equation (1). We compare the norm ratio between hallucination samples (AHs or UHs) and correct predictions (FAs), where a ratio near 1 indicates similar norms. Figure 3 shows that in LLaMA-3-8B, AH norms closely match those of correct samples (ratio $\approx 0.99$), while UH norms are consistently smaller, starting at the first layer (ratio $\approx 0.96$) and diverging further through mid-layers.
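A minimal sketch of this norm-ratio computation (array shapes and helper names are illustrative):

```python
import numpy as np

def subject_norm(hidden_states, subject_idx):
    """Average L2 norm of subject-token hidden states at one layer.

    hidden_states : (seq_len, d) activations at layer l.
    subject_idx   : positions of the subject tokens t_{s_1}, ..., t_{s_n}.
    """
    return np.mean([np.linalg.norm(hidden_states[i]) for i in subject_idx])

def norm_ratio(hallu_norms, factual_norms):
    """Per-layer ratio of mean subject norms (hallucination / FA baseline);
    a ratio near 1 means similar activation strength in the two categories."""
    return np.mean(hallu_norms) / np.mean(factual_norms)
```

In this setup, the curves in Figure 3 correspond to `norm_ratio` evaluated layer by layer, once for AH samples and once for UH samples, against the FA baseline.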
Findings:
At early layers, UH subject representations exhibit weaker activations than FAs, whereas AHs exhibit norms similar to FAs.
4.2.2 Relation to Parametric Knowledge
Figure 4: Comparison of subspace overlap ratios.
We next investigate why early layers encode subject representations differently across knowledge types by examining how inputs interact with the parametric knowledge stored in MLP modules. Following the observation of Kang et al. (2024), the output norm of an MLP layer depends on how well its input aligns with the subspace spanned by the weight matrix: poorly aligned inputs yield smaller output norms.
For each MLP layer $\ell$ , we analyze the down-projection weight matrix $W_{\text{down}}^{\ell}$ and its input $x^{\ell}$ . Given the input $x_{s}^{\ell}$ corresponding to the subject tokens, we compute its overlap ratio with the top singular subspace $V_{\text{top}}$ of $W_{\text{down}}^{\ell}$ :
$$
r(x_{s}^{\ell})=\frac{\left\lVert{x_{s}^{\ell}}^{\top}V_{\text{top}}V_{\text{top}}^{\top}\right\rVert^{2}}{\left\lVert x_{s}^{\ell}\right\rVert^{2}}. \tag{2}
$$
A higher overlap ratio $r(x_{s}^{\ell})$ indicates stronger alignment to the subspace spanned by $W_{\text{down}}^{\ell}$ , leading to larger output norms.
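Equation (2) can be evaluated directly from an SVD of the weight matrix. The sketch below assumes `W` is stored so that its rows act on the input space, and the subspace size `k` is a free choice (the paper's exact value is not reproduced here):

```python
import numpy as np

def overlap_ratio(x, W, k):
    """r(x): squared norm of x's projection onto the top-k right-singular
    subspace of W, divided by ||x||^2 (Eq. 2, up to notation).

    x: (n,) MLP input for a subject token; W: (m, n) down-projection weight.
    The top-k right singular vectors of W play the role of V_top.
    """
    _, _, Vt = np.linalg.svd(W, full_matrices=False)  # rows of Vt span input space
    coords = Vt[:k] @ x                               # coordinates in the top subspace
    return float(coords @ coords) / float(x @ x)
```

Because the singular values returned by `np.linalg.svd` are sorted in descending order, slicing the first `k` rows of `Vt` selects the top singular subspace.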
To highlight relative deviations from the factual baseline (FA), we report the relative ratios between AH/FA and UH/FA. Focusing on the layer with the largest UH norm shift, Figure 4 shows that UHs have significantly lower $r(x_{s}^{\ell})$ than AHs in both LLaMA and Mistral. This reveals that early-layer parametric weights are more aligned with FA and AH subject representations than with UH subjects, producing higher norms for the former. These results also suggest that the model has sufficiently learned representations for FA and AH subjects during pretraining but not for UH subjects.
Findings:
Similar to FAs, AH hidden activations align closely with the weight subspace, while UHs do not. This indicates that the model has sufficiently encoded subject representations into parametric knowledge for FAs and AHs but not for UHs.
4.2.3 Correlation with Subject Popularity
<details>
<summary>x7.png Details</summary>

Grouped bar chart of sample percentages by subject-popularity bin (Low, Mid, High). Low: Unassociated Hallucinations 94%, Factual Associations 5%, Associated Hallucinations 1%. Mid: 66% / 27% / 7%. High: 34% / 52% / 14%. UHs dominate low-popularity subjects, while FAs and AHs grow with popularity.
</details>
Figure 5: Sample distribution across different subject popularity (low, mid, high) in LLaMA-3-8B, measured by monthly Wikipedia page views.
We further investigate why AH representations align with weight subspaces as strongly as FAs, while UHs do not. A natural hypothesis is that this difference arises from subject popularity in the training data. We use average monthly Wikipedia page views as a proxy for subject popularity during pre-training and bin subjects by popularity, then measure the distribution of UHs, AHs, and FAs. Figure 5 shows a clear trend: UHs dominate among the least popular subjects (94% for LLaMA), while AHs are rare (1%). As subject popularity rises, UH frequency falls and both FAs and AHs become more common, with AHs rising to 14% in the high-popularity subjects. This indicates that subject representation norms reflect training frequency, not factual correctness.
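The binning and per-bin distribution behind Figure 5 can be sketched as below; the tercile split and category labels are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def popularity_bins(page_views, n_bins=3):
    """Assign each subject to a popularity bin (0 = low .. n_bins-1 = high)
    by monthly Wikipedia page views, using quantile edges."""
    edges = np.quantile(page_views, [i / n_bins for i in range(1, n_bins)])
    return np.digitize(page_views, edges)

def category_distribution(bins, categories):
    """Percentage of each category (e.g., 'FA', 'AH', 'UH') within each bin."""
    result = {}
    for b in sorted(set(bins)):
        counts = Counter(c for bi, c in zip(bins, categories) if bi == b)
        total = sum(counts.values())
        result[b] = {c: 100.0 * n / total for c, n in counts.items()}
    return result
```

A skew like Figure 5's then shows up directly: the low bin is dominated by 'UH' while the high bin tilts toward 'FA' and 'AH'.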
Findings:
Popular subjects yield stronger early-layer activations. AHs arise mainly on popular subjects and are therefore indistinguishable from FAs by popularity-based heuristics, contradicting prior work Mallen et al. (2023a) that links popularity to hallucinations.
4.3 Analysis of Attention Flow
Having examined how the model forms subject representations, we next study how this information is propagated to the last token of the input, where the model generates the object of a knowledge tuple. To produce factually correct outputs at the last token, the model must process the subject representation and propagate it through attention layers so that it can be read out at the last position to produce the output Geva et al. (2023).
To quantify the specific contribution from subject tokens $(s_{1},...,s_{n})$ to the last token, we compute the attention contribution from subject tokens to the last position:
$$
\mathbf{a}^{\ell}_{\text{last}}=\sum\nolimits_{k}\sum\nolimits_{h}A^{\ell,h}_{\text{last},s_{k}}(\mathbf{h}^{\ell-1}_{s_{k}}W^{\ell,h}_{V})W^{\ell,h}_{O}. \tag{3}
$$
where $A^{\ell,h}_{i,j}$ denotes the attention weight assigned by the $h$-th head in layer $\ell$ from the last position $i$ to subject token $j$. Here, $\mathbf{a}^{\ell}_{\text{last}}$ represents the subject-to-last attention contribution at layer $\ell$. Intuitively, if subject information is critical for prediction, this contribution should have a large norm; otherwise, the norm should be small.
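Equation (3) can be sketched directly from its terms; the tensor layout below (separate per-head value and output projections) is one common convention, not necessarily the exact one used by the authors:

```python
import numpy as np

def subject_to_last_contribution(A, h_prev, W_V, W_O, subject_idx):
    """Norm of the subject-to-last attention contribution at one layer (Eq. 3).

    A:      (n_heads, seq, seq) attention weights of this layer.
    h_prev: (seq, d_model) hidden states from the previous layer.
    W_V:    (n_heads, d_model, d_head) value projections.
    W_O:    (n_heads, d_head, d_model) per-head output projections.
    """
    last = A.shape[-1] - 1
    contrib = np.zeros(h_prev.shape[-1])
    for h in range(A.shape[0]):           # sum over heads
        for k in subject_idx:             # sum over subject tokens
            contrib += A[h, last, k] * (h_prev[k] @ W_V[h]) @ W_O[h]
    return float(np.linalg.norm(contrib))
```

Running this per layer and per sample category yields curves like Figure 6.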
Figure 6 shows that in LLaMA-3-8B, both AHs and FAs exhibit large attention-contribution norms in mid-layers, indicating a strong information flow from subject tokens to the target token. In contrast, UHs show consistently lower norms, implying that their predictions rely far less on subject information. Yüksekgönül et al. (2024) previously argued that high attention flow from subject tokens signals factuality and proposed using attention-based hidden states to detect hallucinations. Our results challenge this view: the model propagates subject information just as strongly when generating AHs as when producing correct facts.
Findings:
Mid-layer attention flow from subject to last token is equally strong for AHs and FAs but weak for UHs. Attention-based heuristics can therefore separate UHs from FAs but cannot distinguish AHs from factual outputs, limiting their reliability for hallucination detection.
<details>
<summary>x8.png Details</summary>

Line plot of norms across layers 0-30. Factual Associations (green) and Associated Hallucinations (blue) both spike sharply in mid-layers (peaks of roughly 1.8-2.0 around layers 18-22) before declining toward layer 30, while Unassociated Hallucinations (red) stay low throughout (roughly 0.1-0.6).
</details>
Figure 6: Subject-to-last attention contribution norms across layers in LLaMA-3-8B. Values show the norm of the attention contribution from subject tokens to the last token at each layer.
4.4 Analysis of Last Token Representations
Our earlier analysis showed strong subject-to-last token information transfer for both FAs and AHs, but minimal transfer for UHs. We now examine how this difference shapes the distribution of last-token representations. When subject information is weakly propagated (UHs), last-token states receive little subject-specific update. For UH samples sharing the same prompt template, these states should therefore cluster in the representation space. In contrast, strong subject-driven propagation in FAs and AHs produces diverse last-token states that disperse into distinct subspaces.
To test this, we compute cosine similarity among last-token representations $\mathbf{h}_{T}^{\ell}$ . As shown in Figure 7, similarity is high ( $≈$ 0.9) for all categories in early layers, when little subject information is transferred. From mid-layers onward, FAs and AHs diverge sharply, dropping to $≈$ 0.2 by layer 25. UHs remain moderately clustered, with similarity only declining to $≈$ 0.5.
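A minimal way to compute the per-layer clustering measure (average pairwise cosine similarity of last-token states) is sketched below; the row layout is an assumption:

```python
import numpy as np

def mean_pairwise_cosine(H):
    """Average pairwise cosine similarity among rows of H (n_samples, d_model)."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize rows
    sims = Hn @ Hn.T                                   # full cosine matrix
    iu = np.triu_indices(len(H), k=1)                  # each unordered pair once
    return float(sims[iu].mean())
```

Values near 1 indicate tightly clustered representations (as for UHs), while values near 0 indicate dispersed, subject-specific states (as for FAs and AHs in later layers).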
Figure 8 shows the t-SNE visualization of the last token's representations at layer 25 of LLaMA-3-8B. The hidden representations of UHs are clearly separated from FAs, whereas those of AHs substantially overlap with FAs. These results indicate that the model processes UHs differently from FAs, while processing AHs in a manner similar to FAs. More visualizations can be found in Appendix C.
<details>
<summary>x9.png Details</summary>

Line plot of cosine similarity across layers 0-30. All three categories start near 0.9. From mid-layers onward, Factual Associations (green) and Associated Hallucinations (blue) drop sharply (to roughly 0.3-0.4 around layer 25) before a partial recovery, while Unassociated Hallucinations (red) remain notably higher (roughly 0.55-0.65) through the final layers.
</details>
Figure 7: Cosine similarity of target-token hidden states across layers in LLaMA-3-8B.
<details>
<summary>x10.png Details</summary>

t-SNE scatter plot. Factual Association points (green) cluster toward the left, Unassociated Hallucination points (red) form a largely separate cluster in the upper right, and Associated Hallucination points (blue) spread across the central region, overlapping heavily with the FA cluster.
</details>
Figure 8: t-SNE visualization of last token’s representations at layer 25 of LLaMA-3-8B.
<details>
<summary>x11.png Details</summary>

Violin plots of answer-token probability for LLaMA-3-8B and Mistral-7B-v0.3. Factual Associations and Associated Hallucinations have similar, higher medians (roughly 0.3-0.5 depending on the model), while Unassociated Hallucinations concentrate near 0.1 for both models.
</details>
Figure 9: Distribution of last token probabilities.
This separation also appears in the entropy of the output distribution (Figure 9). Strong subject-to-last propagation in FAs and AHs yields low-entropy predictions concentrated on the correct or associated entity. In contrast, weak propagation in UHs produces broad, high-entropy distributions, spreading probability mass across many plausible candidates (e.g., multiple possible names for “ The name of the father of <subject> is ”).
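The entropy measure used here is the standard Shannon entropy of the next-token distribution; a minimal sketch (the `eps` guard against log(0) is our addition):

```python
import numpy as np

def output_entropy(probs, eps=1e-12):
    """Shannon entropy (nats) of a next-token distribution.

    Low entropy: mass concentrated on one candidate (FA/AH-like outputs);
    high entropy: mass spread over many plausible candidates (UH-like)."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()                      # normalize defensively
    return float(-(p * np.log(p + eps)).sum())
```

For a vocabulary-sized distribution, a UH prompt like "The name of the father of &lt;subject&gt; is" would spread mass over many names, yielding a value close to the uniform-distribution maximum.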
Finding:
From mid-layers onward, UHs retain clustered last-token representations and high-entropy outputs, while FAs and AHs diverge into subject-specific subspaces with low-entropy outputs. This provides a clear signal for separating UHs from FAs and AHs, but no signal for separating AHs from FAs.
5 Revisiting Hallucination Detection
The mechanistic analysis in § 4 reveals that the internal states of LLMs primarily capture how the model recalls and utilizes its parametric knowledge, not whether the output is truthful. As both factual associations (FAs) and associated hallucinations (AHs) rely on the same subject-driven knowledge recall, their internal states show no clear separation. We therefore hypothesize that internal or black-box signals cannot effectively distinguish AHs from FAs, even though they could be effective in distinguishing unassociated hallucinations (UHs), which do not rely on parametric knowledge, from FAs.
Experimental Setups
To verify this, we revisit the effectiveness of widely-adopted white-box hallucination detection approaches that use internal state probing as well as black-box approaches that rely on scalar features. We evaluate on three settings: 1) AH Only (1,000 FAs and 1,000 AHs for training; 200 of each for testing), 2) UH Only (1,000 FAs and 1,000 UHs for training; 200 of each for testing), and 3) Full (1,000 FAs and 1,000 hallucination samples mixed of AHs and UHs for training; 200 of each for testing). For each setting, we use five random seeds to construct the training and testing datasets. We report the mean AUROC along with its standard deviation across seeds.
White-box methods: We extract and normalize internal features and then train a probe.
- Subject representations: last subject token hidden state from three consecutive layers Gottesman and Geva (2024).
- Attention flow: attention weights from the last token to subject tokens across all layers Yüksekgönül et al. (2024).
- Last-token representations: final token hidden state from the last layer Orgad et al. (2025).
Black-box methods: We test two commonly used scalar features, answer token probability (Orgad et al., 2025) and subject popularity (average monthly Wikipedia page views) (Mallen et al., 2023a). As discussed in § 4.2.3 and § 4.4, these features are also related to whether the model relies on encoded knowledge to produce outputs rather than to truthfulness itself.
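All detectors above are scored with AUROC, which reduces to the probability that a random hallucination sample is scored above a random factual sample (the Mann-Whitney statistic). A minimal, library-free reference implementation:

```python
def auroc(scores, labels):
    """AUROC of detector scores against binary labels (1 = hallucination).

    Equals the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half; O(n^2), fine for small evaluation sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A probe that cannot separate AHs from FAs produces scores whose ranking is near-random, driving this value toward 0.5.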
Experimental Results
| Method | AH Only (LLaMA-3-8B) | UH Only (LLaMA-3-8B) | AH Only (Mistral-7B-v0.3) | UH Only (Mistral-7B-v0.3) |
| --- | --- | --- | --- | --- |
| Subject | $0.65± 0.02$ | $0.91± 0.01$ | $0.57± 0.02$ | $0.81± 0.02$ |
| Attention | $0.58± 0.04$ | $0.92± 0.02$ | $0.58± 0.07$ | $0.87± 0.01$ |
| Last Token | $\mathbf{0.69± 0.03}$ | $\mathbf{0.93± 0.01}$ | $\mathbf{0.63± 0.02}$ | $\mathbf{0.92± 0.01}$ |
| Probability | $0.49± 0.01$ | $0.86± 0.01$ | $0.46± 0.00$ | $0.89± 0.00$ |
| Subject Pop. | $0.48± 0.01$ | $0.87± 0.01$ | $0.52± 0.01$ | $0.84± 0.01$ |
Table 2: Hallucination detection performance on AH Only and UH Only settings.
<details>
<summary>x12.png Details</summary>

Bar chart of AUROC by feature type (Subject, Attention, Last Token) in the Full setting. On Unassociated Hallucination test samples, AUROC is roughly 0.83-0.87; on Associated Hallucination samples it stays near 0.57-0.60. Error bars are narrow (about ±0.03-0.04).
</details>
Figure 10: Hallucination detection performance on the Full setting (LLaMA-3-8B).
Table 2 shows that hallucination detection methods behave very differently in the AH Only and UH Only settings. For white-box probes, all approaches effectively distinguish UHs from FAs, with last-token hidden states reaching AUROC scores of about 0.93 for LLaMA and 0.92 for Mistral. In contrast, performance drops sharply on the AH Only setting, where the last-token probe falls to 0.69 for LLaMA and 0.63 for Mistral. Black-box methods follow the same pattern. Figure 10 further highlights this disparity under the Full setting: detection is consistently stronger on UH samples than on AH samples, and adding AHs to the training set significantly dilutes performance on UHs (AUROC $≈$ 0.9 on UH Only vs. $≈$ 0.8 on Full).
These results confirm that both internal probes and black-box methods capture whether a model draws on parametric knowledge, not whether its outputs are factually correct. Unassociated hallucinations are easier to detect because they bypass this knowledge, while associated hallucinations are produced through the same recall process as factual answers, leaving no internal cues to distinguish them. As a result, LLMs lack intrinsic awareness of their own truthfulness, and detection methods relying on these signals risk misclassifying associated hallucinations as correct, fostering harmful overconfidence in model outputs.
6 Challenges of Refusal Tuning
A common strategy to mitigate potential hallucination in the model’s responses is to fine-tune LLMs to refuse answering when they cannot provide a factual response, e.g., Refusal Tuning Zhang et al. (2024). For such refusal capability to generalize, the training data must contain a shared feature pattern across hallucinated outputs, allowing the model to learn and apply it to unseen cases.
Our analysis in the previous sections shows that this prerequisite is not met. The structural mismatch between UHs and AHs suggests that refusal tuning on UHs may generalize to other UHs, because their hidden states occupy a common activation subspace, but will not transfer to AHs. Refusal tuning on AHs is even less effective, as their diverse representations prevent generalization to either unseen AHs or UHs.
Experimental Setups
To verify this hypothesis, we conduct refusal tuning on LLMs under two settings: 1) UH Only, where 1,000 UH samples are paired with 10 refusal templates, and 1,000 FA samples are preserved with their original answers; 2) AH Only, where 1,000 AH samples are paired with refusal templates, with 1,000 FA samples again left unchanged. We then evaluate both models on 200 samples each of FAs, UHs, and AHs. A response matching any refusal template is counted as a refusal, and we report the Refusal Ratio as the proportion of samples eliciting refusals. This measures not only whether the model refuses appropriately on UHs and AHs, but also whether it “over-refuses” on FA samples.
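The Refusal Ratio metric can be sketched as follows; case-insensitive substring matching is our simplification, since the paper's exact matching rule is not specified here:

```python
def refusal_ratio(responses, refusal_templates):
    """Fraction of model responses matching any refusal template."""
    def is_refusal(response):
        r = response.lower()
        return any(t.lower() in r for t in refusal_templates)
    return sum(map(is_refusal, responses)) / len(responses)
```

Computed separately on the FA, AH, and UH evaluation sets, this yields the per-category bars reported below.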
Experimental Results
<details>
<summary>x13.png Details</summary>

Bar chart of the refusal ratio (%) on Factual Association, Associative Hallucination, and Unassociated Hallucination test samples under the two training settings. UH Only: roughly 30% (FA), 28% (AH), and 80% (UH). AH Only: roughly 20% (FA), 32% (AH), and 22% (UH).
</details>
Figure 11: Refusal tuning performance across three types of samples (LLaMA-3-8B).
Figure 11 shows that training with UHs leads to strong generalization across UHs, with a refusal ratio of 82% for LLaMA. However, this effect does not transfer to AHs, where the refusal ratio falls to 28%. Moreover, some FA cases are mistakenly refused (29.5%). These results confirm that UHs share a common activation subspace, supporting generalization within the category, while AHs and FAs lie outside this space. By contrast, training with AHs produces poor generalization. On AH test samples, the refusal ratio is only 33%, validating that their subject-specific hidden states prevent consistent refusal learning. Generalization to UHs is also weak (23.5%), again reflecting the divergence between the AH and UH activation spaces.
Overall, these findings show that the generalizability of refusal tuning is fundamentally limited by the heterogeneous nature of hallucinations. UH representations are internally consistent enough to support refusal generalization, but AH representations are too diverse for either UH-based or AH-based training to yield a broadly applicable and reliable refusal capability.
7 Conclusions and Future Work
In this work, we revisit the widely accepted claim that hallucinations can be detected from a model’s internal states. Our mechanistic analysis reveals that hidden states encode whether models rely on their parametric knowledge, rather than truthfulness. As a result, detection methods succeed only when outputs are detached from the input, but fail when hallucinations arise from the same knowledge-recall process as correct answers.
These findings lead to three key implications. First, future evaluations should report detection performance separately for Associated Hallucinations (AHs) and Unassociated Hallucinations (UHs), as they stem from fundamentally different internal processes and require distinct detection strategies. Second, relying solely on hidden states is insufficient for reliable hallucination detection. Future research should integrate LLMs with external feedback mechanisms, such as fact-checking modules or retrieval-based verifiers, to assess factuality more robustly. Third, future studies should prioritize improving AH detection. Because AHs occur more frequently in widely known or highly popular topics (§ 4.2.3), their undetected errors pose greater risks to user trust and the practical reliability of LLMs.
Limitations
We identify several limitations of our work.
Focus on Factual Knowledge
While our analysis identifies failure cases of hallucination detection methods, our study is primarily limited to factual completion prompts. It does not extend to long-form or open-ended text generation tasks Wei et al. (2024); Min et al. (2023); Huang and Chen (2024). Future work should broaden this investigation to these tasks in order to draw more comprehensive conclusions.
Lack of Analysis on Prompt-based Hallucination Detection Approaches
Our analysis focuses on white-box hallucination detection methods based on internal states and two black-box approaches based on external features. We do not include verbalization-based strategies Lin et al. (2022a); Tian et al. (2023); Xiong et al. (2024); Yang et al. (2024b); Ni et al. (2024); Zhao et al. (2024), such as prompting the model to report or justify its confidence explicitly, which constitute a different line of approach. Exploring such approaches may offer complementary insights into how models internally represent and express uncertainty.
Applicability to Black-box LLMs or Large Reasoning Models
Our study is limited to open-source LLMs. Conducting mechanistic analyses on commercial black-box LLMs is not permitted due to access restrictions. Future work could explore alternative evaluation protocols or collaboration frameworks that enable partial interpretability analyses on such systems. In addition, recent studies Mei et al. (2025); Zhang et al. (2025) have begun examining the internal states of large reasoning models for hallucination detection, suggesting a promising direction for extending our methodology to models with multi-step reasoning capabilities.
Ethical Considerations
This work analyzes the internal mechanisms of large language models using data constructed from Wikidata Vrandecic and Krötzsch (2014), which is released under the Creative Commons CC0 1.0 Universal license, allowing unrestricted use and redistribution of its data. All data are derived from publicly available resources, and no private or sensitive information about individuals is included. We employed LLM tools only for polishing the writing.
References
- Azaria and Mitchell (2023) Amos Azaria and Tom M. Mitchell. 2023. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976.
- Cheang et al. (2023) Chi Seng Cheang, Hou Pong Chan, Derek F. Wong, Xuebo Liu, Zhaocong Li, Yanming Sun, Shudong Liu, and Lidia S. Chao. 2023. Can lms generalize to future data? an empirical analysis on text summarization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 16205–16217. Association for Computational Linguistics.
- Chen et al. (2024) Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: llms’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Daniel Han and team (2023) Daniel Han, Michael Han, and Unsloth team. 2023. Unsloth.
- Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Ding et al. (2024) Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. 2024. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. CoRR, abs/2402.10612.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, and 82 others. 2024. The llama 3 herd of models. CoRR, abs/2407.21783.
- Finlayson et al. (2021) Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart M. Shieber, Tal Linzen, and Yonatan Belinkov. 2021. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1828–1843. Association for Computational Linguistics.
- Gekhman et al. (2025) Zorik Gekhman, Eyal Ben-David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. 2025. Inside-out: Hidden factual knowledge in llms. CoRR, abs/2503.15299.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12216–12235. Association for Computational Linguistics.
- Gottesman and Geva (2024) Daniela Gottesman and Mor Geva. 2024. Estimating knowledge in large language models without generating a single token. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, pages 3994–4019.
- Guerreiro et al. (2023) Nuno Miguel Guerreiro, Elena Voita, and André F. T. Martins. 2023. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 1059–1075. Association for Computational Linguistics.
- Huang and Chen (2024) Chao-Wei Huang and Yun-Nung Chen. 2024. Factalign: Long-form factuality alignment of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16363–16375.
- Huang et al. (2025) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43(2):42:1–42:55.
- Ji et al. (2024) Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, and Pascale Fung. 2024. LLM internal states reveal hallucination risk faced with a query. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 88–104, Miami, Florida, US. Association for Computational Linguistics.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
- Kang and Choi (2023) Cheongwoong Kang and Jaesik Choi. 2023. Impact of co-occurrence on factual knowledge of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7721–7735.
- Kang et al. (2024) Katie Kang, Amrith Setlur, Claire J. Tomlin, and Sergey Levine. 2024. Deep neural networks tend to extrapolate predictably. In The Twelfth International Conference on Learning Representations, ICLR 2024.
- Kapoor et al. (2024) Sanyam Kapoor, Nate Gruver, Manley Roberts, Katie Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. 2024. Large language models must be taught to know what they don’t know. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.
- Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Li et al. (2023) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36:41451–41530.
- Li et al. (2025) Moxin Li, Yong Zhao, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, Tat-Seng Chua, and Yang Deng. 2025. Knowledge boundary of large language models: A survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, pages 5131–5157.
- Lin et al. (2022a) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022a. Teaching models to express their uncertainty in words. Trans. Mach. Learn. Res., 2022.
- Lin et al. (2022b) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022b. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 3214–3252.
- Mallen et al. (2023a) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023a. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, pages 9802–9822.
- Mallen et al. (2023b) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9802–9822. Association for Computational Linguistics.
- Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9004–9017. Association for Computational Linguistics.
- Mei et al. (2025) Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, and Anirudha Majumdar. 2025. Reasoning about uncertainty: Do reasoning models know when they don’t know? CoRR, abs/2506.18183.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, pages 12076–12100.
- Ni et al. (2024) Shiyu Ni, Keping Bi, Jiafeng Guo, and Xueqi Cheng. 2024. When do llms need retrieval augmentation? mitigating llms’ overconfidence helps retrieval augmentation. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11375–11388. Association for Computational Linguistics.
- Ni et al. (2025) Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, and Xueqi Cheng. 2025. Towards fully exploiting LLM internal states to enhance knowledge boundary perception. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24315–24329. Association for Computational Linguistics.
- Orgad et al. (2025) Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. 2025. Llms know more than they show: On the intrinsic representation of LLM hallucinations. In The Thirteenth International Conference on Learning Representations, ICLR 2025.
- Sciavolino et al. (2021) Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. 2021. Simple entity-centric questions challenge dense retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 6138–6148. Association for Computational Linguistics.
- Su et al. (2024) Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 14379–14391. Association for Computational Linguistics.
- Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 5433–5442. Association for Computational Linguistics.
- Varshney et al. (2023) Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. CoRR, abs/2307.03987.
- Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388–12401.
- Vrandecic and Krötzsch (2014) Denny Vrandecic and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85.
- Wei et al. (2024) Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V. Le. 2024. Long-form factuality in large language models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
- Xiao et al. (2025) Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, and Yu Rong. 2025. Analyzing llms’ knowledge boundary cognition across languages through the lens of internal representations. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24099–24115. Association for Computational Linguistics.
- Xiong et al. (2024) Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2024. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Yang et al. (2024a) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others. 2024a. Qwen2.5 technical report. CoRR, abs/2412.15115.
- Yang et al. (2024b) Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. 2024b. Alignment for honesty. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.
- Yin et al. (2023) Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do large language models know what they don’t know? In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 8653–8665. Association for Computational Linguistics.
- Yona et al. (2024) Gal Yona, Roee Aharoni, and Mor Geva. 2024. Narrowing the knowledge evaluation gap: Open-domain question answering with multi-granularity answers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 6737–6751. Association for Computational Linguistics.
- Yüksekgönül et al. (2024) Mert Yüksekgönül, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, and Besmira Nushi. 2024. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. In The Twelfth International Conference on Learning Representations, ICLR 2024.
- Zhang et al. (2024) Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2024. R-tuning: Instructing large language models to say ’i don’t know’. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, pages 7113–7139.
- Zhang et al. (2023a) Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, and Kumar Sricharan. 2023a. SAC³: Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. CoRR, abs/2311.01740.
- Zhang et al. (2025) Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, and Han Qiu. 2025. On the self-awareness of large reasoning models’ capability boundaries. Preprint, arXiv:2509.24711.
- Zhang et al. (2023b) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023b. Siren’s song in the AI ocean: A survey on hallucination in large language models. CoRR, abs/2309.01219.
- Zhao et al. (2024) Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Chong Meng, Shuaiqiang Wang, Zhicong Cheng, Zhaochun Ren, and Dawei Yin. 2024. Knowing what llms DO NOT know: A simple yet effective self-detection method. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, pages 7051–7063.
Appendix
Appendix A Datasets and Implementations
A.1 Selected Relations and Prompt Templates
We employed a set of criteria to select relations from Wikidata in order to construct our dataset. Our criteria largely follow the framework proposed by Gekhman et al. (2025). Specifically, we require that each factual query in the dataset be unambiguous: given a subject–relation pair, the object should be unique and easily verifiable. The criteria are as follows:
- Avoid granularity ambiguity. We exclude relations whose answers can vary in their level of detail. For example, in location queries, the response could be expressed as a city, state, or country, making it ill-defined Yona et al. (2024).
- Avoid surface-level guessing. We exclude relations whose correct answers can often be inferred from surface cues. For instance, country of citizenship can frequently be guessed from shallow lexical patterns, rather than reflecting actual memorization Mallen et al. (2023b).
Following these criteria, Gekhman et al. (2025) narrowed the 24 relations introduced by Sciavolino et al. (2021) down to four. However, we observe that their filtering primarily addresses ambiguity at the relation and object levels, but does not consider ambiguity at the subject level. In practice, some relations involve subjects that are inherently ambiguous. For example, the relation record label can be problematic because many songs share identical names, leading to unclear subject–object mappings.
To mitigate such cases, we apply an additional subject-level filtering step and restrict our dataset to relations where the subject is a person, thereby reducing ambiguity. In addition, we manually include certain relations to strengthen the dataset. Concretely, we use the following four relations: P22 (father), P25 (mother), P26 (spouse), and P569 (date of birth). We show the list of the templates used to create our dataset in Table 3.
| Relation | Prompt Template |
| --- | --- |
| father | The name of the father of [subject] is |
| mother | The name of the mother of [subject] is |
| spouse | The name of the spouse of [subject] is |
| date of birth | The birth date of [subject] is |
Table 3: Relations and prompt templates for querying factual knowledge of models. [subject] is a placeholder replaced with subject entities.
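Prompt construction from the templates in Table 3 reduces to filling the [subject] placeholder. A minimal sketch, where the example subject is illustrative:

```python
# Templates mirror Table 3, keyed by the four Wikidata relation IDs
# used in the dataset (P22, P25, P26, P569).
TEMPLATES = {
    "P22": "The name of the father of [subject] is",
    "P25": "The name of the mother of [subject] is",
    "P26": "The name of the spouse of [subject] is",
    "P569": "The birth date of [subject] is",
}

def build_prompt(relation_id: str, subject: str) -> str:
    """Fill the [subject] placeholder for a given relation."""
    return TEMPLATES[relation_id].replace("[subject]", subject)

# e.g. build_prompt("P22", "Barack Obama")
# -> "The name of the father of Barack Obama is"
```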
| I will give you a factual query (e.g., “The name of the father of <subj>”), a gold answer to the factual query, and a proposed answer generated by an LLM. You need to compare the proposed answer to the gold answer and assign it one of the possible grades using the steps below. |
| --- |
| Possible grades are: |
| A: CORRECT |
| B: INCORRECT |
| C: WRONG GOLD |
| D: ERROR |
| Spelling errors, synonyms, abbreviations, or hedging expressions (e.g., “it is possible that”) should not alter the grade if the person referred to in the proposed answer matches the gold answer. |
| Steps: |
| Step 1: If the gold answer does not correspond to an answer for the question, output “C” and finish. Otherwise, proceed to Step 2. |
| Step 2: Extract all predicted entities from the proposed answer. Proceed to Step 3. |
| Step 3: If each predicted entity refers to the answer mentioned in the gold answer, output “A” and finish. Otherwise, proceed to Step 4. |
| Step 4: If the predicted entity does not refer to the gold answer, output “B” and finish. Otherwise, proceed to Step 5. |
| Step 5: Double-check whether the proposed answer refers to a different answer from the gold answer. If it does, output “B”. Otherwise, output “D” and finish. |
| Input format: |
| Question: {question} |
| Gold answer: {gold_answer} |
| Proposed answer: {proposed_answer} |
| Instruction: Output your reasoning steps. After that, conclude your response with “Output:” followed by the letter (A, B, C, or D). Do not provide any further explanation. |
Figure 12: LLM Judge prompt used for evaluation.
A.2 Labeling Scheme
We follow the criteria in § 3 to label the data samples into different categories:
- Factual Correctness: We construct correctness labels through a two-stage process. First, we use the spaCy (https://spacy.io/) named entity recognizer to extract the target entity from the model’s output. If it matches the ground truth, the answer is marked correct. Otherwise, or if extraction fails, we rely on Qwen2.5-14B-Instruct Yang et al. (2024a) as an automatic judge to compare the predicted answer with the ground truth. Following Gekhman et al. (2025), we design the evaluation prompt, shown in Figure 12.
- Subject Representation Reliance: We assess whether a prediction relies on the subject’s representation by blocking attention from subject tokens and measuring the resulting distribution shift. If the subject is crucial, masking disrupts information flow and yields a large shift; if not, the effect is minimal. Concretely, we compare the output distributions of the original prompt and the masked prompt (e.g., with “ Barack Obama ” masked), using Jensen–Shannon (JS) divergence to quantify the difference. A high JS divergence indicates strong reliance on the subject, while a low value suggests limited contribution. We then set a threshold based on the average JS divergence across all correct answers, assuming these inherently depend on subject representations.
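The subject-reliance measure above can be sketched as follows, computing the base-2 Jensen–Shannon divergence directly from its definition; the two toy distributions stand in for the model’s next-token distributions under the original and subject-masked prompts:

```python
import math

def kl(p, q):
    """KL(P || Q) in bits, skipping zero-probability terms of P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JSD(P || Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), with M = (P+Q)/2.
    In base 2 it ranges from 0 (identical) to 1 (disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy next-token distributions, not real model outputs.
p_original = [0.7, 0.2, 0.1]  # subject tokens attended
p_masked = [0.1, 0.3, 0.6]    # attention from subject tokens blocked

# A high score indicates strong reliance on the subject representation;
# in practice it is compared against a threshold set from the average
# JSD over all correct answers.
score = js_divergence(p_original, p_masked)
```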
<details>
<summary>x14.png Details</summary>

Heatmap of average JS divergence across layers 0–30 for three masking positions (Subj., Attn., Last.). Subject masking yields the highest divergence at every layer (from ~0.6 down to ~0.4), attention masking a moderate level (~0.3 down to ~0.15), and last-token masking the lowest (~0.1 down to ~0.05), with all three decreasing toward later layers.
</details>
(a) Factual Associations
(b) Associated Hallucinations
(c) Unassociated Hallucinations
Figure 13: Effect of interventions across layers of Mistral-7B-v0.3. The heatmap shows JS divergence between the output distribution before and after intervention. Darker color indicates that the intervened hidden states are more causally influential on the model’s predictions. Top row: patching representations of subject tokens. Middle row: blocking attention flow from subject to the last token. Bottom row: patching representations of the last token.
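For reference, the JS divergence used in this heatmap can be computed as in the following minimal sketch over discrete output distributions (pure Python with natural log; the base of the logarithm is a convention choice, and the example values are illustrative, not drawn from the experiments):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture m.
    Bounded by log(2) with natural log; zero iff p == q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give 0; disjoint support gives log(2).
p = [0.7, 0.2, 0.1]
print(js_divergence(p, p))  # -> 0.0
```

Larger values thus indicate that the intervention moved the model's output distribution further, i.e., the intervened states were more causally influential.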
A.3 Implementation Details
Checkpoints and GPU resources.
All checkpoints used in our experiments are provided by the Hugging Face Transformers library Wolf et al. (2019). Specifically, we use the checkpoints “meta-llama/Meta-Llama-3-8B” (https://huggingface.co/meta-llama/Meta-Llama-3-8B) and “mistralai/Mistral-7B-v0.3” (https://huggingface.co/mistralai/Mistral-7B-v0.3) for response generation (§ 3), hidden-state analysis (§ 4), and assessing the performance of hallucination detection methods (§ 5). For refusal tuning (§ 6), we use checkpoints provided by the Unsloth framework Daniel Han and team (2023), namely “unsloth/llama-3-8b” (https://huggingface.co/unsloth/llama-3-8b) and “unsloth/mistral-7b-v0.3” (https://huggingface.co/unsloth/mistral-7b-v0.3), which enable more efficient fine-tuning. All experiments are conducted on 4 NVIDIA L40S GPUs.
Figure 14: Norm ratio curves of subject representations in Mistral-7B-v0.3, comparing AHs and UHs against FAs as the baseline. At earlier layers, the norm of UH samples is significantly lower than that of AH samples.
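The norm ratio plotted here is simply the average L2 norm of one group's subject representations divided by that of the FA baseline; a ratio below 1 means the group's representations carry less magnitude. A minimal sketch with toy vectors (not real hidden states):

```python
import math

def l2_norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def norm_ratio(group_states, baseline_states):
    """Average norm of a group's representations relative to the
    factual-association baseline (ratio < 1 means smaller norms)."""
    avg_g = sum(l2_norm(v) for v in group_states) / len(group_states)
    avg_b = sum(l2_norm(v) for v in baseline_states) / len(baseline_states)
    return avg_g / avg_b

# Illustrative toy vectors only.
factual = [[3.0, 4.0], [6.0, 8.0]]  # norms 5 and 10, mean 7.5
hallu = [[0.0, 6.0], [0.0, 9.0]]    # norms 6 and 9, mean 7.5
print(norm_ratio(hallu, factual))   # -> 1.0
```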
Figure 15: Sample distribution across different subject popularity (low, mid, high) in Mistral-7B-v0.3, measured by monthly Wikipedia page views.
Decoding algorithm.
We employ greedy decoding ($\text{temperature}=0$) for response generation, with models run in BF16 precision.
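Conceptually, greedy decoding selects the highest-scoring token at every step, which is the limit of temperature sampling as the temperature goes to zero. The toy function below only illustrates this per-step argmax rule; the actual experiments run the Hugging Face checkpoints listed above:

```python
def greedy_decode(step_logits):
    """Greedy decoding: at each step, pick the token index with the
    highest logit (equivalent to sampling with temperature -> 0)."""
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in step_logits]

# Toy 4-token vocabulary over three decoding steps.
logits_per_step = [
    [0.1, 2.3, -1.0, 0.5],
    [1.7, 0.2, 0.4, -0.3],
    [-0.5, 0.0, 3.1, 1.2],
]
print(greedy_decode(logits_per_step))  # -> [1, 0, 2]
```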
PEFT settings for refusal tuning.
For refusal tuning, we fine-tune both models using QLoRA Dettmers et al. (2023), implemented with the Unsloth framework Daniel Han and team (2023), with rank $r=8$ and $\alpha=8$ . QLoRA adapters are applied to all attention and MLP modules, and each model is fine-tuned for one epoch.
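In `peft` terms, this adapter setup corresponds roughly to the configuration below. This is a sketch only: the experiments use Unsloth's wrappers rather than raw `peft`, and the target module names are assumptions based on the standard LLaMA/Mistral layer naming, not confirmed by the text.

```python
from peft import LoraConfig

# r = 8 and alpha = 8 as stated above; target modules cover all
# attention and MLP projections (names assumed from the common
# LLaMA/Mistral module layout).
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```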
Figure 16: Subject-to-last attention contribution norms across layers in Mistral-7B-v0.3. Values show the norm of the attention contribution from subject tokens to the last token at each layer.
Figure 17: Cosine similarity of target-token hidden states across layers in Mistral-7B-v0.3. From mid-layers onward, FAs and AHs diverge sharply as subject information propagates, while UHs remain more clustered, confirming weaker subject-dependent updates.
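The layer-wise similarity in this figure is standard cosine similarity between hidden-state vectors; a minimal sketch with toy vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two hidden-state vectors:
    dot product normalized by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors give ~1.0; orthogonal vectors give 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # -> 0.0
```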
Figure 18: t-SNE visualization of last token’s representations at layer 25 of Mistral-7B-v0.3.
Figure 19: Hallucination detection performance on the Full setting (Mistral-7B-v0.3).
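The AUROC reported here can be read as the probability that a randomly chosen hallucinated sample receives a higher detector score than a randomly chosen factual one (ties counted as one half); a minimal pairwise-ranking sketch with toy scores:

```python
def auroc(scores_pos, scores_neg):
    """AUROC via the pairwise-ranking formulation: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as 0.5."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Perfectly separated classes give 1.0; indistinguishable scores give 0.5.
print(auroc([0.9, 0.8], [0.2, 0.1]))  # -> 1.0
print(auroc([0.5, 0.5], [0.5, 0.5]))  # -> 0.5
```

An AUROC near 0.5, as observed for associated hallucinations, means the detector's scores are essentially uninformative for that class.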
Figure 20: Refusal tuning performance across three types of samples (Mistral-7B-v0.3).
Appendix B Parallel Experiments on Mistral
This section documents parallel experiments conducted on the Mistral-7B-v0.3 model under the same settings as described in the main text (Figures 13–20).
The results from Mistral exhibit patterns similar to those observed in LLaMA: we find consistent behavior in the model’s internal computations and hidden-state geometry, as well as in the performance of the hallucination detection and refusal tuning experiments.
Figure 21: t-SNE visualization of subject tokens’ representations at layer 11 of LLaMA-3-8B.
<details>
<summary>x25.png Details</summary>

</details>
Figure 22: t-SNE visualization of subject tokens’ representations at layer 11 of Mistral-7B-v0.3.
Appendix C More Visualization on Hidden States
In this section, we provide t-SNE visualizations of the subject tokens’ hidden states in Figures 21 and 22.
Compared to the last-token representations, the t-SNE visualizations of subject-token hidden states show that unassociated hallucinations (UHs) are moderately separated from factual and associated samples, but the separation is less distinct than in the last-token representations. This observation aligns with the results in § 5, where hallucination detection based on last-token hidden states outperforms detection based on subject-token representations.
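The projection behind these figures can be sketched as follows. This is a minimal, hedged illustration of the t-SNE step, assuming subject-token hidden states have already been extracted (e.g., layer-11 activations of LLaMA-3-8B, one vector per sample); random vectors and the group sizes below are synthetic stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
hidden_dim = 4096  # LLaMA-3-8B hidden size

# Hypothetical stand-ins for the three sample groups in Figures 21-22:
# factual associations, associated hallucinations, unassociated hallucinations.
factual = rng.normal(0.0, 1.0, size=(50, hidden_dim))
asso_hallu = rng.normal(0.2, 1.0, size=(50, hidden_dim))
unasso_hallu = rng.normal(2.0, 1.0, size=(50, hidden_dim))

states = np.vstack([factual, asso_hallu, unasso_hallu])
labels = ["Factual Asso."] * 50 + ["Asso. Hallu."] * 50 + ["Unasso. Hallu."] * 50

# Project the high-dimensional hidden states to 2D for scatter plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(states)
print(coords.shape)  # (150, 2)
```

The resulting `coords` array would then be plotted with one color per label, as in the figures; note that t-SNE axes carry no intrinsic meaning, only relative cluster structure.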