# Large Language Models Do NOT Really Know What They Don’t Know
Abstract
Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may “know what they don’t know”. However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that LLMs don’t really know what they don’t know.
Chi Seng Cheang 1, Hou Pong Chan 2, Wenxuan Zhang 3, Yang Deng 1
1 Singapore Management University, 2 DAMO Academy, Alibaba Group, 3 Singapore University of Technology and Design
cs.cheang.2025@phdcs.smu.edu.sg, houpong.chan@alibaba-inc.com, wxzhang@sutd.edu.sg, ydeng@smu.edu.sg
1 Introduction
Large language models (LLMs) demonstrate remarkable proficiency in generating coherent and contextually relevant text, yet they remain plagued by hallucination Zhang et al. (2023b); Huang et al. (2025), a phenomenon where outputs appear plausible but are factually inaccurate or entirely fabricated, raising concerns about their reliability and trustworthiness. To this end, researchers suggest that the internal states of LLMs (e.g., hidden representations Azaria and Mitchell (2023); Gottesman and Geva (2024), attention weights Yüksekgönül et al. (2024), output token logits Orgad et al. (2025); Varshney et al. (2023), etc.) can be used to detect hallucinations, indicating that LLMs themselves may actually know what they don’t know. These methods typically assume that when a model produces hallucinated outputs (e.g., “Barack Obama was born in the city of Tokyo” in Figure 1), its internal computations for the outputs (“Tokyo”) are detached from the input information (“Barack Obama”), thereby differing from those used to generate factually correct outputs. Thus, the hidden states are expected to capture this difference and serve as indicators of hallucinations.
[Figure 1 image (x1.png): a factual query (e.g., “Barack Obama was born in the city of”) is fed to the LLM; in the projected latent space, green points (factual associations, e.g., “Chicago” as study location) and blue points (associated hallucinations, e.g., “Chicago” as birth location) intermingle in one cluster, while red points (unassociated hallucinations, e.g., “Tokyo”) form a separate cluster.]
Figure 1: Illustration of three categories of knowledge. Associated hallucinations follow internal knowledge recall processes similar to those of factual associations, while unassociated hallucinations arise when the model’s output is detached from the input.
However, other research (Lin et al., 2022b; Kang and Choi, 2023; Cheang et al., 2023) shows that models can also generate false information that is closely associated with the input information. In particular, models may adopt knowledge shortcuts, favoring tokens that frequently co-occur in the training corpus over factually correct answers Kang and Choi (2023). As shown in Figure 1, given the prompt “Barack Obama was born in the city of”, an LLM may rely on the subject tokens’ representations (i.e., “Barack Obama”) to predict a hallucinated output (e.g., “Chicago”), which is statistically associated with the subject entity but in other contexts (e.g., “Barack Obama studied in the city of Chicago”). Therefore, we suspect that the internal computations may not exhibit distinguishable patterns between correct predictions and input-associated hallucinations, as LLMs rely on the input information to produce both. Only when the model produces hallucinations unassociated with the input do the hidden states exhibit distinct patterns that can be reliably identified.
To this end, we conduct a mechanistic analysis of how LLMs internally process factual queries. We first perform causal analysis to identify hidden states crucial for generating Factual Associations (FAs) — factually correct outputs grounded in subject knowledge. We then examine how these hidden states behave when the model produces two types of factual errors: Associated Hallucinations (AHs), which remain grounded in subject knowledge, and Unassociated Hallucinations (UHs), which are detached from it. Our analysis shows that when generating both FAs and AHs, LLMs propagate information encoded in subject representations to the final token during output generation, resulting in overlapping hidden-state geometries that cannot reliably distinguish AHs from FAs. In contrast, UHs exhibit distinct internal computational patterns, producing clearly separable hidden-state geometries from FAs.
Building on this analysis, we revisit several widely-used hallucination detection approaches Gottesman and Geva (2024); Yüksekgönül et al. (2024); Orgad et al. (2025) that probe internal states. The results show that these representations cannot reliably distinguish AHs from FAs due to their overlapping hidden-state geometries, though they can effectively separate UHs from FAs. Moreover, this geometry also limits the effectiveness of Refusal Tuning Zhang et al. (2024), which trains LLMs to refuse uncertain queries using a refusal-aware dataset. Because UH samples exhibit consistent and distinctive patterns, refusal tuning generalizes well to unseen UHs but fails to generalize to unseen AHs. We also find that AH hidden states are more diverse, and thus refusal tuning with AH samples prevents generalization across both AH and UH samples.
Together, these findings highlight a central limitation: LLMs do not encode truthfulness in their hidden states but only patterns of knowledge recall and utilization, showing that LLMs don’t really know what they don’t know.
2 Related Work
Existing hallucination detection methods can be broadly categorized into two types: representation-based and confidence-based. Representation-based methods assume that an LLM’s internal hidden states can reflect the correctness of its generated responses. These approaches train a classifier (often a linear probe) using the hidden states from a set of labeled correct/incorrect responses to predict whether a new response is hallucinatory Li et al. (2023); Azaria and Mitchell (2023); Su et al. (2024); Ji et al. (2024); Chen et al. (2024); Ni et al. (2025); Xiao et al. (2025). Confidence-based methods, in contrast, assume that lower confidence during generation leads to a higher probability of hallucination. These methods quantify uncertainty through various signals, including: (i) token-level output probabilities (Guerreiro et al., 2023; Varshney et al., 2023; Orgad et al., 2025); (ii) directly querying the LLM to verbalize its own confidence (Lin et al., 2022a; Tian et al., 2023; Xiong et al., 2024; Yang et al., 2024b; Ni et al., 2024; Zhao et al., 2024); or (iii) measuring the semantic consistency across multiple outputs sampled from the same prompt (Manakul et al., 2023; Kuhn et al., 2023; Zhang et al., 2023a; Ding et al., 2024). A response is typically flagged as a hallucination if its associated confidence metric falls below a predetermined threshold.
However, a growing body of work reveals a critical limitation: even state-of-the-art LLMs are poorly calibrated, meaning their expressed confidence often fails to align with the factual accuracy of their generations (Kapoor et al., 2024; Xiong et al., 2024; Tian et al., 2023). This miscalibration limits the effectiveness of confidence-based detectors and raises a fundamental question about the extent of LLMs’ self-awareness of their knowledge boundary, i.e., whether they can “know what they don’t know” Yin et al. (2023); Li et al. (2025). Despite recognizing this problem, prior work does not provide a mechanistic explanation for its occurrence. To this end, our work addresses this explanatory gap by employing mechanistic interpretability techniques to trace the internal computations underlying knowledge recall within LLMs.
3 Preliminary
Transformer Architecture
Given an input sequence of $T$ tokens $t_{1},\dots,t_{T}$, an LLM is trained to model the conditional probability distribution of the next token $p(t_{T+1}|t_{1},\dots,t_{T})$ conditioned on the preceding $T$ tokens. Each token is first mapped to a continuous vector by an embedding layer. The resulting sequence of hidden states is then processed by a stack of $L$ Transformer layers. At layer $\ell \in \{1,\dots,L\}$, each token representation is updated by a Multi-Head Self-Attention (MHSA) and a Feed-Forward Network (MLP) module:
$$
\mathbf{h}^{\ell}=\mathbf{h}^{\ell-1}+\mathbf{a}^{\ell}+\mathbf{m}^{\ell}, \tag{1}
$$
where $\mathbf{a}^{\ell}$ and $\mathbf{m}^{\ell}$ denote the MHSA and MLP outputs, respectively, at layer $\ell$.
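As a concrete illustration, the residual-stream update in Equation (1) can be sketched in a few lines. This is a minimal sketch: `attn_fn` and `mlp_fn` are stand-ins for the layer's actual MHSA and MLP sublayers, and layer normalization is omitted.

```python
import numpy as np

def transformer_layer(h_prev, attn_fn, mlp_fn):
    """Residual-stream update of Eq. (1): h^l = h^{l-1} + a^l + m^l.

    attn_fn and mlp_fn are placeholders for the layer's MHSA and MLP
    sublayers; normalization is omitted for clarity.
    """
    a = attn_fn(h_prev)      # MHSA output a^l
    m = mlp_fn(h_prev + a)   # MLP output m^l, reading the attention-updated stream
    return h_prev + a + m    # each sublayer writes additively into the stream
```

Because each sublayer only adds to the running hidden state, one can intervene on a single term ($\mathbf{a}^{\ell}$, $\mathbf{m}^{\ell}$, or a token's $\mathbf{h}^{\ell}$) while leaving the rest of the computation intact, which is what makes the causal patching analysis in §4.1 possible.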
Internal Process of Knowledge Recall
Prior work investigates the internal activations of LLMs to study the mechanics of knowledge recall. For example, an LLM may encode many attributes that are associated with a subject (e.g., Barack Obama) (Geva et al., 2023). Given a prompt like “Barack Obama was born in the city of”, if the model has correctly encoded the fact, the attribute “Honolulu” propagates through self-attention to the last token, yielding the correct answer. We hypothesize that non-factual predictions follow the same mechanism: spurious attributes such as “Chicago” are also encoded and propagated, leading the model to generate false outputs.
Categorization of Knowledge
To investigate how LLMs internally process factual queries, we define three categories of knowledge, according to two criteria: 1) factual correctness, and 2) subject representation reliance.
- Factual Associations (FA) refer to factual knowledge that is reliably stored in the parameters or internal states of an LLM and can be recalled to produce correct, verifiable outputs.
- Associated Hallucinations (AH) refer to non-factual content produced when an LLM relies on input-triggered parametric associations.
- Unassociated Hallucinations (UH) refer to non-factual content produced without reliance on parametric associations to the input.
[Figure 2(a) image (x2.png): heatmap of Avg JS Divergence (scale 0.2–0.6) over layers 0–31 for three intervention sites (Subj., Attn., Last.); Subj. divergence is high (~0.6) in layers 0–15 then drops to baseline, Attn. shows a mild mid-layer peak (~layers 11–15), and Last. rises through the final third with a jump at layer 31.]
(a) Factual Associations
[Figure 2(b) image (x3.png): same heatmap layout for associated hallucinations; Subj. stays high (~0.6) through layers 0–15, Attn. bumps to ~0.35–0.4 around layers 13–15, and Last. climbs to ~0.55 by layer 31, mirroring the factual-association pattern.]
(b) Associated Hallucinations
[Figure 2(c) image (x4.png): same heatmap layout for unassociated hallucinations; Subj. divergence is weaker (~0.35–0.45) in layers 0–15 and near baseline (~0.2) afterward, while Attn. and Last. remain flat near 0.2 across all layers.]
(c) Unassociated Hallucinations
Figure 2: Effect of interventions across layers of LLaMA-3-8B. The heatmap shows JS divergence between the output distribution before and after intervention. Darker color indicates that the intervened hidden states are more causally influential on the model’s predictions. Top row: patching representations of subject tokens. Middle row: blocking attention flow from subject to the last token. Bottom row: patching representations of the last token.
Dataset Construction
| Category | LLaMA-3-8B | Mistral-v0.3 |
| :--- | ---: | ---: |
| Factual Association | 3,506 | 3,354 |
| Associated Hallucination | 1,406 | 1,284 |
| Unassociated Hallucination | 7,381 | 7,655 |
| Total | 12,293 | 12,293 |
Table 1: Dataset statistics across categories.
Our study is conducted under a basic knowledge-based question answering setting. The model is given a prompt containing a subject and relation (e.g., “Barack Obama was born in the city of”) and is expected to predict the corresponding object (e.g., “Honolulu”). To build the dataset, we collect knowledge triples $(\text{subject},\text{relation},\text{object})$ from Wikidata. Each relation is paired with a handcrafted prompt template to convert triples into natural language queries. The details of relation selection and prompt templates are provided in Appendix A.1. We then apply the labeling scheme presented in Appendix A.2: correct predictions are labeled as FAs, while incorrect ones are classified as AHs or UHs depending on their subject representation reliance. Table 1 summarizes the final data statistics.
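A minimal sketch of this construction pipeline follows; the template strings and the `relies_on_subject` flag are hypothetical placeholders (in the paper, the reliance test follows the labeling scheme of Appendix A.2).

```python
# Hypothetical templates pairing a Wikidata relation with a prompt.
TEMPLATES = {
    "place_of_birth": "{subject} was born in the city of",
    "educated_at": "{subject} studied in the city of",
}

def build_query(subject: str, relation: str) -> str:
    """Convert a (subject, relation) pair into a natural-language query."""
    return TEMPLATES[relation].format(subject=subject)

def label(prediction: str, gold_object: str, relies_on_subject: bool) -> str:
    """Label a model prediction as FA, AH, or UH."""
    if prediction == gold_object:
        return "FA"  # Factual Association: correct prediction
    # Incorrect predictions split by reliance on subject representations.
    return "AH" if relies_on_subject else "UH"
```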
Models
We conduct the experiments on two widely-adopted open-source LLMs, LLaMA-3 Dubey et al. (2024) and Mistral-v0.3 Jiang et al. (2023). Due to the space limit, details are presented in Appendix A.3, and parallel experimental results on Mistral are summarized in Appendix B.
4 Analysis of Internal States in LLMs
To focus our analysis, we first conduct causal interventions to identify hidden states that are crucial for eliciting factual associations (FAs). We then compare their behavior across associated hallucinations (AHs) and unassociated hallucinations (UHs). Prior studies Azaria and Mitchell (2023); Gottesman and Geva (2024); Yüksekgönül et al. (2024); Orgad et al. (2025) suggest that hidden states can reveal when a model hallucinates. This assumes that the model’s internal computations differ when producing correct versus incorrect outputs, causing their hidden states to occupy distinct subspaces. We revisit this claim by examining how hidden states update when recalling three categories of knowledge (i.e., FAs, AHs, and UHs). If hidden states primarily signal hallucination, AHs and UHs should behave similarly and diverge from FAs. Conversely, if hidden states reflect reliance on encoded knowledge, FAs and AHs should appear similar, and both should differ from UHs.
4.1 Causal Analysis of Information Flow
We identify hidden states that are crucial for factual prediction. For each knowledge tuple (subject, relation, object), the model is prompted with a factual query (e.g., “The name of the father of Joe Biden is”). Correct predictions indicate that the model successfully elicits parametric knowledge. Using causal mediation analysis Vig et al. (2020); Finlayson et al. (2021); Meng et al. (2022); Geva et al. (2023), we intervene on intermediate computations and measure the change in output distribution via JS divergence. A large divergence indicates that the intervened computation is critical for producing the fact. Specifically, to test whether token $i$’s hidden states in the MLP at layer $\ell$ are crucial for eliciting knowledge, we replace the computation with a corrupted version and observe how the output distribution changes. Similarly, following Geva et al. (2023), we mask the attention flow between tokens at layer $\ell$ using a window size of 5 layers. To streamline implementation, interventions target only subject tokens, attention flow, and the last token. Notable observations are as follows:
Obs1: Hidden states crucial for eliciting factual associations.
The results in Figure 2(a) show that three components dominate factual predictions: (1) subject representations in early-layer MLPs, (2) mid-layer attention between subject tokens and the final token, and (3) the final token representations in later layers. These results trace a clear information flow: subject representation, attention flow from the subject to the last token, and last-token representation, consistent with Geva et al. (2023). Each of these three types of internal states is discussed in detail below (§4.2–4.4).
Obs2: Associated hallucinations follow the same information flow as factual associations.
When generating AHs, interventions on these same components also produce large distribution shifts (Figure 2(b)). This indicates that, although outputs are factually wrong, the model still relies on encoded subject information.
Obs3: Unassociated hallucinations present a different information flow.
In contrast, interventions during UH generation cause smaller distribution shifts (Figure 2(c)), showing weaker reliance on the subject. This suggests that UHs emerge from computations not anchored in the subject representation, different from both FAs and AHs.
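The intervention effect reported in Figure 2 is measured as the JS divergence between the clean and patched next-token distributions; a minimal sketch of that measure (base 2, so the value is bounded in [0, 1]):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two next-token distributions."""
    p = np.asarray(p, dtype=float) + eps  # smooth to avoid log(0)
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)                     # mixture distribution
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions yield a divergence near 0 and disjoint ones approach 1, so a large value after patching a component marks that component as causally influential.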
4.2 Analysis of Subject Representations
The analysis in § 4.1 reveals that unassociated hallucinations (UHs) are processed differently from factual associations (FAs) and associated hallucinations (AHs) in the early layers of LLMs, which share a similar pattern. We examine how these differences emerge in the subject representations and why early-layer modules behave this way.
4.2.1 Norm of Subject Representations
[Figure 3 image (x5.png): line plot of norm ratio vs. layer (0–31); the AH/FA curve stays near 1.00 throughout, while the UH/FA curve starts around 0.97, dips to ~0.94 near layer 12, recovers to ~0.98 by layer 19, and spikes to ~1.02 at the final layer.]
Figure 3: Norm ratio curves of subject representations in LLaMA-3-8B, comparing AHs and UHs against FAs as the baseline.
To test whether subject representations differ across categories, we measure the average $L_{2}$ norm of subject-token hidden activations across layers. For subject tokens $t_{s_{1}},\ldots,t_{s_{n}}$ at layer $\ell$, the average norm is $\|\mathbf{h}_{s}^{\ell}\|=\tfrac{1}{n}\sum_{i=1}^{n}\|\mathbf{h}_{s_{i}}^{\ell}\|_{2}$, computed by Equation (1). We compare the norm ratio between hallucination samples (AHs or UHs) and correct predictions (FAs), where a ratio near 1 indicates similar norms. Figure 3 shows that in LLaMA-3-8B, AH norms closely match those of correct samples (ratio $\approx 0.99$), while UH norms are consistently smaller, starting at the first layer (ratio $\approx 0.96$) and diverging further through mid-layers.
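As a minimal sketch (not the authors' released code), the norm-ratio statistic can be computed from per-layer hidden states, assuming they are available as NumPy arrays of shape `(num_subject_tokens, d_model)`:

```python
import numpy as np

def avg_subject_norm(h_subj: np.ndarray) -> float:
    """Average L2 norm over subject-token hidden states at one layer.

    h_subj: (num_subject_tokens, d_model) hidden states.
    """
    return float(np.linalg.norm(h_subj, axis=-1).mean())

def norm_ratio(h_halluc: np.ndarray, h_factual: np.ndarray) -> float:
    """Ratio of average subject norms: hallucination samples vs. factual baseline."""
    return avg_subject_norm(h_halluc) / avg_subject_norm(h_factual)

# Toy check: uniformly scaling the hidden states by 0.96 gives a ratio of 0.96.
rng = np.random.default_rng(0)
h_fa = rng.normal(size=(4, 8))
print(round(norm_ratio(0.96 * h_fa, h_fa), 2))  # -> 0.96
```

In practice the ratio would be averaged over many samples per category before plotting, as in Figure 3.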
Findings:
At early layers, UH subject representations exhibit weaker activations than FAs, whereas AHs exhibit norms similar to FAs.
4.2.2 Relation to Parametric Knowledge
<details>
<summary>x6.png Details</summary>

### Visual Description
# Technical Document Extraction: Model Hallucination Ratio Comparison
## 1. Component Isolation
* **Header:** None present.
* **Main Chart Area:** A grouped bar chart comparing two Large Language Models (LLMs) across two specific metrics.
* **Footer/Legend:** Located at the bottom center, containing two color-coded categories.
## 2. Axis and Label Extraction
* **Y-Axis Title:** "Ratio" (Vertical orientation).
* **Y-Axis Markers:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 (Linear scale with light grey horizontal grid lines).
* **X-Axis Categories:**
* **LLaMA-3-8B**
* **Mistral-7B-v0.3**
* **Legend Labels (Spatial Grounding: Bottom Center):**
* **Red Bar:** "Unasso. Hallu./Factual Asso." (Unassociated Hallucination / Factual Association)
* **Blue Bar:** "Asso. Hallu./Factual Asso." (Associated Hallucination / Factual Association)
## 3. Data Table Reconstruction
Based on the visual alignment with the Y-axis grid lines, the following values are extracted:
| Model | Unasso. Hallu./Factual Asso. (Red) | Asso. Hallu./Factual Asso. (Blue) |
| :--- | :---: | :---: |
| **LLaMA-3-8B** | ~0.68 | ~1.08 |
| **Mistral-7B-v0.3** | ~0.38 | ~0.80 |
## 4. Trend Verification and Analysis
* **Trend 1 (Inter-Model Comparison):** Mistral-7B-v0.3 exhibits lower ratios across both metrics compared to LLaMA-3-8B. Specifically, the red bar for Mistral is significantly lower (nearly half) than that of LLaMA.
* **Trend 2 (Intra-Model Comparison):** For both models, the "Associated Hallucination" ratio (Blue) is consistently higher than the "Unassociated Hallucination" ratio (Red).
* **Trend 3 (Magnitude):** LLaMA-3-8B's blue bar is the only data point exceeding the 1.0 ratio mark, indicating that associated hallucinations occur more frequently than factual associations in this specific context.
## 5. Detailed Description
This image is a grouped bar chart titled by its axes as a "Ratio" comparison between two AI models: **LLaMA-3-8B** and **Mistral-7B-v0.3**.
The chart uses a light grey background with horizontal grid lines every 0.2 units. The data is presented in two clusters. The first cluster (LLaMA-3-8B) shows a red bar reaching approximately 0.68 and a blue bar reaching approximately 1.08. The second cluster (Mistral-7B-v0.3) shows a red bar reaching approximately 0.38 and a blue bar reaching exactly 0.80.
The legend indicates that the red bars represent the ratio of unassociated hallucinations to factual associations, while the blue bars represent the ratio of associated hallucinations to factual associations. The overall data suggests that Mistral-7B-v0.3 performs better (lower hallucination ratios) than LLaMA-3-8B in both measured categories.
</details>
Figure 4: Comparison of subspace overlap ratios.
We next investigate why early layers encode subject representations differently across knowledge types by examining how inputs interact with the parametric knowledge stored in MLP modules. Inspired by Kang et al. (2024), we note that the output norm of an MLP layer depends on how well its input aligns with the subspace spanned by its weight matrix: poorly aligned inputs yield smaller output norms.
For each MLP layer $\ell$ , we analyze the down-projection weight matrix $W_{\text{down}}^{\ell}$ and its input $x^{\ell}$ . Given the input $x_{s}^{\ell}$ corresponding to the subject tokens, we compute its overlap ratio with the top singular subspace $V_{\text{top}}$ of $W_{\text{down}}^{\ell}$ :
$$
r(x_{s}^{\ell})=\frac{\left\lVert{x_{s}^{\ell}}^{\top}V_{\text{top}}V_{\text{top}}^{\top}\right\rVert^{2}}{\left\lVert x_{s}^{\ell}\right\rVert^{2}}. \tag{2}
$$
A higher overlap ratio $r(x_{s}^{\ell})$ indicates stronger alignment to the subspace spanned by $W_{\text{down}}^{\ell}$ , leading to larger output norms.
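Equation (2) can be sketched in NumPy as below; the subspace size `k` is a hyperparameter not specified here, so the value in the example is illustrative:

```python
import numpy as np

def overlap_ratio(x: np.ndarray, W_down: np.ndarray, k: int) -> float:
    """Eq. (2): fraction of ||x||^2 captured by the top-k input subspace of W_down.

    x:      (d_in,) MLP down-projection input at a subject-token position
    W_down: (d_out, d_in) weight matrix, acting as y = W_down @ x
    k:      number of top right-singular directions kept (illustrative choice)
    """
    _, _, Vt = np.linalg.svd(W_down, full_matrices=False)
    V_top = Vt[:k].T                # (d_in, k): input directions the layer amplifies
    proj = V_top @ (V_top.T @ x)    # projection of x onto the top subspace
    return float((proj @ proj) / (x @ x))

# Inputs aligned with the dominant singular direction keep most of their energy.
W = np.diag([3.0, 2.0, 1.0])
print(overlap_ratio(np.array([1.0, 0.0, 0.0]), W, k=1))  # -> 1.0
```

An input orthogonal to the top subspace gives a ratio near zero and is correspondingly attenuated by the layer.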
To highlight relative deviations from the factual baseline (FA), we report the relative ratios AH/FA and UH/FA. Focusing on the layer with the largest UH norm shift, Figure 4 shows that UHs have significantly lower $r(x_{s}^{\ell})$ than AHs in both LLaMA and Mistral. This reveals that early-layer parametric weights are more aligned with FA and AH subject representations than with UH subjects, producing larger norms for the former. These results also suggest that the model sufficiently learned representations for FA and AH subjects during pretraining, but not for UH subjects.
Findings:
Similar to FAs, AH hidden activations align closely with the weight subspace, while UHs do not. This indicates that the model has sufficiently encoded subject representations into parametric knowledge for FAs and AHs but not for UHs.
4.2.3 Correlation with Subject Popularity
<details>
<summary>x7.png Details</summary>

### Visual Description
# Technical Document Extraction: Hallucination and Association Analysis
## 1. Image Overview
This image is a grouped bar chart illustrating the relationship between three categories of data—**Factual Associations**, **Associated Hallucinations**, and **Unassociated Hallucinations**—across three distinct levels of a variable labeled as **Low**, **Mid**, and **High**.
## 2. Component Isolation
### A. Header / Title
* No explicit title is present within the image frame.
### B. Main Chart Area
* **Y-Axis Label:** Percentage (%)
* **Y-Axis Scale:** 0 to 100, with major tick marks every 20 units (0, 20, 40, 60, 80, 100).
* **X-Axis Categories:** Low, Mid, High.
* **Grid:** Light gray horizontal and vertical grid lines are present.
* **Data Labels:** Numerical percentages are printed directly above each bar for precision.
### C. Legend (Footer)
* **Location:** Bottom center of the image.
* **Green (Hex approx #58D68D):** Factual Associations
* **Blue (Hex approx #5DADE2):** Associated Hallucinations
* **Red/Salmon (Hex approx #EC7063):** Unassociated Hallucinations
---
## 3. Data Table Reconstruction
| Category (X-Axis) | Factual Associations (Green) | Associated Hallucinations (Blue) | Unassociated Hallucinations (Red) |
| :--- | :---: | :---: | :---: |
| **Low** | 5% | 1% | 94% |
| **Mid** | 27% | 7% | 66% |
| **High** | 52% | 14% | 34% |
---
## 4. Trend Verification and Analysis
### Series 1: Factual Associations (Green)
* **Visual Trend:** Slopes upward significantly from left to right.
* **Description:** As the level moves from Low to High, Factual Associations increase more than tenfold, starting at a negligible 5% and reaching a majority share of 52%.
### Series 2: Associated Hallucinations (Blue)
* **Visual Trend:** Slopes upward gradually.
* **Description:** This series represents the smallest portion of the data in all categories but shows a consistent upward trend, increasing from 1% at the Low level to 14% at the High level.
### Series 3: Unassociated Hallucinations (Red)
* **Visual Trend:** Slopes downward sharply.
* **Description:** This series dominates the "Low" category at 94%. However, it decreases steadily as the level increases, dropping to 66% at Mid and further to 34% at High.
---
## 5. Key Findings
* **Inverse Correlation:** There is a clear inverse relationship between "Factual Associations" and "Unassociated Hallucinations." As one increases, the other decreases.
* **Dominance Shift:** At the **Low** level, Unassociated Hallucinations are the overwhelming majority (94%). By the **High** level, Factual Associations become the primary category (52%), though Unassociated Hallucinations still maintain a significant presence (34%).
* **Hallucination Types:** "Unassociated Hallucinations" are consistently more prevalent than "Associated Hallucinations" across all three measured levels, though the gap narrows significantly at the "High" level.
</details>
Figure 5: Sample distribution across different subject popularity (low, mid, high) in LLaMA-3-8B, measured by monthly Wikipedia page views.
We further investigate why AH representations align with weight subspaces as strongly as FAs, while UHs do not. A natural hypothesis is that this difference arises from subject popularity in the training data. We use average monthly Wikipedia page views as a proxy for subject popularity during pre-training, bin subjects by popularity, and then measure the distribution of UHs, AHs, and FAs within each bin. Figure 5 shows a clear trend: UHs dominate among the least popular subjects (94% for LLaMA), while AHs are rare (1%). As subject popularity rises, UH frequency falls and both FAs and AHs become more common, with AHs rising to 14% in the high-popularity bin. This indicates that subject representation norms reflect training frequency, not factual correctness.
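A sketch of this binning procedure is below; the view-count thresholds in `edges` are hypothetical placeholders, since the actual bin boundaries are not given here:

```python
import numpy as np

def popularity_distribution(views, labels, edges=(1e3, 1e5)):
    """Bin subjects into low/mid/high popularity and report the category mix per bin.

    views:  monthly Wikipedia page views per sample (proxy for training frequency)
    labels: category per sample, one of "FA", "AH", "UH"
    edges:  hypothetical view-count thresholds separating the three bins
    """
    views = np.asarray(views, dtype=float)
    labels = np.asarray(labels)
    bins = np.digitize(views, edges)  # 0 = low, 1 = mid, 2 = high
    out = {}
    for b, name in enumerate(("low", "mid", "high")):
        mask = bins == b
        total = mask.sum()
        # Fraction of each category within this popularity bin
        out[name] = {c: float((labels[mask] == c).mean()) if total else 0.0
                     for c in ("FA", "AH", "UH")}
    return out
```

Applied to real samples, this yields the per-bin percentages of the kind plotted in Figure 5.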
Findings:
Popular subjects yield stronger early-layer activations. AHs arise mainly on popular subjects and are therefore indistinguishable from FAs by popularity-based heuristics, contradicting prior work Mallen et al. (2023a) that links popularity to hallucinations.
4.3 Analysis of Attention Flow
Having examined how the model forms subject representations, we next study how this information is propagated to the last token of the input, where the model generates the object of a knowledge tuple. To produce factually correct outputs at the last token, the model must process the subject representation and propagate it via attention layers so that it can be read from the last position to produce the outputs Geva et al. (2023).
To quantify the contribution of subject tokens $(s_{1},\ldots,s_{n})$ to the last token, we compute the attention contribution from the subject tokens to the last position:
$$
\mathbf{a}^{\ell}_{\text{last}}=\sum\nolimits_{k}\sum\nolimits_{h}A^{\ell,h}_{\text{last},s_{k}}(\mathbf{h}^{\ell-1}_{s_{k}}W^{\ell,h}_{V})W^{\ell,h}_{O}. \tag{3}
$$
where $A^{\ell,h}_{i,j}$ denotes the attention weight assigned by the $h$-th head in layer $\ell$ from the last position $i$ to subject token $j$. Here, $\mathbf{a}^{\ell}_{\text{last}}$ represents the subject-to-last attention contribution at layer $\ell$. Intuitively, if subject information is critical for the prediction, this contribution should have a large norm; otherwise, the norm should be small.
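Equation (3) can be sketched as follows, assuming per-head attention weights and per-head value/output projection slices have already been extracted (the tensor shapes are assumptions for illustration, not the model's actual layout):

```python
import numpy as np

def subject_to_last_contribution(A, h_prev, subj_idx, W_V, W_O):
    """Eq. (3): attention contribution from subject tokens to the last position.

    A:        (n_heads, seq, seq)      attention weights at layer l
    h_prev:   (seq, d_model)           hidden states from layer l-1
    subj_idx: positions of the subject tokens s_1..s_n
    W_V:      (n_heads, d_model, d_head) per-head value projections
    W_O:      (n_heads, d_head, d_model) per-head slices of the output projection
    """
    last = A.shape[-1] - 1
    contrib = np.zeros(h_prev.shape[-1])
    for h in range(A.shape[0]):            # sum over heads
        for k in subj_idx:                 # sum over subject tokens
            v = h_prev[k] @ W_V[h]         # value vector of subject token k
            contrib += A[h, last, k] * (v @ W_O[h])
    return contrib                         # np.linalg.norm(contrib) = flow strength
```

The per-layer norm of this vector, averaged over samples in each category, gives the curves in Figure 6.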
Figure 6 shows that in LLaMA-3-8B, both AHs and FAs exhibit large attention-contribution norms in mid-layers, indicating a strong information flow from subject tokens to the target token. In contrast, UHs show consistently lower norms, implying that their predictions rely far less on subject information. Yüksekgönül et al. (2024) previously argued that high attention flow from subject tokens signals factuality and proposed using attention-based hidden states to detect hallucinations. Our results challenge this view: the model propagates subject information just as strongly when generating AHs as when producing correct facts.
Findings:
Mid-layer attention flow from subject to last token is equally strong for AHs and FAs but weak for UHs. Attention-based heuristics can therefore separate UHs from FAs but cannot distinguish AHs from factual outputs, limiting their reliability for hallucination detection.
<details>
<summary>x8.png Details</summary>

### Visual Description
# Technical Document Extraction: Layer-wise Norm Analysis
## 1. Image Overview
This image is a line graph illustrating the relationship between neural network layers and a "Norm" metric across three distinct categories of data. The chart uses a coordinate grid system with markers to denote specific data points.
## 2. Component Isolation
### Header / Legend
* **Location:** Top-left quadrant (approximate [x, y] coordinates: [0.05, 0.95] relative to the plot area).
* **Legend Items:**
* **Green Line with Triangle Markers ($\blacktriangle$):** "Factual Asso."
* **Blue Line with Circle Markers ($\bullet$):** "Asso. Hallu."
* **Red Line with Square Markers ($\blacksquare$):** "Unasso. Hallu."
### Main Chart Area
* **X-Axis Label:** "Layer"
* **X-Axis Scale:** 0 to 32 (increments marked every 5 units: 0, 5, 10, 15, 20, 25, 30).
* **Y-Axis Label:** "Norm"
* **Y-Axis Scale:** 0.0 to 2.0 (increments marked every 0.5 units: 0.0, 0.5, 1.0, 1.5, 2.0).
* **Grid:** Major grid lines are present for both X and Y axes.
---
## 3. Data Series Analysis and Trend Verification
### Series 1: Factual Asso. (Green, Triangles)
* **Trend Description:** This series remains relatively flat and low (below 0.5) from Layer 0 to Layer 17. At Layer 18, it exhibits a massive spike, followed by a series of high-amplitude oscillations between Layers 18 and 32.
* **Key Data Points (Approximate):**
| Layer | Norm (Approx.) |
| :--- | :--- |
| 0–17 | 0.1 – 0.4 |
| 18 | 1.8 |
| 20 | 0.4 |
| 22 | 1.25 |
| 24 | 1.1 |
| 27 | 1.0 |
### Series 2: Asso. Hallu. (Blue, Circles)
* **Trend Description:** This series tracks very closely with "Factual Asso." throughout the entire range. It remains low until Layer 17, then experiences the highest peaks in the graph starting at Layer 18.
* **Key Data Points (Approximate):**
| Layer | Norm (Approx.) |
| :--- | :--- |
| 0–17 | 0.1 – 0.4 |
| 18 | 1.95 |
| 19 | 1.4 |
| 22 | 1.35 |
| 24 | 1.3 |
| 27 | 1.15 |
### Series 3: Unasso. Hallu. (Red, Squares)
* **Trend Description:** This series shows much lower variance than the other two. While it also experiences a rise in activity after Layer 17, the magnitude of its peaks is significantly dampened compared to the "Associated" categories.
* **Key Data Points (Approximate):**
| Layer | Norm (Approx.) |
| :--- | :--- |
| 0–17 | 0.1 – 0.3 |
| 18 | 0.65 |
| 24 | 0.65 |
| 27 | 0.65 |
| General | Rarely exceeds 0.7 |
---
## 4. Comparative Observations
* **Phase Shift:** There is a clear behavioral shift in the model after **Layer 17**. Prior to this, all three categories behave similarly with low Norm values.
* **Association Correlation:** The "Factual Asso." (Green) and "Asso. Hallu." (Blue) series are highly correlated, often peaking and dipping at the same layers with similar magnitudes.
* **Unassociated Divergence:** The "Unasso. Hallu." (Red) series diverges significantly from the other two in the latter half of the network (Layers 18–32), maintaining a much lower Norm profile despite following a similar oscillatory pattern.
* **Peak Layers:** Layers 18, 22, 24, and 27 represent significant "spikes" in Norm values for all categories, though the intensity varies by category.
</details>
Figure 6: Subject-to-last attention contribution norms across layers in LLaMA-3-8B. Values show the norm of the attention contribution from subject tokens to the last token at each layer.
4.4 Analysis of Last Token Representations
Our earlier analysis showed strong subject-to-last token information transfer for both FAs and AHs, but minimal transfer for UHs. We now examine how this difference shapes the distribution of last-token representations. When subject information is weakly propagated (UHs), last-token states receive little subject-specific update. For UH samples sharing the same prompt template, these states should therefore cluster in the representation space. In contrast, strong subject-driven propagation in FAs and AHs produces diverse last-token states that disperse into distinct subspaces.
To test this, we compute cosine similarity among last-token representations $\mathbf{h}_{T}^{\ell}$. As shown in Figure 7, similarity is high ($\approx 0.9$) for all categories in early layers, when little subject information has been transferred. From mid-layers onward, FAs and AHs diverge sharply, dropping to $\approx 0.2$ by layer 25. UHs remain moderately clustered, with similarity declining only to $\approx 0.5$.
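This clustering statistic can be sketched as the mean pairwise cosine similarity over a batch of last-token states (a sketch, not necessarily the authors' exact implementation):

```python
import numpy as np

def mean_pairwise_cosine(H: np.ndarray) -> float:
    """Average pairwise cosine similarity among last-token hidden states.

    H: (n_samples, d_model) last-token representations at one layer.
    """
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize rows
    S = Hn @ Hn.T                                      # full cosine-similarity matrix
    iu = np.triu_indices(len(H), k=1)                  # upper triangle: skip self-pairs
    return float(S[iu].mean())
```

A value near 1 indicates tightly clustered states (as for UHs), while a low value indicates dispersal into distinct subspaces (as for FAs and AHs in mid-to-late layers).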
Figure 8 shows the t-SNE visualization of the last token's representations at layer 25 of LLaMA-3-8B. The hidden representations of UHs are clearly separated from FAs, whereas those of AHs substantially overlap with FAs. These results indicate that the model processes UHs differently from FAs, while processing AHs in a manner similar to FAs. More visualizations can be found in Appendix C.
<details>
<summary>x9.png Details</summary>

### Visual Description
# Technical Document Extraction: Cosine Similarity across Model Layers
## 1. Image Overview
This image is a line graph illustrating the relationship between model depth (Layers) and the Cosine Similarity of three distinct data categories. The chart uses a grid background for precise value estimation.
## 2. Component Isolation
### A. Header/Axes
* **Y-Axis Label:** Cosine Similarity (Vertical, left side).
* **Y-Axis Markers:** 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9.
* **X-Axis Label:** Layers (Horizontal, bottom center).
* **X-Axis Markers:** 0, 5, 10, 15, 20, 25, 30.
### B. Legend (Spatial Grounding: Bottom-Left [x≈0.1, y≈0.2])
The legend identifies three data series:
1. **Factual Associations:** Green line with triangle markers ($\triangle$).
2. **Associated Hallucinations:** Blue line with circle markers ($\bigcirc$).
3. **Unassociated Hallucinations:** Pink/Light Red line with square markers ($\square$).
---
## 3. Data Series Analysis and Trend Verification
### Series 1: Factual Associations (Green, Triangles)
* **Trend:** Starts high (~0.95), remains relatively stable with a slight downward drift until Layer 15. It then experiences a sharp, steep decline reaching a nadir at Layer 25, followed by a moderate recovery in the final layers.
* **Key Data Points:**
* Layer 0: ~0.95
* Layer 15: ~0.78
* Layer 25: ~0.26 (Minimum)
* Layer 31: ~0.48
### Series 2: Associated Hallucinations (Blue, Circles)
* **Trend:** Closely tracks the "Factual Associations" series for the first 15 layers. Between Layers 15 and 20, it maintains a slightly higher similarity than the factual series before following the same sharp decline. It reaches the lowest absolute similarity of all three groups around Layer 27.
* **Key Data Points:**
* Layer 0: ~0.95
* Layer 12: ~0.90 (Local peak)
* Layer 27: ~0.24 (Absolute Minimum)
* Layer 31: ~0.41
### Series 3: Unassociated Hallucinations (Pink, Squares)
* **Trend:** Starts at the same high point (~0.95). While it follows the general downward trend of the other two, its decline is significantly less severe. After Layer 17, it diverges upward from the other two series, maintaining a much higher cosine similarity through the middle and late layers.
* **Key Data Points:**
* Layer 0: ~0.95
* Layer 17: ~0.70
* Layer 25: ~0.52 (Minimum)
* Layer 31: ~0.68
---
## 4. Comparative Observations
* **Initial Convergence:** All three categories begin with nearly identical cosine similarity (~0.95) at Layer 0 and stay tightly clustered until approximately Layer 13.
* **The Divergence Point:** Significant divergence occurs after Layer 15.
* **The "U-Shape" Phenomenon:** All three series exhibit a "U-shaped" curve, where similarity drops in the middle-to-late layers (20-27) and begins to rise again toward the final layer (31).
* **Separation of Hallucinations:** "Unassociated Hallucinations" maintain a consistently higher similarity in the later layers (Layers 20-31) compared to both "Factual Associations" and "Associated Hallucinations." The "Associated Hallucinations" and "Factual Associations" remain closely coupled throughout the entire model depth.
</details>
Figure 7: Cosine similarity of target-token hidden states across layers in LLaMA-3-8B.
<details>
<summary>x10.png Details</summary>

### Visual Description
# Technical Document Extraction: t-SNE Visualization of Hallucination Types
## 1. Image Overview
This image is a 2D scatter plot, likely generated using a dimensionality reduction technique such as t-SNE or UMAP. It visualizes the clustering behavior of three distinct categories of data points based on their semantic or feature-based embeddings.
## 2. Component Isolation
### Header / Legend
* **Location:** Top-left quadrant (approximate [x, y] coordinates: [10, 85] in percentage of frame).
* **Legend Content:**
* **Green Circle ($\bullet$):** Factual Asso. (Factual Association)
* **Blue Circle ($\bullet$):** Asso. Hallu. (Associative Hallucination)
* **Red Circle ($\bullet$):** Unasso. Hallu. (Unassociated Hallucination)
### Axis Configuration
* **X-Axis:** Numerical scale ranging from approximately **-30 to +35**. Major tick marks are labeled at **-20, -10, 0, 10, 20, 30**.
* **Y-Axis:** Numerical scale ranging from approximately **-30 to +30**. Major tick marks are labeled at **-30, -20, -10, 0, 10, 20, 30**.
* **Note:** The axes represent abstract dimensions typical of manifold learning visualizations and do not have specific units.
---
## 3. Data Series Analysis and Trends
### Series 1: Factual Asso. (Green)
* **Visual Trend:** This series is primarily concentrated in a large, dense central cluster but is highly interleaved with the "Asso. Hallu." series.
* **Spatial Distribution:**
* **Primary Cluster:** Located between X: [-25, 5] and Y: [-30, 15].
* **Outliers:** A few green points are scattered toward the upper right (near X: 15, Y: 10 and X: 28, Y: 21), showing some overlap with the red cluster.
* **Observation:** The high degree of overlap with blue points suggests that "Factual Association" and "Associative Hallucination" share very similar feature spaces.
### Series 2: Asso. Hallu. (Blue)
* **Visual Trend:** This series follows a distribution almost identical to the green series, forming a unified central-left mass.
* **Spatial Distribution:**
* **Primary Cluster:** Concentrated between X: [-25, 5] and Y: [-30, 15].
* **Secondary Grouping:** A small, distinct "tail" or sub-cluster of blue points is visible on the right side between X: [5, 15] and Y: [-10, -5].
* **Observation:** The blue points act as a bridge between the main factual cluster and the lower-right region of the plot.
### Series 3: Unasso. Hallu. (Red)
* **Visual Trend:** This series exhibits a distinct "bimodal" distribution. While some points are scattered within the main green/blue cluster, a significant majority forms a separate, isolated cluster in the top-right.
* **Spatial Distribution:**
* **Isolated Cluster:** A dense concentration of red points is located between X: [10, 32] and Y: [10, 28]. This cluster has very little interference from green or blue points.
* **Scattered Points:** Several red points are interspersed within the main cluster (X: [-25, 0], Y: [-25, 10]).
* **Observation:** The clear separation of the top-right red cluster indicates that "Unassociated Hallucinations" possess distinct features that differentiate them significantly from both factual data and associative hallucinations.
---
## 4. Summary of Findings
| Category | Color | Primary Location (X, Y) | Clustering Behavior |
| :--- | :--- | :--- | :--- |
| **Factual Asso.** | Green | [-25 to 5, -30 to 15] | Highly integrated with Asso. Hallu. |
| **Asso. Hallu.** | Blue | [-25 to 5, -30 to 15] | Highly integrated with Factual Asso.; small sub-cluster at [10, -8]. |
| **Unasso. Hallu.** | Red | [10 to 32, 10 to 28] | Forms a distinct, isolated cluster in the upper-right quadrant. |
**Conclusion:** The visualization demonstrates that while Factual Associations and Associative Hallucinations are difficult to distinguish in this feature space, Unassociated Hallucinations form a statistically separable group, particularly in the positive X and positive Y coordinate space.
</details>
Figure 8: t-SNE visualization of last token’s representations at layer 25 of LLaMA-3-8B.
<details>
<summary>x11.png Details</summary>

### Visual Description
# Technical Document Extraction: Token Probability Analysis
## 1. Document Overview
This image is a violin plot comparing the token probability distributions of two Large Language Models (LLMs) across three distinct categories of outputs. The chart visualizes the density, range, and median values of probabilities assigned to specific tokens.
## 2. Component Isolation
### A. Header / Axis Labels
* **Y-Axis Title:** Token Probability
* **Y-Axis Scale:** Linear, ranging from `0.0` to `1.0` with major tick marks and dashed horizontal grid lines at intervals of `0.2`.
* **X-Axis Categories (Models):**
1. LLaMA-3-8B
2. Mistral-7B-v0.3
### B. Legend (Footer)
The legend is located at the bottom of the chart. Note: There are typographical errors in the original labels ("Associatied" instead of "Associated").
* **Green (Left):** Factual Associations
* **Blue (Middle):** Associated Hallucinations
* **Red/Pink (Right):** Unassociated Hallucinations
---
## 3. Data Extraction and Trend Analysis
Each model group contains three violin plots. Each violin includes a vertical line representing the full range (min to max) and a horizontal crossbar representing the median value.
### Model 1: LLaMA-3-8B
| Category | Color | Visual Trend/Shape | Median (Approx) | Range (Approx) |
| :--- | :--- | :--- | :--- | :--- |
| **Factual Associations** | Green | Wide base at 0.2, tapering to a long thin neck reaching near 1.0. | 0.35 | 0.05 to 0.96 |
| **Associated Hallucinations** | Blue | Bimodal-leaning; wide at 0.2 and 0.5, reaching near 1.0. | 0.38 | 0.02 to 0.96 |
| **Unassociated Hallucinations** | Red | Heavily bottom-weighted; bulbous at 0.1, sharp drop-off. | 0.12 | 0.02 to 0.50 |
### Model 2: Mistral-7B-v0.3
| Category | Color | Visual Trend/Shape | Median (Approx) | Range (Approx) |
| :--- | :--- | :--- | :--- | :--- |
| **Factual Associations** | Green | Similar to LLaMA; wide base at 0.2, long neck to 1.0. | 0.35 | 0.05 to 0.96 |
| **Associated Hallucinations** | Blue | More concentrated density between 0.2 and 0.6. | 0.40 | 0.08 to 0.92 |
| **Unassociated Hallucinations** | Red | Heavily bottom-weighted; very low density above 0.2. | 0.11 | 0.03 to 0.42 |
---
## 4. Key Observations and Data Patterns
1. **High-Confidence Hallucinations:** Both models exhibit "Associated Hallucinations" (Blue) with token probabilities reaching as high as ~0.95. This indicates that when a hallucination is contextually "associated," the models can be extremely confident in the incorrect output.
2. **Factual vs. Associated Hallucination Overlap:** The distributions for Green (Factual) and Blue (Associated Hallucinations) are remarkably similar in shape and median. This suggests that token probability alone is a poor discriminator for distinguishing factual statements from contextually relevant hallucinations.
3. **Unassociated Hallucinations:** The "Unassociated Hallucinations" (Red) consistently show the lowest token probabilities. The medians are near 0.1, and the maximum values rarely exceed 0.5. This suggests that completely random or irrelevant hallucinations are typically generated with lower model confidence.
4. **Model Comparison:** The behavior between `LLaMA-3-8B` and `Mistral-7B-v0.3` is highly consistent, suggesting these probability distribution patterns are a common characteristic of current transformer-based LLMs rather than a specific model quirk.
</details>
Figure 9: Distribution of last token probabilities.
This separation also appears in the entropy of the output distribution (Figure 9). Strong subject-to-last propagation in FAs and AHs yields low-entropy predictions concentrated on the correct or associated entity. In contrast, weak propagation in UHs produces broad, high-entropy distributions, spreading probability mass across many plausible candidates (e.g., multiple possible names for “ The name of the father of <subject> is ”).
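The entropy signal can be sketched directly from the next-token logits at the last position:

```python
import numpy as np

def output_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the next-token distribution.

    logits: (vocab_size,) unnormalized scores at the last position.
    """
    z = logits - logits.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()        # softmax
    return float(-(p * np.log(p + 1e-12)).sum())
```

A distribution concentrated on one entity (FAs, AHs) yields entropy near zero, while probability mass spread over many plausible candidates (UHs) yields entropy approaching the uniform bound $\log |V|$.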
Finding:
From mid-layers onward, UHs retain clustered last-token representations and high-entropy outputs, while FAs and AHs diverge into subject-specific subspaces with low-entropy outputs. This provides a clear signal for separating UHs from FAs and AHs, but not for separating AHs from FAs.
5 Revisiting Hallucination Detection
The mechanistic analysis in § 4 reveals that the internal states of LLMs primarily capture how the model recalls and utilizes its parametric knowledge, not whether the output is truthful. As both factual associations (FAs) and associated hallucinations (AHs) rely on the same subject-driven knowledge recall, their internal states show no clear separation. We therefore hypothesize that internal or black-box signals cannot effectively distinguish AHs from FAs, even though they can be effective in distinguishing unassociated hallucinations (UHs), which do not rely on parametric knowledge, from FAs.
Experimental Setups
To verify this, we revisit the effectiveness of widely adopted white-box hallucination detection approaches that probe internal states, as well as black-box approaches that rely on scalar features. We evaluate three settings: 1) AH Only (1,000 FAs and 1,000 AHs for training; 200 of each for testing), 2) UH Only (1,000 FAs and 1,000 UHs for training; 200 of each for testing), and 3) Full (1,000 FAs and 1,000 hallucination samples mixed from AHs and UHs for training; 200 of each for testing). For each setting, we use five random seeds to construct the training and testing datasets. We report the mean AUROC along with its standard deviation across seeds.
White-box methods: We extract and normalize internal features and then train a probe.
- Subject representations: last subject token hidden state from three consecutive layers Gottesman and Geva (2024).
- Attention flow: attention weights from the last token to subject tokens across all layers Yüksekgönül et al. (2024).
- Last-token representations: final token hidden state from the last layer Orgad et al. (2025).
Black-box methods: We test two commonly used scalar features: answer token probability (Orgad et al., 2025) and subject popularity (average monthly Wikipedia page views) (Mallen et al., 2023a). As discussed in § 4.2.3 and § 4.4, these features also reflect whether the model relies on encoded knowledge to produce outputs, rather than truthfulness itself.
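AUROC, the reported metric, measures how well a probe's scores rank hallucinated samples above factual ones. A tie-free rank-sum (Mann-Whitney U) sketch is below; the probe training itself (feature normalization, classifier choice) is omitted:

```python
import numpy as np

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUROC via the rank-sum formulation (assumes no tied scores).

    scores: detector outputs, higher = more likely hallucination
    labels: 1 for hallucination samples, 0 for factual associations
    """
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # ranks start at 1
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    u = ranks[pos].sum() - n_pos * (n_pos + 1) / 2  # Mann-Whitney U statistic
    return float(u / (n_pos * n_neg))
```

An AUROC of 1.0 means perfect separation, while 0.5 means the feature ranks hallucinations no better than chance, which is roughly what Table 2 reports for AHs under the black-box features.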
Experimental Results
| Method | LLaMA-3-8B (AH Only) | LLaMA-3-8B (UH Only) | Mistral-7B-v0.3 (AH Only) | Mistral-7B-v0.3 (UH Only) |
| --- | --- | --- | --- | --- |
| Subject | $0.65 \pm 0.02$ | $0.91 \pm 0.01$ | $0.57 \pm 0.02$ | $0.81 \pm 0.02$ |
| Attention | $0.58 \pm 0.04$ | $0.92 \pm 0.02$ | $0.58 \pm 0.07$ | $0.87 \pm 0.01$ |
| Last Token | $\mathbf{0.69 \pm 0.03}$ | $\mathbf{0.93 \pm 0.01}$ | $\mathbf{0.63 \pm 0.02}$ | $\mathbf{0.92 \pm 0.01}$ |
| Probability | $0.49 \pm 0.01$ | $0.86 \pm 0.01$ | $0.46 \pm 0.00$ | $0.89 \pm 0.00$ |
| Subject Pop. | $0.48 \pm 0.01$ | $0.87 \pm 0.01$ | $0.52 \pm 0.01$ | $0.84 \pm 0.01$ |
Table 2: Hallucination detection performance on AH Only and UH Only settings.
[Figure: grouped bar chart of AUROC (y-axis, 0.4–0.9) by representation type (Subject, Attention, Last Token) under the Full setting. Unassociated hallucinations are detected at roughly 0.83, 0.84, and 0.88 AUROC on the three representations, respectively, while associated hallucinations remain near 0.60, 0.57, and 0.59.]
Figure 10: Hallucination detection performance on the Full setting (LLaMA-3-8B).
Table 2 shows that hallucination detection methods behave very differently in the AH Only and UH Only settings. For white-box probes, all approaches effectively distinguish UHs from FAs, with last-token hidden states reaching AUROC scores of about 0.93 for LLaMA and 0.92 for Mistral. In contrast, performance drops sharply on the AH Only setting, where the last-token probe falls to 0.69 for LLaMA and 0.63 for Mistral. Black-box methods follow the same pattern. Figure 10 further highlights this disparity under the Full setting: detection is consistently stronger on UH samples than on AH samples, and adding AHs to the training set significantly dilutes performance on UHs (AUROC $≈$ 0.9 on UH Only vs. $≈$ 0.8 on Full).
These results confirm that both internal probes and black-box methods capture whether a model draws on parametric knowledge, not whether its outputs are factually correct. Unassociated hallucinations are easier to detect because they bypass this knowledge, while associated hallucinations are produced through the same recall process as factual answers, leaving no internal cues to distinguish them. As a result, LLMs lack intrinsic awareness of their own truthfulness, and detection methods relying on these signals risk misclassifying associated hallucinations as correct, fostering harmful overconfidence in model outputs.
6 Challenges of Refusal Tuning
A common strategy to mitigate potential hallucination in the model’s responses is to fine-tune LLMs to refuse answering when they cannot provide a factual response, e.g., Refusal Tuning Zhang et al. (2024). For such refusal capability to generalize, the training data must contain a shared feature pattern across hallucinated outputs, allowing the model to learn and apply it to unseen cases.
Our analysis in the previous sections shows that this prerequisite is not met. The structural mismatch between UHs and AHs suggests that refusal tuning on UHs may generalize to other UHs, because their hidden states occupy a common activation subspace, but will not transfer to AHs. Refusal tuning on AHs is even less effective, as their diverse representations prevent generalization to either unseen AHs or UHs.
Experimental Setups
To verify this hypothesis, we conduct refusal tuning on LLMs under two settings: 1) UH Only, where 1,000 UH samples are paired with 10 refusal templates and 1,000 FA samples are preserved with their original answers; 2) AH Only, where 1,000 AH samples are paired with refusal templates, with 1,000 FA samples again left unchanged. We then evaluate both models on 200 samples each of FAs, UHs, and AHs. A response matching any refusal template is counted as a refusal, and we report the Refusal Ratio as the proportion of samples eliciting refusals. This measures not only whether the model refuses appropriately on UHs and AHs, but also whether it "over-refuses" on FA samples.
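The data construction and the Refusal Ratio metric described above can be sketched as follows. This is a minimal illustration: the two templates and all function names are our own stand-ins, not the 10 templates used in the actual experiments.

```python
import random

# Toy stand-ins for the refusal templates.
REFUSAL_TEMPLATES = [
    "I don't know the answer to this question.",
    "I'm not sure about this fact.",
]

def build_refusal_data(fa_samples, hallu_samples, seed=0):
    """Keep FA (query, answer) pairs as-is; replace each hallucinated
    sample's answer with a randomly chosen refusal template."""
    rng = random.Random(seed)
    data = [(q, a) for q, a in fa_samples]
    data += [(q, rng.choice(REFUSAL_TEMPLATES)) for q, _ in hallu_samples]
    return data

def refusal_ratio(responses):
    """Fraction of responses that match any refusal template."""
    hits = sum(any(t in r for t in REFUSAL_TEMPLATES) for r in responses)
    return hits / len(responses)
```

Computing `refusal_ratio` separately on FA, AH, and UH test responses yields the three bars reported per training setting.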
Experimental Results
[Figure: grouped bar chart of Refusal Ratio (%) by training set. Training on UH Only yields refusal ratios of about 30% on FAs, 28% on AHs, and 82% on UHs; training on AH Only yields about 22% on FAs, 33% on AHs, and 24% on UHs.]
Figure 11: Refusal tuning performance across three types of samples (LLaMA-3-8B).
Figure 11 shows that training with UHs leads to strong generalization across UHs, with a refusal ratio of 82% for LLaMA. However, this effect does not transfer to AHs, where the refusal ratio falls to 28%. Moreover, some FA cases are mistakenly refused (29.5%). These results confirm that UHs share a common activation subspace, supporting generalization within the category, while AHs and FAs lie outside this space. By contrast, training with AHs produces poor generalization: even on AH test samples, the refusal ratio is only 33%, confirming that their subject-specific hidden states prevent consistent refusal learning. Generalization to UHs is also weak (23.5%), again reflecting the divergence between AH and UH activation spaces.
Overall, these findings show that the generalizability of refusal tuning is fundamentally limited by the heterogeneous nature of hallucinations. UH representations are internally consistent enough to support refusal generalization, but AH representations are too diverse for either UH-based or AH-based training to yield a broadly applicable and reliable refusal capability.
7 Conclusions and Future Work
In this work, we revisit the widely accepted claim that hallucinations can be detected from a model's internal states. Our mechanistic analysis reveals that hidden states encode whether models rely on their parametric knowledge rather than whether their outputs are truthful. As a result, detection methods succeed only when outputs are detached from the input but fail when hallucinations arise from the same knowledge-recall process as correct answers.
These findings lead to three key implications. First, future evaluations should report detection performance separately for Associated Hallucinations (AHs) and Unassociated Hallucinations (UHs), as they stem from fundamentally different internal processes and require distinct detection strategies. Second, relying solely on hidden states is insufficient for reliable hallucination detection. Future research should integrate LLMs with external feedback mechanisms, such as fact-checking modules or retrieval-based verifiers, to assess factuality more robustly. Third, future studies should prioritize improving AH detection. Because AHs occur more frequently in widely known or highly popular topics (§ 4.2.3), their undetected errors pose greater risks to user trust and the practical reliability of LLMs.
Limitations
We identify several limitations of our work.
Focus on Factual Knowledge
While our analysis identifies failure cases of hallucination detection methods, our study is primarily limited to factual completion prompts. It does not extend to long-form or open-ended text generation tasks Wei et al. (2024); Min et al. (2023); Huang and Chen (2024). Future work should broaden this investigation to these tasks in order to draw more comprehensive conclusions.
Lack of Analysis on Prompt-based Hallucination Detection Approaches
Our analysis focuses on white-box hallucination detection methods based on internal states and two black-box approaches based on external features. We do not include verbalization-based strategies Lin et al. (2022a); Tian et al. (2023); Xiong et al. (2024); Yang et al. (2024b); Ni et al. (2024); Zhao et al. (2024), such as prompting the model to report or justify its confidence explicitly, which constitute a different line of approach. Exploring such approaches may offer complementary insights into how models internally represent and express uncertainty.
Applicability to Black-box LLMs or Large Reasoning Models
Our study is limited to open-source LLMs. Conducting mechanistic analyses on commercial black-box LLMs is not permitted due to access restrictions. Future work could explore alternative evaluation protocols or collaboration frameworks that enable partial interpretability analyses on such systems. In addition, recent studies Mei et al. (2025); Zhang et al. (2025) have begun examining the internal states of large reasoning models for hallucination detection, suggesting a promising direction for extending our methodology to models with multi-step reasoning capabilities.
Ethical Considerations
This work analyzes the internal mechanisms of large language models using data constructed from Wikidata Vrandecic and Krötzsch (2014), which is released under the Creative Commons CC0 1.0 Universal license, allowing unrestricted use and redistribution of its data. All data are derived from publicly available resources, and no private or sensitive information about individuals is included. We used LLM tools only for polishing the writing.
References
- Azaria and Mitchell (2023) Amos Azaria and Tom M. Mitchell. 2023. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976.
- Cheang et al. (2023) Chi Seng Cheang, Hou Pong Chan, Derek F. Wong, Xuebo Liu, Zhaocong Li, Yanming Sun, Shudong Liu, and Lidia S. Chao. 2023. Can lms generalize to future data? an empirical analysis on text summarization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 16205–16217. Association for Computational Linguistics.
- Chen et al. (2024) Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. INSIDE: llms’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Han et al. (2023) Daniel Han, Michael Han, and the Unsloth team. 2023. Unsloth.
- Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Ding et al. (2024) Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. 2024. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. CoRR, abs/2402.10612.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, and 82 others. 2024. The llama 3 herd of models. CoRR, abs/2407.21783.
- Finlayson et al. (2021) Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart M. Shieber, Tal Linzen, and Yonatan Belinkov. 2021. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1828–1843. Association for Computational Linguistics.
- Gekhman et al. (2025) Zorik Gekhman, Eyal Ben-David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. 2025. Inside-out: Hidden factual knowledge in llms. CoRR, abs/2503.15299.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12216–12235. Association for Computational Linguistics.
- Gottesman and Geva (2024) Daniela Gottesman and Mor Geva. 2024. Estimating knowledge in large language models without generating a single token. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, pages 3994–4019.
- Guerreiro et al. (2023) Nuno Miguel Guerreiro, Elena Voita, and André F. T. Martins. 2023. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 1059–1075. Association for Computational Linguistics.
- Huang and Chen (2024) Chao-Wei Huang and Yun-Nung Chen. 2024. Factalign: Long-form factuality alignment of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16363–16375.
- Huang et al. (2025) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43(2):42:1–42:55.
- Ji et al. (2024) Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, and Pascale Fung. 2024. LLM internal states reveal hallucination risk faced with a query. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 88–104, Miami, Florida, US. Association for Computational Linguistics.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
- Kang and Choi (2023) Cheongwoong Kang and Jaesik Choi. 2023. Impact of co-occurrence on factual knowledge of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7721–7735.
- Kang et al. (2024) Katie Kang, Amrith Setlur, Claire J. Tomlin, and Sergey Levine. 2024. Deep neural networks tend to extrapolate predictably. In The Twelfth International Conference on Learning Representations, ICLR 2024.
- Kapoor et al. (2024) Sanyam Kapoor, Nate Gruver, Manley Roberts, Katie Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. 2024. Large language models must be taught to know what they don’t know. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.
- Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Li et al. (2023) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36:41451–41530.
- Li et al. (2025) Moxin Li, Yong Zhao, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, Tat-Seng Chua, and Yang Deng. 2025. Knowledge boundary of large language models: A survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, pages 5131–5157.
- Lin et al. (2022a) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022a. Teaching models to express their uncertainty in words. Trans. Mach. Learn. Res., 2022.
- Lin et al. (2022b) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022b. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 3214–3252.
- Mallen et al. (2023a) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023a. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, pages 9802–9822.
- Mallen et al. (2023b) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9802–9822. Association for Computational Linguistics.
- Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 9004–9017. Association for Computational Linguistics.
- Mei et al. (2025) Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, and Anirudha Majumdar. 2025. Reasoning about uncertainty: Do reasoning models know when they don’t know? CoRR, abs/2506.18183.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, pages 12076–12100.
- Ni et al. (2024) Shiyu Ni, Keping Bi, Jiafeng Guo, and Xueqi Cheng. 2024. When do llms need retrieval augmentation? mitigating llms’ overconfidence helps retrieval augmentation. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11375–11388. Association for Computational Linguistics.
- Ni et al. (2025) Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, and Xueqi Cheng. 2025. Towards fully exploiting LLM internal states to enhance knowledge boundary perception. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24315–24329. Association for Computational Linguistics.
- Orgad et al. (2025) Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. 2025. Llms know more than they show: On the intrinsic representation of LLM hallucinations. In The Thirteenth International Conference on Learning Representations, ICLR 2025.
- Sciavolino et al. (2021) Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. 2021. Simple entity-centric questions challenge dense retrievers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 6138–6148. Association for Computational Linguistics.
- Su et al. (2024) Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 14379–14391. Association for Computational Linguistics.
- Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 5433–5442. Association for Computational Linguistics.
- Varshney et al. (2023) Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. CoRR, abs/2307.03987.
- Vig et al. (2020) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388–12401.
- Vrandecic and Krötzsch (2014) Denny Vrandecic and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85.
- Wei et al. (2024) Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V. Le. 2024. Long-form factuality in large language models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
- Xiao et al. (2025) Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, and Yu Rong. 2025. Analyzing llms’ knowledge boundary cognition across languages through the lens of internal representations. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 24099–24115. Association for Computational Linguistics.
- Xiong et al. (2024) Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2024. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Yang et al. (2024a) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others. 2024a. Qwen2.5 technical report. CoRR, abs/2412.15115.
- Yang et al. (2024b) Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. 2024b. Alignment for honesty. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024.
- Yin et al. (2023) Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do large language models know what they don’t know? In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 8653–8665. Association for Computational Linguistics.
- Yona et al. (2024) Gal Yona, Roee Aharoni, and Mor Geva. 2024. Narrowing the knowledge evaluation gap: Open-domain question answering with multi-granularity answers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 6737–6751. Association for Computational Linguistics.
- Yüksekgönül et al. (2024) Mert Yüksekgönül, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, and Besmira Nushi. 2024. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. In The Twelfth International Conference on Learning Representations, ICLR 2024.
- Zhang et al. (2024) Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2024. R-tuning: Instructing large language models to say ’i don’t know’. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, pages 7113–7139.
- Zhang et al. (2023a) Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, and Kumar Sricharan. 2023a. Sac ${}^{\mbox{3}}$ : Reliable hallucination detection in black-box language models via semantic-aware cross-check consistency. CoRR, abs/2311.01740.
- Zhang et al. (2025) Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, and Han Qiu. 2025. On the self-awareness of large reasoning models’ capability boundaries. Preprint, arXiv:2509.24711.
- Zhang et al. (2023b) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023b. Siren’s song in the AI ocean: A survey on hallucination in large language models. CoRR, abs/2309.01219.
- Zhao et al. (2024) Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Chong Meng, Shuaiqiang Wang, Zhicong Cheng, Zhaochun Ren, and Dawei Yin. 2024. Knowing what llms DO NOT know: A simple yet effective self-detection method. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, pages 7051–7063.
Appendix
Appendix A Datasets and Implementations
A.1 Selected Relations and Prompt Templates
We employed a set of criteria to select relations from Wikidata in order to construct our dataset. Our criteria largely follow the framework proposed by Gekhman et al. (2025). Specifically, we require that each factual query in the dataset be unambiguous: given a subject–relation pair, the object should be unique and easily verifiable. The criteria are as follows:
- Avoid granularity ambiguity. We exclude relations whose answers can vary in their level of detail. For example, in location queries, the response could be expressed as a city, state, or country, making it ill-defined Yona et al. (2024).
- Avoid surface-level guessing. We exclude relations whose correct answers can often be inferred from shallow patterns. For instance, country of citizenship can frequently be guessed from shallow lexical patterns rather than reflecting actual memorization Mallen et al. (2023b).
Following these criteria, Gekhman et al. (2025) narrowed the 24 relations introduced by Sciavolino et al. (2021) down to four. However, we observe that their filtering primarily addresses ambiguity at the relation and object levels, but does not consider ambiguity at the subject level. In practice, some relations involve subjects that are inherently ambiguous. For example, the relation record label can be problematic because many songs share identical names, leading to unclear subject–object mappings.
To mitigate such cases, we apply an additional subject-level filtering step and restrict our dataset to relations where the subject is a person, thereby reducing ambiguity. In addition, we manually include certain relations to strengthen the dataset. Concretely, we use the following four relations: P22 (father), P25 (mother), P26 (spouse), and P569 (date of birth). We show the list of the templates used to create our dataset in Table 3.
| Relation | Prompt Template |
| --- | --- |
| father | The name of the father of [subject] is |
| mother | The name of the mother of [subject] is |
| spouse | The name of the spouse of [subject] is |
| date of birth | The birth date of [subject] is |
Table 3: Relations and prompt templates for querying factual knowledge of models. [subject] is a placeholder replaced with subject entities.
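The templates in Table 3 can be instantiated mechanically. A minimal sketch (the relation IDs and template strings come from the table above; the helper name is ours, not from the paper's code):

```python
# Prompt templates from Table 3; [subject] is a placeholder for the entity name.
TEMPLATES = {
    "P22": "The name of the father of [subject] is",
    "P25": "The name of the mother of [subject] is",
    "P26": "The name of the spouse of [subject] is",
    "P569": "The birth date of [subject] is",
}

def build_prompt(relation_id: str, subject: str) -> str:
    """Fill a relation template with a concrete subject entity."""
    return TEMPLATES[relation_id].replace("[subject]", subject)

print(build_prompt("P22", "Barack Obama"))
# -> The name of the father of Barack Obama is
```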
| I will give you a factual query (e.g., “The name of the father of <subj>”), a gold answer to the factual query, and a proposed answer generated by an LLM. You need to compare the proposed answer to the gold answer and assign it one of the possible grades using the steps below. |
| --- |
| Possible grades are: |
| A: CORRECT |
| B: INCORRECT |
| C: WRONG GOLD |
| D: ERROR |
| Spelling errors, synonyms, abbreviations, or hedging expressions (e.g., “it is possible that”) should not alter the grade if the person referred to in the proposed answer matches the gold answer. |
| Steps: |
| Step 1: If the gold answer does not correspond to an answer for the question, output “C” and finish. Otherwise, proceed to Step 2. |
| Step 2: Extract all predicted entities from the proposed answer. Proceed to Step 3. |
| Step 3: If each predicted entity refers to the answer mentioned in the gold answer, output “A” and finish. Otherwise, proceed to Step 4. |
| Step 4: If the predicted entity does not refer to the gold answer, output “B” and finish. Otherwise, proceed to Step 5. |
| Step 5: Double-check whether the proposed answer refers to a different answer from the gold answer. If it does, output “B.” Otherwise, output “D” and finish. |
| Input format: |
| Question: {question} |
| Gold answer: {gold_answer} |
| Proposed answer: {proposed_answer} |
| Instruction: Output your reasoning steps. After that, conclude your response with “Output:” followed by the letter (A, B, C, or D). Do not provide any further explanation. |
Figure 12: LLM Judge prompt used for evaluation.
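The judge is instructed to end its response with “Output:” followed by a letter. A small helper (hypothetical, not from the paper's code) can extract the final grade from such a response:

```python
import re

# Grade letters defined in the judge prompt of Figure 12.
GRADES = {"A": "CORRECT", "B": "INCORRECT", "C": "WRONG GOLD", "D": "ERROR"}

def parse_judge_grade(response: str):
    """Extract the last 'Output: <letter>' grade from a judge response.

    Returns the grade name, or None if no grade line is found.
    """
    matches = re.findall(r"Output:\s*([ABCD])\b", response)
    return GRADES[matches[-1]] if matches else None

print(parse_judge_grade("Step 3: entities match the gold answer.\nOutput: A"))
# -> CORRECT
```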
A.2 Labeling Scheme
We follow the criteria in § 3 to label the data samples into different categories:
- Factual Correctness: We construct correctness labels through a two-stage process. First, we use the spaCy (https://spacy.io/) named entity recognizer to extract the target entity from the model’s output. If it matches the ground truth, the answer is marked correct. Otherwise, or if extraction fails, we rely on Qwen2.5-14B-Instruct Yang et al. (2024a) as an automatic judge to compare the predicted answer with the ground truth. Following Gekhman et al. (2025), we design the evaluation prompt shown in Figure 12.
- Subject Representation Reliance: We assess whether a prediction relies on the subject’s representation by blocking attention from subject tokens and measuring the resulting distribution shift. If the subject is crucial, masking disrupts information flow and yields a large shift; if not, the effect is minimal. Concretely, we compare the output distributions of the original prompt and the masked prompt (e.g., with “ Barack Obama ” masked), using Jensen–Shannon (JS) divergence to quantify the difference. A high JS divergence indicates strong reliance on the subject, while a low value suggests limited contribution. We then set a threshold based on the average JS divergence across all correct answers, assuming these inherently depend on subject representations.
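The reliance test above reduces to comparing two output distributions. A minimal sketch with toy distributions (the paper applies this to the model's next-token distributions before and after subject masking; the numbers here are illustrative only):

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence (base 2) between two probability distributions."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy next-token distributions: original prompt vs. subject-masked prompt.
original  = np.array([0.7, 0.2, 0.1])
masked    = np.array([0.1, 0.2, 0.7])  # masking shifts the prediction a lot
unchanged = np.array([0.7, 0.2, 0.1])  # masking has no effect

# A large divergence signals strong reliance on the subject tokens.
print(js_divergence(original, masked) > js_divergence(original, unchanged))
# -> True
```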
[Image x14.png: heatmap of average JS divergence (scale 0.1–0.6) across 32 layers, rows Subj./Attn./Last.; subject-token patching is highly influential (~0.5–0.6) up to about layer 18 and negligible afterward, attention blocking stays near 0.1 throughout, and last-token patching rises from about layer 8 to a stable ~0.3–0.35.]
(a) Factual Associations
[Image x15.png: heatmap of average JS divergence across 32 layers for associated hallucinations; the pattern is nearly identical to the factual-association case, with Subj. high (~0.5–0.6) up to about layer 17, Attn. near 0.1 throughout, and Last. rising after layer 8 to ~0.3–0.35.]
(b) Associated Hallucinations
[Image x16.png: heatmap of average JS divergence across 32 layers for unassociated hallucinations; Subj. is only moderately influential (~0.4) in early layers and fades by layer 20, Attn. stays near 0.1, and Last. rises gradually to ~0.25.]
(c) Unassociated Hallucinations
Figure 13: Effect of interventions across layers of Mistral-7B-v0.3. The heatmap shows JS divergence between the output distribution before and after intervention. Darker color indicates that the intervened hidden states are more causally influential on the model’s predictions. Top row: patching representations of subject tokens. Middle row: blocking attention flow from subject to the last token. Bottom row: patching representations of the last token.
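The middle-row intervention (blocking attention flow from subject tokens to the last token) can be illustrated in plain numpy: setting the last token's attention scores toward subject positions to negative infinity before the softmax zeroes out their contribution. A toy sketch, not the actual implementation:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Attention scores from the last token to 5 preceding positions;
# positions 0-1 stand for the subject tokens ("Barack", "Obama").
scores = np.array([2.0, 1.5, 0.3, 0.1, 0.8])
subject_positions = [0, 1]

blocked = scores.copy()
blocked[subject_positions] = -np.inf  # block attention flow from the subject

weights = softmax(blocked)
print(weights[subject_positions])  # the subject receives exactly zero attention
```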
A.3 Implementation Details
Checkpoints and GPU resources.
All the checkpoints used in our experiments are provided by the Hugging Face Transformers library Wolf et al. (2019). Specifically, we use the checkpoints “meta-llama/Meta-Llama-3-8B” (https://huggingface.co/meta-llama/Meta-Llama-3-8B) and “mistralai/Mistral-7B-v0.3” (https://huggingface.co/mistralai/Mistral-7B-v0.3) for response generation (§ 3), hidden-state analysis (§ 4), and assessing the performance of hallucination detection methods (§ 5). For refusal tuning (§ 6), we use checkpoints provided by the Unsloth framework Daniel Han and team (2023), namely “unsloth/llama-3-8b” (https://huggingface.co/unsloth/llama-3-8b) and “unsloth/mistral-7b-v0.3” (https://huggingface.co/unsloth/mistral-7b-v0.3), which enable more efficient fine-tuning. All experiments are conducted on 4 NVIDIA L40S GPUs.
[Image x17.png: line graph of norm ratio vs. layers 0–31; Asso. Hallu./Factual Asso. stays close to 1.00 throughout, while Unasso. Hallu./Factual Asso. drops to ~0.955 in layers 3–17, recovers by layer 20, and spikes above 1.0 only at the final layer.]
Figure 14: Norm ratio curves of subject representations in Mistral-7B-v0.3, comparing AHs and UHs against FAs as the baseline. At earlier layers, the norm of UH samples is significantly lower than that of AH samples.
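The norm-ratio curves above are straightforward to compute from per-layer subject representations: divide the group's mean L2 norm by that of the factual baseline, layer by layer. A toy numpy sketch, with hypothetical random arrays standing in for extracted hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical subject representations: (num_samples, num_layers, hidden_dim).
factual = rng.normal(size=(100, 32, 64))
unassociated = 0.9 * rng.normal(size=(100, 32, 64))  # systematically smaller norms

def norm_ratio(group: np.ndarray, baseline: np.ndarray) -> np.ndarray:
    """Per-layer ratio of mean L2 norms, as plotted in Figure 14."""
    g = np.linalg.norm(group, axis=-1).mean(axis=0)     # (num_layers,)
    b = np.linalg.norm(baseline, axis=-1).mean(axis=0)  # (num_layers,)
    return g / b

ratios = norm_ratio(unassociated, factual)
print(ratios.shape)  # -> (32,)
print((ratios < 1.0).all())  # the smaller-norm group stays below the baseline
```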
[Image x18.png: grouped bar chart of sample percentages by subject popularity; from low to high popularity, Factual Associations rise 5%→25%→48% and Associated Hallucinations rise 2%→6%→12%, while Unassociated Hallucinations fall 93%→70%→40%.]
Figure 15: Sample distribution across different subject popularity (low, mid, high) in Mistral-7B-v0.3, measured by monthly Wikipedia page views.
Decoding algorithm.
We employ greedy decoding ( $\text{temperature}=0$ ) for response generation, with models run in BF16 precision.
PEFT settings for refusal tuning.
For refusal tuning, we fine-tune both models using QLoRA Dettmers et al. (2023), implemented with the Unsloth framework Daniel Han and team (2023), with rank $r=8$ and scaling factor $\alpha=8$. QLoRA adapters are applied to all attention and MLP modules, and each model is fine-tuned for one epoch.
[Image x19.png: line graph of subject-to-last attention contribution norm vs. layer; the Factual Asso. and Asso. Hallu. curves overlap closely, spiking to ~4.8 around layer 20 with secondary peaks near layer 29, while Unasso. Hallu. remains below ~1.3 at every layer.]
Figure 16: Subject-to-last attention contribution norms across layers in Mistral-7B-v0.3. Values show the norm of the attention contribution from subject tokens to the last token at each layer.
[Image x20.png: line graph of cosine similarity vs. layer; all three groups start near 0.9, then after layer 17 Factual Associations and Associated Hallucinations collapse to ~0.27 by layer 26 while Unassociated Hallucinations only fall to ~0.6, with all groups partially recovering toward layer 31.]
Figure 17: Cosine similarity of target-token hidden states across layers in Mistral-7B-v0.3. From mid-layers onward, FAs and AHs diverge sharply as subject information propagates, while UHs remain more clustered, confirming weaker subject-dependent updates.
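The similarity in Figure 17 can be computed as the mean pairwise cosine similarity of hidden states within a group at a given layer. A minimal numpy sketch; the synthetic clusters only mimic the clustered-vs-dispersed contrast between UHs and FAs/AHs:

```python
import numpy as np

def mean_pairwise_cosine(states: np.ndarray) -> float:
    """Mean pairwise cosine similarity over a (num_samples, hidden_dim) matrix."""
    normed = states / np.linalg.norm(states, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(states)
    # Average off-diagonal entries only (exclude each vector's self-similarity).
    return (sims.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(1)
base = rng.normal(size=64)
clustered = base + 0.1 * rng.normal(size=(20, 64))  # tight cluster (like UHs)
dispersed = rng.normal(size=(20, 64))               # spread out (like FAs/AHs)

print(mean_pairwise_cosine(clustered) > mean_pairwise_cosine(dispersed))
# -> True
```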
[Image x21.png: t-SNE scatter plot; Unassociated Hallucinations form tight, separated clusters, whereas Factual Associations and Associated Hallucinations are heavily intermingled in the same region of the projection.]
Figure 18: t-SNE visualization of last token’s representations at layer 25 of Mistral-7B-v0.3.
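The t-SNE projections shown in Figures 18, 21, and 22 can be reproduced with a short sketch once the hidden states have been collected into a matrix. The array shapes, group means, and label names below are illustrative stand-ins, not the paper's actual data:

```python
# Minimal t-SNE sketch, assuming hidden states are stacked into a
# (n_samples, d_model) array with one row per example. The synthetic
# groups below mimic the qualitative picture: factual and associated
# samples overlap, unassociated hallucinations sit apart.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
hidden = np.vstack([
    rng.normal(0.0, 1.0, (100, 64)),  # Factual Asso.
    rng.normal(0.2, 1.0, (100, 64)),  # Asso. Hallu. (overlaps factual)
    rng.normal(3.0, 1.0, (100, 64)),  # Unasso. Hallu. (separated)
])
labels = np.repeat(["Factual Asso.", "Asso. Hallu.", "Unasso. Hallu."], 100)

# Project to 2D; perplexity is a free parameter, 30 is a common default.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(hidden)
print(coords.shape)  # (300, 2)
```

The resulting `coords` can then be scattered with one semi-transparent color per label to obtain a plot of the kind described above.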
<details>
<summary>x22.png Details</summary>

### Visual Description
# Technical Data Extraction: AUROC Performance by Representation Type
## 1. Document Overview
This image is a grouped bar chart comparing the performance (measured in AUROC) of two different hallucination categories across three distinct representation types. The chart includes error bars representing variability or confidence intervals for each data point.
## 2. Component Isolation
### A. Header / Axis Labels
* **Y-Axis Title:** AUROC
* **X-Axis Title:** Representation Type
* **Y-Axis Scale:** 0.4 to 0.9 (increments of 0.1 marked, with grid lines every 0.05).
* **X-Axis Categories:** Subject, Attention, Last Token.
### B. Legend (Spatial Grounding: Bottom Center)
* **Red Bar (Left in group):** Unassociated Hallucination (Note: Corrected from "Unassoiated Halluciation" in original image).
* **Blue Bar (Right in group):** Associated Hallucination (Note: Corrected from "Assoiated Halluciation" in original image).
---
## 3. Data Table Reconstruction
The following values are estimated based on the visual alignment with the Y-axis grid lines.
| Representation Type | Unassociated Hallucination (Red) | Associated Hallucination (Blue) |
| :--- | :--- | :--- |
| **Subject** | ~0.89 (Error: ±0.015) | ~0.59 (Error: ±0.03) |
| **Attention** | ~0.78 (Error: ±0.025) | ~0.57 (Error: ±0.035) |
| **Last Token** | ~0.84 (Error: ±0.02) | ~0.56 (Error: ±0.025) |
---
## 4. Trend Verification and Analysis
### Unassociated Hallucination (Red Series)
* **Visual Trend:** The series starts at its highest point at "Subject," drops significantly at "Attention," and then recovers partially at "Last Token."
* **Performance:** This category consistently outperforms the "Associated" category across all representation types, maintaining an AUROC above 0.75.
* **Peak:** Highest performance is achieved using the **Subject** representation (~0.89).
### Associated Hallucination (Blue Series)
* **Visual Trend:** The series shows a slight, steady downward slope from left to right.
* **Performance:** This category performs significantly worse than the "Unassociated" category, hovering between 0.55 and 0.60 AUROC, which is closer to random chance (0.5) than the red series.
* **Peak:** Highest performance is achieved using the **Subject** representation (~0.59).
---
## 5. Technical Observations
* **Gap Analysis:** There is a substantial performance gap (approx. 0.2 to 0.3 AUROC) between Unassociated and Associated hallucinations across all tested representations.
* **Error Bars:** The error bars for "Unassociated Hallucination" at the "Subject" position are the smallest, suggesting the highest precision/consistency in that specific measurement. The "Attention" and "Associated" categories generally show larger error bars, indicating higher variance.
* **Textual Note:** There are spelling errors in the legend of the source image ("Unassoiated" and "Assoiated" instead of "Unassociated" and "Associated").
</details>
Figure 19: Hallucination detection performance on the Full setting (Mistral-v0.3-7B).
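The AUROC values behind bar charts like Figure 19 are typically obtained by training a lightweight probe on the chosen representation and scoring it on held-out examples. A hedged sketch with a logistic-regression probe on synthetic features (the data and split are illustrative, not the paper's):

```python
# Probe-based hallucination detection scored with AUROC. The synthetic
# features make the two classes clearly separable, mimicking the
# unassociated-hallucination case where detection works well.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# 0 = factual, 1 = hallucinated; 32-d stand-in representations.
X = np.vstack([rng.normal(0, 1, (200, 32)), rng.normal(2, 1, (200, 32))])
y = np.concatenate([np.zeros(200), np.ones(200)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
```

An AUROC near 0.5 means the probe is no better than chance, which is the regime the blue (associated-hallucination) bars approach in the figure.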
<details>
<summary>x23.png Details</summary>

### Visual Description
# Technical Document Extraction: Refusal Ratio Analysis
## 1. Image Overview
This image is a grouped bar chart illustrating the "Refusal Ratio (%)" of a system (likely a Large Language Model) across different training and testing conditions. It compares how training on specific types of data (UH vs. AH) affects the system's tendency to refuse prompts categorized by factual association and hallucination types.
## 2. Component Isolation
### A. Header/Axes
* **Y-Axis Label:** Refusal Ratio (%)
* **Y-Axis Scale:** 0 to 100, with major markers every 20 units (0, 20, 40, 60, 80, 100).
* **X-Axis Label:** Training Set
* **X-Axis Categories:** "UH Only" and "AH Only"
* **Gridlines:** Horizontal dashed light-gray lines at 20, 40, 60, and 80 on the Y-axis.
### B. Legend
The legend defines three categories for the **Testing set**:
* **Green:** Factual Asso.
* **Blue:** Asso. Hallu.
* **Red/Salmon:** Unasso. Halluc.
## 3. Data Extraction and Trend Analysis
### Trend Verification
1. **Unasso. Halluc. (Red):** This series shows the highest refusal ratios in both training scenarios but drops significantly when moving from "UH Only" training to "AH Only" training.
2. **Asso. Hallu. (Blue):** This series shows a moderate increase in refusal ratio when moving from "UH Only" to "AH Only" training.
3. **Factual Asso. (Green):** This series shows the lowest refusal ratios overall, with a slight increase when moving from "UH Only" to "AH Only" training.
### Data Table (Reconstructed)
Values are estimated based on the Y-axis scale and gridlines.
| Training Set (X-Axis) | Factual Asso. (Green) | Asso. Hallu. (Blue) | Unasso. Halluc. (Red) |
| :--- | :--- | :--- | :--- |
| **UH Only** | ~11% | ~14% | ~87% |
| **AH Only** | ~16% | ~22% | ~53% |
## 4. Detailed Observations
* **Dominant Category:** The "Unasso. Halluc." (Unassociated Hallucination) testing set consistently triggers the highest refusal ratio regardless of the training set.
* **Impact of Training:**
* Training on **"UH Only"** (Unassociated Hallucination) results in an extremely high refusal rate for unassociated hallucinations (~87%) but very low refusal for factual associations (~11%).
* Training on **"AH Only"** (Associated Hallucination) leads to a more balanced, though still skewed, refusal profile. It reduces the refusal of unassociated hallucinations to ~53% while slightly increasing the refusal of factual associations and associated hallucinations.
* **Cross-Reference Check:** The red bar in the "UH Only" group nearly reaches the 90% mark, while the red bar in the "AH Only" group sits just above the 50% midline between the 40 and 60 gridlines, confirming the data points align with the visual representation.
</details>
Figure 20: Refusal tuning performance across three types of samples (Mistral-v0.3-7B).
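The refusal ratio plotted in Figure 20 is simply the percentage of test prompts the tuned model declines to answer. A minimal counting sketch; the refusal phrases and responses below are illustrative, not the paper's actual matcher or outputs:

```python
# Fraction of responses that begin with a refusal phrase, expressed
# as a percentage. Marker phrases are hypothetical examples.
REFUSAL_MARKERS = ("i don't know", "i cannot", "i'm not sure")

def refusal_ratio(responses):
    """Percentage of responses that open with a refusal marker."""
    refused = sum(r.strip().lower().startswith(REFUSAL_MARKERS) for r in responses)
    return 100.0 * refused / len(responses)

responses = [
    "I don't know the answer to that.",
    "Barack Obama was born in Honolulu.",
    "I cannot answer this question.",
    "Paris is the capital of France.",
]
print(refusal_ratio(responses))  # 50.0
```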
Appendix B Parallel Experiments on Mistral
This section documents parallel experiments conducted on the Mistral-7B-v0.3 model under the same settings as in the main text (Figures 13–20).
The Mistral results exhibit patterns similar to those observed for LLaMA: we find consistent behavior in the model’s internal computations and hidden states, as well as in the hallucination detection and refusal tuning experiments.
<details>
<summary>x24.png Details</summary>

### Visual Description
# Technical Document Extraction: t-SNE Visualization of Hallucination Types
## 1. Image Overview
This image is a 2D scatter plot, likely generated using a dimensionality reduction technique such as t-SNE (t-distributed Stochastic Neighbor Embedding). It visualizes the clustering or distribution of three distinct categories of data points based on their semantic or factual properties.
## 2. Component Isolation
### A. Header/Title
* **Content:** None present.
### B. Main Chart Area (Data Visualization)
* **Type:** Scatter Plot.
* **X-Axis:** Numerical scale ranging from approximately **-30 to +25**. Major tick marks are labeled at **-20, -10, 0, 10, 20**.
* **Y-Axis:** Numerical scale ranging from approximately **-25 to +25**. Major tick marks are labeled at **-20, -10, 0, 10, 20**.
* **Data Points:** Approximately 300-400 semi-transparent circular markers.
### C. Legend (Spatial Grounding: Bottom-Left [x≈-28, y≈-22])
The legend is enclosed in a white box with a grey border. It maps colors to specific categories:
| Color | Label | Full Name |
| :--- | :--- | :--- |
| Green | `Factual Asso.` | Factual Association |
| Blue | `Asso. Hallu.` | Associative Hallucination |
| Red | `Unasso. Hallu.` | Unassociated Hallucination |
---
## 3. Data Series Analysis and Trends
### Series 1: Factual Asso. (Green)
* **Visual Trend:** This series is widely dispersed across the upper and central regions of the plot. It shows a high degree of overlap with the "Asso. Hallu." (Blue) series.
* **Spatial Distribution:**
* Concentrated primarily between X: [-20, 5] and Y: [-5, 20].
* Outliers exist in the lower-right quadrant (X: 10, Y: -15).
* **Observation:** The green points act as a "bridge" or background distribution for the other two types, indicating that factual associations share feature space with both types of hallucinations.
### Series 2: Asso. Hallu. (Blue)
* **Visual Trend:** This series is predominantly located in the upper-left and center-left of the plot, with a secondary cluster in the bottom-center.
* **Spatial Distribution:**
* Primary cluster: X: [-25, 5], Y: [0, 20].
* Secondary cluster: X: [0, 10], Y: [-25, -15].
* **Observation:** There is significant intermingling with the Green series, suggesting that "Associative Hallucinations" are semantically close to "Factual Associations."
### Series 3: Unasso. Hallu. (Red)
* **Visual Trend:** This series shows the most distinct clustering behavior. While some points are scattered in the center, there is a very dense, isolated cluster on the far right.
* **Spatial Distribution:**
* **Main Cluster:** X: [12, 23], Y: [-18, 2]. This cluster is relatively "clean" with very few green or blue points intermixed.
* **Scattered Points:** A few points are located in the center-top (X: -7, Y: 23) and center-bottom (X: 5, Y: -25).
* **Observation:** The distinct cluster on the right suggests that "Unassociated Hallucinations" possess unique features or vector representations that set them apart from both factual data and associative hallucinations.
---
## 4. Summary of Findings
The visualization demonstrates a gradient of semantic similarity:
1. **Factual Associations (Green)** and **Associative Hallucinations (Blue)** are highly related and occupy a similar region in the high-dimensional space (projected to the left and center of the plot).
2. **Unassociated Hallucinations (Red)** exhibit a unique signature, forming a distinct cluster on the right side of the plot, indicating they are mathematically/semantically distant from the other two categories.
</details>
Figure 21: t-SNE visualization of subject tokens’ representations at layer 11 of LLaMA-3-8B.
<details>
<summary>x25.png Details</summary>

### Visual Description
# Technical Document Extraction: 2D Dimensionality Reduction Plot
## 1. Image Overview
This image is a scatter plot representing a 2D dimensionality reduction (likely t-SNE or UMAP) of data points categorized into three distinct classes. The plot visualizes the clustering and overlap of different types of model outputs or associations.
## 2. Component Isolation
### A. Header/Legend
* **Location:** Top-right corner [approx. x=0.7 to 0.95, y=0.75 to 0.95 in normalized coordinates].
* **Legend Items:**
* **Green Circle ($\bullet$):** Factual Asso. (Factual Association)
* **Blue Circle ($\bullet$):** Asso. Hallu. (Associative Hallucination)
* **Red Circle ($\bullet$):** Unasso. Hallu. (Unassociated Hallucination)
### B. Main Chart Area (Axes)
* **X-Axis:** Numerical scale ranging from approximately **-25 to +25**. Major tick marks are labeled at **-20, -10, 0, 10, 20**.
* **Y-Axis:** Numerical scale ranging from approximately **-30 to +35**. Major tick marks are labeled at **-20, -10, 0, 10, 20, 30**.
* **Data Points:** Semi-transparent circular markers. Overlapping points create darker, more opaque regions.
---
## 3. Data Series Analysis and Trends
### Series 1: Factual Asso. (Green)
* **Visual Trend:** This series is widely dispersed across the lower half and right side of the plot. It shows significant overlap with the "Asso. Hallu." (Blue) series.
* **Spatial Distribution:**
* Concentrated primarily between $x \in [-15, 25]$ and $y \in [-25, 10]$.
* A small, dense cluster is visible at the top center-left near $x \approx -8, y \approx 30$.
* Scattered outliers exist on the far right edge ($x \approx 25, y \approx -5$).
### Series 2: Asso. Hallu. (Blue)
* **Visual Trend:** Highly interleaved with the Green series, suggesting these two categories share similar feature spaces or embeddings.
* **Spatial Distribution:**
* Broadly distributed across the entire range of the x-axis.
* Dense clusters appear at $x \approx -8, y \approx 30$ (overlapping with Green) and $x \approx -2, y \approx 24$.
* Significant presence in the lower-right quadrant ($x > 0, y < 0$).
### Series 3: Unasso. Hallu. (Red)
* **Visual Trend:** Shows a distinct, localized "cloud" or cluster that is partially separated from the main mass of the other two categories.
* **Spatial Distribution:**
* Primary cluster is located in the upper-left quadrant, specifically between $x \in [-15, -5]$ and $y \in [10, 18]$.
* Secondary scattering occurs across the center ($y \approx 0$ to $10$).
* Relatively few points are found in the bottom-right quadrant compared to the Green and Blue series.
---
## 4. Key Observations and Data Patterns
1. **Clustering:** There is a notable cluster of "Unasso. Hallu." (Red) in the top-left that is relatively pure, indicating these instances have distinct characteristics from factual associations.
2. **Overlap:** The "Factual Asso." (Green) and "Asso. Hallu." (Blue) categories are heavily mixed, particularly in the region where $x > 0$. This suggests that the model's internal representation of factual data and associative hallucinations are very similar.
3. **Outliers:** There are isolated points of all three colors on the far left ($x \approx -22, y \approx 0$), indicating a small group of data points that are mathematically distant from the main distribution.
4. **Density:** The highest density of points occurs in the central-lower region ($x \in [-10, 15], y \in [-15, 5]$), where all three categories converge.
</details>
Figure 22: t-SNE visualization of subject tokens’ representations at layer 11 of Mistral-7B-v0.3.
Appendix C More Visualization on Hidden States
In this section, we provide t-SNE visualizations of subject tokens’ hidden states in Figure 21 and Figure 22.
Compared to the last-token representations, the t-SNE visualizations of subject-token hidden states show that unassociated hallucinations (UHs) are moderately separated from factual and associated samples, but the separation is less distinct than in the last-token representations. This observation aligns with the results in § 5, where hallucination detection based on last-token hidden states outperforms detection based on subject-token representations.
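The qualitative claim above, that UH points separate more cleanly in last-token space than in subject-token space, can be quantified with a cluster-separation metric such as the silhouette score over the binary split (UH vs. the rest). The hidden states below are synthetic stand-ins under that assumption; real representations would replace them:

```python
# Compare UH separability across two representation spaces via
# silhouette score. Larger score = cleaner separation. The mean
# shifts (4 vs. 1) are chosen to mimic clear vs. moderate separation.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
uh_flag = np.repeat([0, 1], 100)  # 0 = factual/associated, 1 = unassociated

# Last-token-like space: UH shifted far from the rest.
last_tok = np.vstack([rng.normal(0, 1, (100, 16)), rng.normal(4, 1, (100, 16))])
# Subject-token-like space: smaller shift, weaker separation.
subj_tok = np.vstack([rng.normal(0, 1, (100, 16)), rng.normal(1, 1, (100, 16))])

s_last = silhouette_score(last_tok, uh_flag)
s_subj = silhouette_score(subj_tok, uh_flag)
print(s_last > s_subj)  # True under these assumptions
```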