# LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
> Corresponding author; Work partially done during internship at Apple.
Abstract
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that, contrary to prior claims, truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation. Our code is available at https://github.com/technion-cs-nlp/LLMsKnow.
1 Introduction
The ever-growing popularity of large language models (LLMs) across many domains has brought a significant limitation to center stage: their tendency to "hallucinate", a term often used to describe the generation of inaccurate information. But what are hallucinations, and what causes them? A considerable body of research has sought to define, taxonomize, and understand hallucinations through extrinsic, behavioral analysis, primarily examining how users perceive such errors (Bang et al., 2023; Ji et al., 2023; Huang et al., 2023a; Rawte et al., 2023). However, this approach does not adequately address how these errors are encoded within the LLMs. Alternatively, another line of work has explored the internal representations of LLMs, suggesting that LLMs encode signals of truthfulness (Kadavath et al., 2022; Li et al., 2024; Chen et al., 2024, inter alia). However, these analyses were typically restricted to detecting errors (determining whether a generated output contains inaccuracies) without delving deeper into how such signals are represented and could be leveraged to understand or mitigate hallucinations.
In this work, we reveal that the internal representations of LLMs encode much more information about truthfulness than previously recognized. Through a series of experiments, we train classifiers on these internal representations to predict various features related to the truthfulness of generated outputs. Our findings reveal the patterns and types of information encoded in model representations, linking this intrinsic data to extrinsic LLM behavior. This enhances our ability to detect errors (while understanding the limitations of error detection), and may guide the development of more nuanced strategies based on error types and mitigation methods that make use of the model's internal knowledge. Our experiments are designed to be general, covering a broad array of LLM limitations. While the term "hallucinations" is widely used, it lacks a universally accepted definition (Venkit et al., 2024). Our framework adopts a broad interpretation, considering hallucinations to encompass all errors produced by an LLM, including factual inaccuracies, biases, common-sense reasoning failures, and other real-world errors. This approach enables us to draw general conclusions about model errors from a broad perspective.
Our first step is identifying where truthfulness signals are encoded in LLMs. Previous studies have suggested methods for detecting errors in LLM outputs using intermediate representations, logits, or probabilities, implying that LLMs may encode signals of truthfulness (Kadavath et al., 2022; Li et al., 2024; Chen et al., 2024). Focusing on long-form generations, which reflect real-world usage of LLMs, our analysis uncovers a key oversight: the choice of token used to extract these signals (Section 3). We find that truthfulness information is concentrated in the exact answer tokens, e.g., "Hartford" in "The capital of Connecticut is Hartford, an iconic city…". Recognizing this nuance significantly improves error detection strategies across the board, revealing that truthfulness encoding is stronger than previously observed.
From this point forward, we concentrate on our most effective strategy: a classifier trained on intermediate LLM representations within the exact answer tokens, referred to as "probing classifiers" (Belinkov, 2021). This approach helps us explore what these representations reveal about LLMs. Our demonstration that a trained probing classifier can predict errors suggests that LLMs encode information related to their own truthfulness. However, we find that probing classifiers do not generalize across different tasks (Section 4). Generalization occurs only within tasks requiring similar skills (e.g., factual retrieval), indicating that the truthfulness information is "skill-specific" and varies across different tasks. For tasks involving different skills, e.g., sentiment analysis, these classifiers are no better, or even worse, than logit-based uncertainty predictors, challenging the idea of a "universal truthfulness" encoding proposed in previous work (Marks & Tegmark, 2023; Slobodkin et al., 2023). Instead, our results indicate that LLMs encode multiple, distinct notions of truth. Thus, deploying trainable error detectors in practical applications should be undertaken with caution.
We next find evidence that LLMs encode not only error detection signals but also more nuanced information about error types. Delving deeper into errors within a single task, we taxonomize its errors based on responses across repeated samples (Section 5). For example, the same error being consistently generated is different from an error that is generated occasionally among many other distinct errors. Using a different set of probing classifiers, we find that error types are predictable from the LLM representations, drawing a connection between the model's internal representations and its external behavior. This classification offers a more nuanced understanding of errors, enabling developers to predict error patterns and implement more targeted mitigation strategies.
Finally, we find that the truthfulness signals encoded in LLMs can also differentiate between correct and incorrect answers for the same question (Section 6). Results highlight a significant misalignment between an LLM's internal representations and its external behavior in some cases. The model's internal encoding may identify the correct answer, yet it frequently generates an incorrect response. This discrepancy reveals that the LLM's external behavior may misrepresent its abilities, potentially pointing to new strategies for reducing errors by utilizing its existing strengths. Overall, our model-centric framework provides a deeper understanding of LLM errors, suggesting potential directions for improvements in error analysis and mitigation.
2 Background
Defining and characterizing LLM errors.
The term "hallucinations" is widely used across various subfields such as conversational AI (Liu et al., 2022), abstractive summarization (Zhang et al., 2019), and machine translation (Wang & Sennrich, 2020), each interpreting the term differently. Yet, no consensus exists on defining hallucinations: Venkit et al. (2024) identified 31 distinct frameworks for conceptualizing hallucinations, revealing the diversity of perspectives. Research efforts aim to define and taxonomize hallucinations, distinguishing them from other error types (Liu et al., 2022; Ji et al., 2023; Huang et al., 2023a; Rawte et al., 2023). On the other hand, recent scholarly conversations introduce terms like "confabulations" (Millidge, 2023) and "fabrications" (McGowan et al., 2023), attributing a possible "intention" to LLMs, although the notions of LLM "intention" and other human-like traits are still debated (Salles et al., 2020; Serapio-García et al., 2023; Harnad, 2024). These categorizations, however, adopt a human-centric view by focusing on the subjective interpretations of LLM hallucinations, which does not necessarily reflect how these errors are encoded within the models themselves. This gap limits our ability to address the root causes of hallucinations, or to reason about their nature. For example, it is unclear whether conclusions about hallucinations defined in one framework can be applied to another framework. Liang et al. (2024) defined hallucinations as inconsistencies with the training data. While this approach engages with the possible root causes of hallucinations, our study focuses on insights from the model itself, without requiring training data access. Instead, we adopt a broad interpretation of hallucinations. Here, we define hallucinations as any type of error generated by an LLM, including factual inaccuracies, biases, failures in common-sense reasoning, and others.
Another line of research suggests that LLMs either encode information about their own errors (Kadavath et al., 2022; Azaria & Mitchell, 2023) or exhibit discrepancies between their outputs and internal representations (Liu et al., 2023; Gottesman & Geva, 2024), indicating the presence of underlying mechanisms not reflected in their final outputs. Moreover, Yona et al. (2024) found that current LLMs fail to effectively convey their uncertainty through their generated outputs. Hence, we propose shifting the focus from human-centric interpretations of hallucinations to a model-centric perspective, examining the model's intermediate activations.
Error detection in LLMs.
Error detection is a longstanding task in NLP, crucial for maintaining high standards in various practical applications and for constructing more reliable systems that ensure user trust (Bommasani et al., 2021). Over the years, many studies have proposed task-specific solutions (see Section A.1). However, the recent shift towards general-purpose LLMs necessitates a holistic detection approach capable of handling any error type these models generate, rather than focusing on specific ones.
A line of work has addressed this challenge by leveraging external knowledge sources (Lewis et al., 2020; Gao et al., 2023) or an external LLM judge (Lin et al., 2021; Rawte et al., 2023) to identify erroneous outputs. On the other hand, our work focuses on detection methods that rely solely on the computations of the LLM itself: specifically, output logits, probabilities after softmax, and hidden states.
Error detection in LLMs is also closely linked to uncertainty estimation, where low certainty signals potential inaccuracies and possible errors. Popular methods to derive calibrated confidence include inspecting the model logit output values (Varshney et al., 2023; Taubenfeld et al., 2025), agreement across multiple sampled answers (Kuhn et al., 2023; Manakul et al., 2023; Tian et al., 2023a), verbalized probability (Tian et al., 2023b), and direct prompting (Kadavath et al., 2022).
Another line of work trains probing classifiers to discover and utilize truthfulness features. This approach has shown some success by probing the final token of an answer, either generated (Kadavath et al., 2022; Snyder et al., 2023; Yuksekgonul et al., 2023; Zou et al., 2023; Yin et al., 2024; Chen et al., 2024; Simhi et al., 2024; Gekhman et al., 2025) or given (Li et al., 2024; Marks & Tegmark, 2023; Burns et al., 2022; Azaria & Mitchell, 2023; Rateike et al., 2023). Others probe the final token of the prompt, before the response is generated (Slobodkin et al., 2023; Snyder et al., 2023; Simhi et al., 2024; Gottesman & Geva, 2024). Many previous studies simplify the analysis by generating answers in a few-shot setting or limiting generation to a single token. In contrast, we simulate real-world usage of LLMs by allowing unrestricted answer generation. By probing exact answer tokens, we achieve significant improvements in error detection.
3 Better Error Detection
This section presents our experiments on detecting LLM errors through their own computations, focusing on the impact of token selection and introducing a method that outperforms existing approaches.
3.1 Task Definition
Given an LLM $M$ , an input prompt $p$ and the LLM-generated response $\hat{y}$ , the task is to predict whether $\hat{y}$ is correct or wrong. We assume that there is access to the LLM's internal states (i.e., white-box setting), but no access to any external resources (e.g., search engine or additional LLMs).
We use a dataset $D=\{(q_{i},y_{i})\}_{i=1}^{N}$ , consisting of $N$ question-label pairs, where $\{q_{i}\}_{i=1}^{N}$ represents a series of questions (e.g., "What is the capital of Connecticut?") and $\{y_{i}\}_{i=1}^{N}$ the corresponding ground-truth answers ("Hartford"). For each question $q_{i}$ , we prompt the model $M$ to generate a response $\hat{y}_{i}$ , resulting in the set of predicted answers $\{\hat{y}_{i}\}_{i=1}^{N}$ ("The capital of Connecticut is Hartford…"). Next, to build our error-detection dataset, we evaluate the correctness of each generated response $\hat{y}_{i}$ by comparing it to the ground-truth label $y_{i}$ . This comparison yields a correctness label $z_{i}\in\{0,1\}$ ( $1$ for correct, $0$ for wrong). The comparison can be done either via automatic heuristics or with the assistance of an instruct-LLM; for most datasets we use heuristics to predict correctness (see Appendix A.2). Our error detection dataset is $\{(q_{i},\hat{y}_{i},z_{i})\}_{i=1}^{N}$ . Note that this dataset is defined based on the analyzed LLM and its generated answers. Any instances where the LLM refuses to answer are excluded, as these can easily be classified as incorrect.
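As a concrete illustration of this labeling step, the sketch below assigns $z_i$ with a lenient substring heuristic. This is our own minimal example: `normalize` and the containment check are assumptions, not the paper's exact heuristics, which are described in Appendix A.2.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def correctness_label(generated: str, gold: str) -> int:
    """Heuristic z_i: 1 if the gold answer appears in the generation, else 0."""
    return int(normalize(gold) in normalize(generated))

def build_detection_dataset(questions, generations, golds):
    """Assemble {(q_i, y_hat_i, z_i)}; refusals are assumed filtered upstream."""
    return [(q, y_hat, correctness_label(y_hat, gold))
            for q, y_hat, gold in zip(questions, generations, golds)]
```

A substring match like this is deliberately forgiving about phrasing ("is Hartford." vs. "Hartford"), which is why datasets with free-form answers may instead require an instruct-LLM judge.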
3.2 Experimental Setup
Datasets and models.
We perform all experiments on four LLMs: Mistral-7b (Jiang et al., 2023), Mistral-7b-instruct-v0.2 (denoted Mistral-7b-instruct), Llama3-8b (Touvron et al., 2023), and Llama3-8b-instruct. We consider 10 different datasets spanning various domains and tasks: TriviaQA (Joshi et al., 2017), HotpotQA with/without context (Yang et al., 2018), Natural Questions (Kwiatkowski et al., 2019), Winobias (Zhao et al., 2018), Winogrande (Sakaguchi et al., 2021), MNLI (Williams et al., 2018), Math (Sun et al., 2024), IMDB review sentiment analysis (Maas et al., 2011), and a dataset of movie roles (movies) that we curate. We allow unrestricted response generation to mimic real-world LLM usage, with answers decoded greedily. For more details on the datasets and the prompts used to generate answers, refer to Appendix A.3.
Performance metric.
We measure the area under the ROC curve (AUC) to evaluate error detectors, providing a single metric that reflects their ability to distinguish between positive and negative cases across all decision thresholds, trading off the true positive rate against the false positive rate.
Error detection methods. We compare methods from both uncertainty and hallucinations literature.
- Aggregated probabilities / logits: Previous studies (Guerreiro et al., 2023; Kadavath et al., 2022; Varshney et al., 2023; Huang et al., 2023b) aggregate output token probabilities or logits to score LLM confidence for error detection. We implement several methods from the literature, calculating the minimum, maximum, or mean of these values. The main paper reports results for the most common approach, Logits-mean, and the best-performing one, Logits-min, with additional baselines in Appendix B.
- P(True): Kadavath et al. (2022) showed that LLMs are relatively calibrated when asked to evaluate the correctness of their generation via prompting. We implement this evaluation using the same prompt.
- Probing: Probing classifiers involve training a small classifier on a modelâs intermediate activations to predict features of processed text (Belinkov, 2021). Recent studies show their effectiveness for error detection in generated text (Kadavath et al., 2022, inter alia). An intermediate activation is a vector $h_{l,t}$ from a specific LLM layer $l$ and (either read or generated) token $t$ . Thus, each LLM generation produces multiple such activations. Following prior work, we use a linear probing classifier for error detection (Li et al., 2024, inter alia) on static tokens: the last generated token ( $h_{l,-1}$ ), the one before it ( $h_{l,-2}$ ), and the final prompt token ( $h_{l,k}$ ). The layer $l$ is selected per token based on validation set performance.
For further details on the implementation of each method, refer to Appendix A.4.
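To make the logit-aggregation baselines and the probing setup concrete, here is a minimal sketch. It is our own illustration under stated assumptions: the paper's probes are linear classifiers over hidden states $h_{l,t}$, but the optimizer, learning rate, and step count below are arbitrary choices, not the paper's settings (those are in Appendix A.4).

```python
import numpy as np

def logits_min_score(token_logprobs):
    """Logits-min baseline: confidence is the least likely generated token."""
    return float(np.min(token_logprobs))

def logits_mean_score(token_logprobs):
    """Logits-mean baseline: average log-probability over generated tokens."""
    return float(np.mean(token_logprobs))

class LinearProbe:
    """Logistic-regression probe on activation vectors h_{l,t}."""
    def __init__(self, dim, lr=0.1, steps=500):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr, self.steps = lr, steps

    def fit(self, H, z):
        """Batch gradient descent on the binary cross-entropy loss."""
        H = np.asarray(H, dtype=float)
        z = np.asarray(z, dtype=float)
        for _ in range(self.steps):
            p = 1.0 / (1.0 + np.exp(-(H @ self.w + self.b)))
            g = p - z  # gradient of the loss w.r.t. the logit
            self.w -= self.lr * (H.T @ g) / len(z)
            self.b -= self.lr * g.mean()
        return self

    def predict_proba(self, H):
        """Probability that the generation at this token is correct."""
        H = np.asarray(H, dtype=float)
        return 1.0 / (1.0 + np.exp(-(H @ self.w + self.b)))
```

In use, `H` would hold one activation vector per example, extracted from a fixed layer $l$ at a fixed token position; the layer is chosen by validation performance, as described above.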
<details>
<summary>x1.png Details</summary>

Illustration of a TriviaQA-style interaction: the prompt `<s> [INST] What is the capital of the U.S. state of Connecticut? [/INST]` and the Mistral response "The capital city of the U.S. state of Connecticut is Hartford. It's one of the oldest cities in the United States and was founded in 1635. Hartford is located in the central part of the state and is home to several cultural institutions, universities, and businesses.</s>". Annotations mark the probe-able token positions: `last_q_token` at the end of the prompt, `first_exact_answer_token` and `last_exact_answer_token` on "Hartford", the special tokens `<s>`, `</s>`, `[INST]`, `[/INST]`, and the final generated positions -2 and -1.
</details>
Figure 1: Example of an input and LLM output from the TriviaQA dataset, with the names of the tokens that can be probed.
Exact Answer Tokens.
Existing methods often overlook a critical nuance: which token is used to extract truthfulness signals, typically defaulting to the last generated token or to a mean over all tokens. However, since LLMs typically generate long-form responses, this practice may miss crucial details (Brunner et al., 2020). Other approaches use the last token of the prompt (Slobodkin et al., 2023, inter alia), but this is inherently inaccurate due to LLMs' unidirectional nature, failing to account for the generated response and missing cases where different sampled answers from the same model vary in correctness. We investigate a previously unexamined token location: the exact answer tokens, which represent the most meaningful parts of the generated response. We define exact answer tokens as those whose modification alters the answer's correctness, disregarding subsequent generated content. In practice, we do not use this definition for extracting the exact answer, but rather an instruct model in a few-shot setting. Still, the definition is useful to manually verify that automatic extractions work as expected. Figure 1 illustrates the different token locations. In the following experiments, we implement an "exact answer" version of each error detection method, demonstrating that it often improves performance, especially in probing. Implementation details for detecting the exact answer token are given in Appendix A.2.
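Once the exact answer string has been extracted (in the paper, by a few-shot instruct model), it still has to be mapped back to token positions in the generation. One simple way is character-offset alignment; the helper below is a hypothetical sketch of that mapping step, not the paper's implementation (Appendix A.2).

```python
def exact_answer_token_span(tokens, answer):
    """Return (first, last) indices of the tokens covering the first occurrence
    of `answer` in the detokenized generation, or None if it does not appear.
    Assumes detokenization is simple string concatenation of the token pieces."""
    text, offsets = "", []
    for tok in tokens:
        offsets.append(len(text))  # character offset where each token starts
        text += tok
    start = text.find(answer)
    if start == -1:
        return None
    end = start + len(answer)
    first = max(i for i, off in enumerate(offsets) if off <= start)
    last = max(i for i, off in enumerate(offsets) if off < end)
    return first, last
```

For the Figure 1 example, the answer "Hartford" may span two subword tokens ("Hart", "ford"), so both a first and a last exact-answer token position are recorded.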
3.3 Results
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/triviaqa_auc.png Details</summary>

Probing AUC heatmap for TriviaQA (Mistral-7b-instruct): layers 0-30 on the y-axis versus probed token on the x-axis (last_q, first_answer, second_answer, exact_answer_before_first, exact_answer_first, exact_answer_last, and positions -8 to -1), with AUC shaded from 0.5 (light) to 1.0 (dark). AUC peaks for exact_answer_first in the middle layers (roughly 14-16), while last_q and the trailing generated positions score consistently lower.
</details>
(a) TriviaQA
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/winobias_auc.png Details</summary>

Probing AUC heatmap for Winobias (Mistral-7b-instruct): layers 0-30 on the y-axis versus probed token on the x-axis (last_q, first_answer, second_answer, exact_answer_before_first, exact_answer_first, exact_answer_last, and positions -8 to -1), with AUC shaded from 0.5 to 1.0. Answer-related tokens score highest in the lower-to-middle layers, positions -8 to -1 score consistently lower, and a highlighted rectangle (roughly tokens first_answer through -1, layers 4-28) marks the region of strongest probing performance.
</details>
(b) Winobias
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/answerable_math_auc.png Details</summary>

### Visual Description
## Heatmap: Attention Weights by Layer and Token
### Overview
The image presents a heatmap visualizing attention weights. The heatmap displays the relationship between layers (vertical axis) and tokens (horizontal axis). The color intensity represents the magnitude of the attention weight, ranging from 0.5 (lightest) to 1.0 (darkest).
### Components/Axes
* **X-axis (Horizontal):** "Token" - Represents different tokens. The tokens are labeled as: "last\_q", "first\_answer", "second\_answer", "exact\_answer\_before\_first", "exact\_answer\_first", "exact\_answer\_last", and tokens numbered -8 to -1.
* **Y-axis (Vertical):** "Layer" - Represents the layer number, ranging from 0 to 30.
* **Color Scale (Right):** Represents the probe's AUC. The scale ranges from 0.5 (light blue/white) to 1.0 (dark blue).
* **Legend:** Located on the right side of the heatmap, providing a color-to-value mapping for the AUC.
### Detailed Analysis
The probe's AUC varies across layers and tokens.
* **Token "last\_q":** Exhibits high AUC (close to 1.0) in the initial layers (0-8). The AUC decreases as the layer number increases, dropping to around 0.6-0.7 in the higher layers (20-30).
* **Token "first\_answer":** Shows a similar trend to "last\_q", with high AUC in the lower layers and decreasing AUC in the higher layers.
* **Token "second\_answer":** Displays a similar pattern to "first\_answer".
* **Token "exact\_answer\_before\_first":** Shows moderate AUC (around 0.7-0.8) across most layers, with a slight increase in the middle layers (10-20).
* **Token "exact\_answer\_first":** Exhibits the highest AUC (close to 1.0) across a broad range of layers (approximately 4-24). This is the most prominent feature of the heatmap.
* **Token "exact\_answer\_last":** Shows a similar pattern to "exact\_answer\_first", with AUC close to 1.0 across layers 4-24.
* **Tokens -8 to -1:** Generally exhibit lower AUC (around 0.5-0.7) across all layers, with a slight increase in the middle layers (10-20) for some tokens.
Here's a more granular breakdown of approximate AUC values at specific layer/token combinations:
* Layer 0, Token "exact\_answer\_first": ~0.95
* Layer 8, Token "exact\_answer\_first": ~0.98
* Layer 16, Token "exact\_answer\_first": ~0.97
* Layer 24, Token "exact\_answer\_first": ~0.95
* Layer 30, Token "exact\_answer\_first": ~0.85
* Layer 0, Token "last\_q": ~0.95
* Layer 30, Token "last\_q": ~0.6
* Layer 0, Token "-8": ~0.55
* Layer 30, Token "-8": ~0.65
### Key Observations
* The tokens "exact\_answer\_first" and "exact\_answer\_last" consistently yield the highest probe AUC across a significant portion of the layers.
* The AUC for "last\_q", "first\_answer", and "second\_answer" decreases as the layer number increases.
* The tokens numbered -8 to -1 generally yield lower AUC than the other tokens.
* For several tokens there is a clear gradient, with higher AUC in the lower layers and decreasing AUC in the higher layers.
### Interpretation
This heatmap shows the error-detection performance of linear probes trained on the model's hidden states at each layer and token position. The high AUC at "exact\_answer\_first" and "exact\_answer\_last" indicates that truthfulness information is concentrated at the exact answer tokens, particularly across the middle layers. The decreasing AUC for "last\_q", "first\_answer", and "second\_answer" at higher layers suggests that the signal available immediately after the prompt weakens as the representation evolves, while the lower AUC for the trailing tokens (-8 to -1) shows that late generated tokens carry a weaker truthfulness signal than the exact answer itself.
The heatmap thus visualizes where, across layers and token positions, the model's internal states most clearly encode whether its generated answer is correct. The strong signal at the exact answer tokens motivates probing those positions for error detection.
</details>
(c) Math
Figure 2: AUC values of a probe error detector across layers and tokens, Mistral-7b-instruct. Generation proceeds from left to right, with detection performance peaking at the exact answer tokens.
Patterns of truthfulness encoding.
We first focus on probing classifiers to gain insights into the internal representations of LLMs. Specifically, we analyze the effects of layer and token selection on the error detection performance of these probing classifiers. By systematically probing all model layers, starting from the last question token to the final generated token, we observe consistent truthfulness encoding patterns. Figure 2 shows AUC metrics of probes across Mistral-7b-Instruct layers and tokens. Middle to later layers often yield the most effective probing results (see Appendix B for more datasets and models), aligning with previous studies on truthfulness encoding (Burns et al., 2022; CH-Wang et al., 2023) and transformer representations (nostalgebraist, 2020; Meng et al., 2022; Geva et al., 2023). Regarding tokens, a strong truthfulness signal appears immediately after the prompt, suggesting that this representation encodes information on the model's general ability to answer the question correctly. This signal weakens as text generation progresses but peaks again at the exact answer tokens. Towards the end of the generation process, signal strength rises again, though it remains weaker than at the exact answer tokens. These patterns are consistent across nearly all datasets and models (see Appendix B), suggesting a general mechanism by which LLMs encode and process truthfulness during text generation.
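The layer-and-token probing sweep described above can be sketched as follows. This is a minimal illustration on synthetic data: the array shapes, the injected signal, and all names are our own stand-ins for real hidden states extracted from the LLM.

```python
# Sketch of the probing sweep: train a linear probe at every (layer, token)
# position and record its error-detection AUC. Shapes and data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, n_layers, n_tokens, d_model = 400, 4, 3, 32

# labels[i] marks whether example i was answered correctly;
# hidden[i, l, t] is the hidden-state vector at layer l, token position t.
labels = rng.integers(0, 2, n_examples)
hidden = rng.normal(size=(n_examples, n_layers, n_tokens, d_model))
# Inject a weak truthfulness signal at one (layer, token) position so the
# sweep has something to find (a stand-in for the exact answer token).
hidden[:, 2, 1, 0] += 2.0 * labels

auc = np.zeros((n_layers, n_tokens))
for l in range(n_layers):
    for t in range(n_tokens):
        X_tr, X_te, y_tr, y_te = train_test_split(
            hidden[:, l, t], labels, test_size=0.3, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        auc[l, t] = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])

best_layer, best_token = np.unravel_index(auc.argmax(), auc.shape)
print(best_layer, best_token)  # should recover the injected position
```

Plotting the `auc` array as a heatmap reproduces the kind of layer-by-token view shown in Figure 2.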
Error Detection Results.
Table 1: Comparison of error detection techniques using AUC metric, across different models and datasets. The best-performing method is bolded. Using exact answer tokens is useful for many cases, especially probing.
| | Mistral-7b-Instruct | | | Llama 3-8b-Instruct | | |
| --- | --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | TriviaQA | Winobias | Math |
| Logits-mean | $0.60$ $± 0.009$ | $0.56$ $± 0.017$ | $0.55$ $± 0.029$ | $0.66$ $± 0.005$ | $0.60$ $± 0.026$ | $0.75$ $± 0.018$ |
| Logits-mean-exact | $0.68$ $± 0.007$ | $0.54$ $± 0.012$ | $0.51$ $± 0.005$ | $0.71$ $± 0.006$ | $0.55$ $± 0.019$ | $0.80$ $± 0.021$ |
| Logits-min | $0.63$ $± 0.008$ | $0.59$ $± 0.012$ | $0.51$ $± 0.017$ | $0.74$ $± 0.007$ | $0.61$ $± 0.024$ | $0.75$ $± 0.016$ |
| Logits-min-exact | $0.75$ $± 0.006$ | $0.53$ $± 0.013$ | $0.71$ $± 0.009$ | $0.79$ $± 0.006$ | $0.61$ $± 0.019$ | $0.89$ $± 0.018$ |
| p(True) | $0.66$ $± 0.006$ | $0.45$ $± 0.021$ | $0.48$ $± 0.022$ | $0.73$ $± 0.008$ | $0.59$ $± 0.020$ | $0.62$ $± 0.017$ |
| p(True)-exact | $0.74$ $± 0.003$ | $0.40$ $± 0.021$ | $0.60$ $± 0.025$ | $0.73$ $± 0.005$ | $0.63$ $± 0.014$ | $0.59$ $± 0.018$ |
| Probe @ token | | | | | | |
| Last generated [-1] | $0.71$ $± 0.006$ | $0.82$ $± 0.004$ | $0.74$ $± 0.008$ | $0.81$ $± 0.005$ | $0.86$ $± 0.007$ | $0.82$ $± 0.016$ |
| Before last generated [-2] | $0.73$ $± 0.004$ | $0.85$ $± 0.004$ | $0.74$ $± 0.007$ | $0.75$ $± 0.005$ | $0.88$ $± 0.005$ | $0.79$ $± 0.020$ |
| End of question | $0.76$ $± 0.008$ | $0.82$ $± 0.011$ | $0.72$ $± 0.007$ | $0.77$ $± 0.007$ | $0.80$ $± 0.018$ | $0.72$ $± 0.023$ |
| Exact | $\mathbf{0.85}$ $± 0.004$ | $\mathbf{0.92}$ $± 0.005$ | $\mathbf{0.92}$ $± 0.008$ | $\mathbf{0.83}$ $± 0.002$ | $\mathbf{0.93}$ $± 0.004$ | $\mathbf{0.95}$ $± 0.027$ |
Next, we evaluate various error detection methods by comparing their performance with and without the use of exact answer tokens. Table 1 compares AUC across three representative datasets (additional datasets and models in Appendix B show consistent patterns). Here we present results for the last exact answer token, which outperformed both the first exact answer token and the token preceding it, while the token following the last performed similarly. Incorporating the exact answer token improves nearly all error detection methods across almost all datasets. Notably, our probing technique (bottom row) consistently outperforms all other baselines across the board. While we did not compare all existing error detection methods, the primary conclusion is that information about truthfulness is highly localized in specific generated tokens, and that focusing on exact answer tokens leads to significant improvements in error detection.
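A baseline in the spirit of Logits-min-exact can be sketched as below: score each answer by the minimum token probability over its exact answer tokens, then measure AUC against correctness. The data here is synthetic and the function names are illustrative, not from the paper's code.

```python
# Sketch of a min-over-exact-answer-tokens baseline on synthetic data:
# correct answers are simulated to have higher per-token probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def min_exact_score(token_probs, exact_span):
    """token_probs: per-token probabilities of the generated answer;
    exact_span: (start, end) indices of the exact answer tokens."""
    s, e = exact_span
    return float(np.min(token_probs[s:e]))

labels = rng.integers(0, 2, 500)  # 1 = answer was correct
scores = []
for y in labels:
    # Higher Beta parameters when correct -> higher token probabilities.
    probs = rng.beta(5 + 4 * y, 3, size=8)
    scores.append(min_exact_score(probs, (2, 6)))  # pretend tokens 2..5 are the exact answer

print(round(roc_auc_score(labels, scores), 2))
```

The same scoring function applied to all answer tokens (rather than the exact-answer span) would correspond to the plain Logits-min variant in Table 1.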
4 Generalization Between Tasks
The effectiveness of a probing classifier in detecting errors suggests that LLMs encode information about the truthfulness of their outputs. This supports using probing classifiers for error detection in production, but their generalizability across tasks remains unclear. While some studies argue for a universal mechanism of truthfulness encoding in LLMs (Marks & Tegmark, 2023; Slobodkin et al., 2023), results on probe generalization across datasets are mixed (Kadavath et al., 2022; Marks & Tegmark, 2023; CH-Wang et al., 2023; Slobodkin et al., 2023; Levinstein & Herrmann, 2024): these works observe a decline in performance, yet one that remains significantly above random chance. Understanding this is essential for real-world applications, where the error detector may encounter examples that differ significantly from those it was trained on. We therefore explore whether a probe trained on one dataset can detect errors in others.
Our generalization experiments are conducted between all of the ten datasets discussed in Section 3, covering a broader range of realistic task settings than previous work. This breadth of experiments has not been previously explored, and is crucial considering the mixed findings in previous work. We select the optimal token and layer combination for each dataset, train probes with this combination on every other dataset, and then test them on the original dataset. We evaluate generalization performance using the absolute AUC score, defined as $\max(\text{auc},1-\text{auc})$, to also account for cases where the signal learned on one dataset is reversed in another.
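The protocol can be sketched as follows, using a synthetic pair of datasets in which the truthfulness direction is deliberately reversed to show why the absolute AUC is needed; all shapes and names are illustrative.

```python
# Sketch of the cross-dataset protocol: train a probe on dataset A, test on
# dataset B, and report max(auc, 1 - auc) so a reversed direction still
# counts as transferred signal. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

def make_dataset(direction, n=300, d=16):
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, d))
    X[:, 0] += direction * 1.5 * y  # truthfulness direction, possibly flipped
    return X, y

X_a, y_a = make_dataset(+1)  # train dataset
X_b, y_b = make_dataset(-1)  # test dataset with the sign reversed

probe = LogisticRegression(max_iter=1000).fit(X_a, y_a)
auc = roc_auc_score(y_b, probe.predict_proba(X_b)[:, 1])
abs_auc = max(auc, 1 - auc)
print(round(auc, 2), round(abs_auc, 2))
```

Here the raw AUC falls below 0.5 because the learned direction is inverted on the test dataset, while the absolute AUC recovers the transferred signal.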
Results.
<details>
<summary>extracted/6450693/figures/generalization/mistral_instruct.png Details</summary>

### Visual Description
## Heatmap: Cross-Dataset Transfer Performance
### Overview
This image presents a heatmap visualizing cross-dataset generalization of probe-based error detectors. Each cell reports the (absolute) AUC achieved when a probe trained on one dataset (rows) is evaluated on another dataset (columns). The color intensity represents the AUC, with warmer colors (reds) indicating higher performance and cooler colors (blues) indicating lower performance.
### Components/Axes
* **X-axis:** "Test dataset" - Lists the datasets used for testing: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA\_WC, NQ\_WC.
* **Y-axis:** "Train dataset" - Lists the datasets used for training: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA\_WC, NQ\_WC.
* **Color Scale (Legend):** Located on the right side of the heatmap. Ranges from 0.0 (blue) to 1.0 (red), representing the performance metric. The scale is linear.
* **Cells:** Each cell represents the performance metric when a model trained on the corresponding row dataset is tested on the corresponding column dataset.
### Detailed Analysis
The heatmap contains 10x10 cells, each with a numerical value representing the performance metric. Here's a breakdown of the values, reading row by row:
* **TriviaQA:**
* TriviaQA - 0.86
* HotpotQA - 0.72
* Movies - 0.78
* Winobias - 0.55
* Winogrande - 0.57
* NLI - 0.63
* IMDB - 0.68
* Math - 0.80
* HotpotQA\_WC - 0.59
* NQ\_WC - 0.70
* **HotpotQA:**
* TriviaQA - 0.80
* HotpotQA - 0.85
* Movies - 0.78
* Winobias - 0.61
* Winogrande - 0.58
* NLI - 0.58
* IMDB - 0.87
* Math - 0.67
* HotpotQA\_WC - 0.62
* NQ\_WC - 0.65
* **Movies:**
* TriviaQA - 0.74
* HotpotQA - 0.69
* Movies - 0.82
* Winobias - 0.50
* Winogrande - 0.52
* NLI - 0.53
* IMDB - 0.81
* Math - 0.67
* HotpotQA\_WC - 0.57
* NQ\_WC - 0.71
* **Winobias:**
* TriviaQA - 0.54
* HotpotQA - 0.59
* Movies - 0.52
* Winobias - 0.92
* Winogrande - 0.73
* NLI - 0.64
* IMDB - 0.91
* Math - 0.52
* HotpotQA\_WC - 0.51
* NQ\_WC - 0.62
* **Winogrande:**
* TriviaQA - 0.60
* HotpotQA - 0.60
* Movies - 0.57
* Winobias - 0.61
* Winogrande - 0.84
* NLI - 0.66
* IMDB - 0.71
* Math - 0.61
* HotpotQA\_WC - 0.51
* NQ\_WC - 0.55
* **NLI:**
* TriviaQA - 0.51
* HotpotQA - 0.56
* Movies - 0.55
* Winobias - 0.56
* Winogrande - 0.66
* NLI - 0.93
* IMDB - 0.66
* Math - 0.64
* HotpotQA\_WC - 0.51
* NQ\_WC - 0.54
* **IMDB:**
* TriviaQA - 0.63
* HotpotQA - 0.54
* Movies - 0.66
* Winobias - 0.62
* Winogrande - 0.62
* NLI - 0.66
* IMDB - 0.97
* Math - 0.66
* HotpotQA\_WC - 0.51
* NQ\_WC - 0.58
* **Math:**
* TriviaQA - 0.56
* HotpotQA - 0.55
* Movies - 0.60
* Winobias - 0.57
* Winogrande - 0.51
* NLI - 0.64
* IMDB - 0.91
* Math - 0.92
* HotpotQA\_WC - 0.54
* NQ\_WC - 0.51
* **HotpotQA\_WC:**
* TriviaQA - 0.65
* HotpotQA - 0.73
* Movies - 0.56
* Winobias - 0.55
* Winogrande - 0.50
* NLI - 0.51
* IMDB - 0.92
* Math - 0.70
* HotpotQA\_WC - 0.75
* NQ\_WC - 0.67
* **NQ\_WC:**
* TriviaQA - 0.69
* HotpotQA - 0.66
* Movies - 0.68
* Winobias - 0.54
* Winogrande - 0.67
* NLI - 0.58
* IMDB - 0.94
* Math - 0.52
* HotpotQA\_WC - 0.53
* NQ\_WC - 0.87
### Key Observations
* **High Self-Transfer:** The diagonal cells (training and testing on the same dataset) show the highest or near-highest value in most rows (0.75-0.97), indicating that probes perform best on the dataset they were trained on.
* **Transfer to IMDB:** Probes trained on nearly every dataset achieve high AUC when tested on IMDB (0.66-0.94), suggesting IMDB is an easy target rather than evidence of broad generalization.
* **Winobias Performance:** Winobias shows low transfer *from* most other datasets (0.50-0.62) but high performance when trained and tested on itself (0.92); transfer from Winobias to Winogrande (0.73) is a notable exception.
* **Similar-Task Clusters:** Transfer is stronger among the factual-retrieval datasets (TriviaQA, HotpotQA, Movies: 0.69-0.80) than between unrelated tasks.
* **WC Datasets:** The "\_WC" (with-context) variants (HotpotQA\_WC and NQ\_WC) generally receive lower transfer from other datasets than their counterparts without context.
### Interpretation
This heatmap illustrates the challenge of cross-dataset generalization for probe-based error detectors. While probes achieve high AUC on the dataset they were trained on, performance degrades on other datasets, suggesting that different tasks elicit different truthfulness representations. The stronger transfer among related tasks (e.g., the factual-retrieval datasets TriviaQA, HotpotQA, and Movies) points to skill-specific rather than universal truthfulness mechanisms. The uniformly high values in the IMDB column should be interpreted with care, as much of this performance may be attainable from output logits alone. The lower transfer to the with-context variants (HotpotQA\_WC, NQ\_WC) suggests that adding retrieved context changes how truthfulness is encoded.
</details>
(a) Raw AUC values. Values above $0.5$ indicate some generalization.
<details>
<summary>extracted/6450693/figures/generalization/mistral_instruct_reduced.png Details</summary>

### Visual Description
## Heatmap: Probe AUC Minus Logit-Baseline AUC
### Overview
The image presents a heatmap of the difference between the probe's cross-dataset AUC and that of the strongest logit-based baseline (Logits-min-exact). Positive values (red) indicate that the probe generalizes beyond what the output logits provide; negative values (blue) indicate that the logit baseline performs better.
### Components/Axes
* **X-axis:** Represents the "Test dataset" with the following categories: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA_WC, NQ_WC.
* **Y-axis:** Represents the "Train dataset" with the same ten categories.
* **Color Scale (Legend):** Located on the right side of the heatmap, ranging from -0.2 (dark blue) to 0.3 (dark red). The scale is linear, with intermediate colors representing values in between.
* **Cell Values:** Each cell displays the AUC difference between the probe and the logit-based baseline for the corresponding train/test pair.
### Detailed Analysis
The heatmap contains 10x10 cells. Here's a breakdown of the values, reading row by row (train dataset vs. test datasets):
* **TriviaQA:** -0.11, -0.05, 0.04, -0.04, -0.04, 0.01, -0.19, 0.10, -0.08, 0.02
* **HotpotQA:** 0.05, 0.04, 0.02, -0.03, -0.03, -0.04, -0.05, -0.04, 0.02
* **Movies:** -0.01, -0.08, 0.08, -0.08, -0.09, -0.08, -0.06, -0.03, -0.10, 0.02
* **Winobias:** -0.21, -0.18, -0.22, 0.33, 0.12, -0.19, -0.16, -0.19, -0.16
* **Winogrande:** -0.15, -0.17, -0.17, 0.02, 0.23, 0.04, -0.16, -0.10, -0.13
* **NLI:** -0.24, -0.21, -0.19, -0.03, 0.05, 0.32, -0.21, -0.07, -0.16, -0.15
* **IMDB:** -0.12, -0.23, -0.08, 0.04, 0.01, 0.04, 0.10, -0.04, -0.16, -0.10
* **Math:** -0.19, -0.22, -0.14, -0.02, -0.10, 0.02, 0.22, -0.13, -0.18
* **HotpotQA_WC:** -0.10, -0.03, -0.19, -0.04, -0.11, 0.05, -0.05, -0.19, -0.02
* **NQ_WC:** -0.07, -0.11, -0.07, -0.04, 0.06, -0.03, 0.07, -0.19, -0.14
**Trends:**
* The largest positive differences lie on the diagonal, i.e., when training and testing on the same dataset: Winobias (0.33), NLI (0.32), Winogrande (0.23).
* Winobias also shows a positive difference when tested on Winogrande (0.12), consistent with transfer between related common-sense tasks.
* Off-diagonal values are generally small, with most falling between -0.2 and 0.1, and many are negative.
### Key Observations
* The probe exceeds the logit baseline mainly in the diagonal (in-dataset) cells; the largest difference is Winobias tested on itself (0.33).
* Several off-diagonal cells are clearly negative (e.g., training on NLI and testing on TriviaQA: -0.24), meaning the logit baseline alone generalizes at least as well as the probe there.
* Outside the diagonal, the differences are small, indicating little probe generalization beyond what the logits already provide.
### Interpretation
This heatmap isolates the probe's contribution beyond the output logits. Once the logit-based baseline is subtracted, most cross-dataset cells shrink to near zero or below, suggesting that the apparent generalization in the raw AUC heatmap largely reflects information already available in the logits rather than a universal internal encoding of truthfulness. The clearly positive diagonal shows that probes do capture dataset-specific internal signals, and the few positive off-diagonal cells (e.g., Winobias to Winogrande) are consistent with skill-specific truthfulness mechanisms shared by related tasks.
</details>
(b) Performance (AUC) difference between the probe and the logit-based method. Values above $0$ indicate generalization beyond the logit-based method.
Figure 3: Generalization between datasets, Mistral-7b-instruct. After subtracting the logit-based method's performance, we observe that most datasets show limited or no meaningful generalization.
Figure 3(a) shows the generalization results for Mistral-7b-instruct, with similar patterns observed for other LLMs in Appendix C. In this context, values above $0.5$ indicate successful generalization. At first glance, the results appear consistent with previous research: most heatmap values exceed $0.5$, implying some degree of generalization across tasks. This observation would support the existence of a universal mechanism for encoding truthfulness, since the same linear directions (captured by the probe) would encode truthfulness information across many datasets. However, upon closer inspection, it turns out that most of this performance can be achieved by logit-based truthfulness detection, which only observes the output logits. Figure 3(b) presents the same heatmap after subtracting the results of our strongest logit-based baseline (Logits-min-exact). This adjusted heatmap reveals that the probe's generalization rarely exceeds what can be achieved by examining logits alone. This suggests that the observed generalization is not due to a universal internal encoding of truthfulness; instead, it likely arises from information already available through external features, such as logits. Past evidence for generalization may therefore have been overstated.
Nonetheless, we do observe some successful generalization between tasks requiring similar skills, such as parametric factual retrieval (TriviaQA, HotpotQA, Movies) and common-sense reasoning (Winobias, Winogrande, NLI). This suggests that, although the overall pattern of truthfulness signals across tokens appeared consistent across tasks (as observed in Section 3.3), LLMs have many "skill-specific" truthfulness mechanisms rather than a universal one. However, some patterns remain unexplained, such as the asymmetric generalization from TriviaQA to Math tasks. Overall, our findings indicate that models hold a multifaceted representation of truthfulness. Just as the internal mechanisms responsible for solving distinct problems are implemented as distinct mechanisms (e.g., circuits) within models (Elhage et al., 2021; Olah et al., 2023), LLMs do not encode truthfulness through a single unified mechanism but rather through multiple mechanisms, each corresponding to a different notion of truth. Further investigation is required to disentangle these mechanisms.
5 Investigating Error Types
Having established the limitations of error detection, we now shift to error analysis. Previously, we explored types of LLM limitations across different tasks, noting both commonalities and distinctions in their error representations. In this section, we focus on the types of errors LLMs make in a specific task, TriviaQA, which represents factual errors, a commonly studied issue in LLMs (Kadavath et al., 2022; Snyder et al., 2023; Li et al., 2024; Chen et al., 2024; Simhi et al., 2024).
5.1 Taxonomy of Errors
Intuitively, not all mistakes are identical. In one case, an LLM may consistently generate an incorrect answer while considering it correct, while in another case, it could issue a best guess. To analyze errors from the LLM's perspective, we sample $K=30$ responses at a temperature setting of $T=1$ for each example in the dataset and then analyze the resulting distribution of answers. We chose $K=30$ because the overall correctness seemed to plateau around this point (see Appendix D); lower temperatures generally produced less truthful answers across repeated trials.
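As a toy illustration of this resampling setup, the sketch below draws $K$ answers from a softmax over hypothetical per-answer logits at $T=1$ and inspects the resulting answer distribution; real experiments resample full free-form generations from the LLM.

```python
# Toy resampling sketch: sample K answers at temperature T from a softmax
# over hypothetical per-answer logits, then tally the answer distribution.
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)

def sample_answers(logits, answers, K=30, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    p = np.exp(z - z.max())  # numerically stable softmax
    p /= p.sum()
    idx = rng.choice(len(answers), size=K, p=p)
    return [answers[i] for i in idx]

# Hypothetical candidate answers, echoing the Figure 4(a) example.
answers = ["underwater world", "Maya region", "Tennessee rivers"]
samples = sample_answers([3.0, 0.2, 0.2], answers, K=30)
dist = Counter(samples)
print(dist.most_common(1)[0][0])  # the answer the model generates most often
```

Repeating this per example and inspecting `dist` exposes the different error types discussed next: a dominant correct answer, a dominant wrong answer, or a flat distribution over many answers.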
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Model Output Analysis
### Overview
The image presents a diagram illustrating the output of a "Model" concerning statements about Otis Barton. The diagram shows three statements about Otis Barton, each connected to the "Model" block via arrows labeled with percentage values. Each statement is enclosed in a dashed-line box, with a checkmark or cross symbol indicating the model's assessment of the statement's accuracy.
### Components/Axes
- **Model:** A central rectangular block labeled "Model" in blue.
- **Statements:** Three dashed-line boxes containing text statements about Otis Barton.
- **Arrows:** Arrows connecting the "Model" block to each statement, labeled with percentage values: 93%, 3%, and 3%.
- **Accuracy Indicators:** A green checkmark (✓) and two red crosses (✗) within the statement boxes, indicating the model's assessment.
- **Statement 1:** "Otis Barton was a pioneer in exploring the underwater world..." - Associated with 93% and a green checkmark.
- **Statement 2:** "...best known for his excavations in the Maya region of Central America" - Associated with 3% and a red cross.
- **Statement 3:** "...Exploring the underground rivers to Tennessee..." - Associated with 3% and a red cross.
- **Initial Statement:** "Otis Barton was a pioneer in exploring where?" - This statement is on the left side of the diagram.
### Detailed Analysis or Content Details
The diagram shows the model's confidence level for each statement.
- **Statement 1:** The model assigns a 93% confidence level to the statement "Otis Barton was a pioneer in exploring the underwater world...", and marks it as correct (green checkmark).
- **Statement 2:** The model assigns a 3% confidence level to the statement "...best known for his excavations in the Maya region of Central America", and marks it as incorrect (red cross).
- **Statement 3:** The model assigns a 3% confidence level to the statement "...Exploring the underground rivers to Tennessee...", and marks it as incorrect (red cross).
- **Initial Statement:** The model is presented with the statement "Otis Barton was a pioneer in exploring where?".
### Key Observations
- The model is highly confident in the first statement regarding underwater exploration.
- The model strongly disagrees with the statements about Maya excavations and Tennessee rivers, assigning them very low confidence scores.
- The low confidence scores for the latter two statements suggest they are likely inaccurate or not well-supported by the data the model was trained on.
### Interpretation
This diagram illustrates the distribution of answers obtained by resampling the model many times on the same question: the percentages are the fractions of samples producing each answer, and the checkmarks and crosses mark whether each answer is correct. The model produces the correct answer about underwater exploration in the vast majority of samples (93%), while the incorrect answers about Maya excavations and Tennessee rivers each appear only rarely (3%). This corresponds to the error type in which the model usually answers correctly but occasionally hallucinates, suggesting the correct knowledge is present even though sampling sometimes yields a wrong answer.
</details>
(a) The LLM mostly answers correctly, but sometimes hallucinates.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: State Border Analysis
### Overview
This diagram illustrates the output of a "Model" attempting to answer the question: "Which American state borders on only one other state?". The model provides two potential answers, Missouri and Maine, along with associated confidence levels. Each answer is presented within a dashed-line box, with a visual indicator (red 'X' or green checkmark) signifying whether the answer is correct or incorrect.
### Components/Axes
The diagram consists of the following components:
* **Question Box (Left):** A dashed-line box containing the question: "Which American state borders on only one other state?".
* **Model Box (Center):** A blue rectangular box labeled "Model", representing the processing unit.
* **Answer Boxes (Right):** Two dashed-line boxes, one for Missouri and one for Maine, each containing the model's response and a correctness indicator.
* **Confidence Levels:** Numerical percentages (87% and 13%) associated with each answer, indicating the model's confidence.
* **Correctness Indicators:** A red 'X' next to the Missouri answer and a green checkmark next to the Maine answer.
### Content Details
The diagram presents the following information:
* **Question:** "Which American state borders on only one other state?"
* **Model Output 1:**
* State: Missouri
* Confidence: 87%
* Correctness: Incorrect (indicated by the red 'X')
* Text: "Missouri is the... The only state to border... is Missouri..."
* **Model Output 2:**
* State: Maine
* Confidence: 13%
* Correctness: Correct (indicated by the green checkmark)
* Text: "Maine is the... The US state that... is Maine, which..."
### Key Observations
The model incorrectly identifies Missouri as the state bordering only one other state, assigning it a high confidence level of 87%. Conversely, it correctly identifies Maine but with a low confidence level of 13%. This suggests the model is biased towards Missouri, despite it being the incorrect answer.
### Interpretation
The diagram illustrates the error type in which the model consistently generates the same wrong answer: 87% of resampled generations name Missouri (incorrect), while only 13% name Maine (correct). The model thus appears to retain some knowledge of the correct answer yet reliably produces the wrong one. The incomplete sentences within the answer boxes reflect truncated generated text, and the red 'X' and green checkmark provide a clear and immediate assessment of each answer's correctness. The low frequency of the correct answer highlights the gap between what the model knows and what it most often generates.
</details>
(b) The LLM mostly answers incorrectly, but seems to have some knowledge on the correct answer.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Model Output for First Female Football Commentator
### Overview
This diagram illustrates the output of a "Model" attempting to identify the first female to deliver football commentary on "match of the day". The model generates three potential answers, each with an associated confidence score (expressed as a percentage) and a validation status (indicated by a checkmark or cross).
### Components/Axes
* **Input:** "Who became the first female to deliver football commentary on 'match of the day'?" - Located on the left side of the diagram, enclosed in a dashed box.
* **Model:** A blue rectangular box labeled "Model" positioned centrally.
* **Outputs:** Three dashed boxes representing potential answers, each connected to the "Model" by an arrow.
* **Confidence Scores:** Percentages associated with each output (20%, 6%, 6%).
* **Validation Status:** A red circle with a cross (incorrect) or a green circle with a checkmark (correct) next to each output.
### Detailed Analysis or Content Details
The diagram presents three model outputs:
1. **Output 1:**
* **Text:** "... In 2007, Gabby Logan ..."
* **Confidence:** 20%
* **Validation:** Incorrect (Red Cross)
2. **Output 2:**
* **Text:** "The first ... is Clare Balding"
* **Confidence:** 6%
* **Validation:** Incorrect (Red Cross)
3. **Output 3:**
* **Text:** "Jackie Oatley is the first woman ..."
* **Confidence:** 6%
* **Validation:** Correct (Green Checkmark)
The arrows indicate the flow from the "Model" to each sampled answer. The percentages show that the model most frequently generated Gabby Logan (an incorrect answer, 20% of samples), while the correct answer, Jackie Oatley, appeared in only a small fraction of the samples (6%).
### Key Observations
* The model initially assigned a higher confidence score to an incorrect answer (Gabby Logan).
* The correct answer (Jackie Oatley) had the lowest confidence score among the three outputs.
* The diagram visually emphasizes the validation status using clear color-coded symbols.
### Interpretation
This diagram illustrates a model whose generation behavior favors an incorrect answer: the highest confidence (20%) is assigned to Gabby Logan, while the correct answer, Jackie Oatley, is generated only rarely and with low confidence (6%). The validation symbols mark external correctness, not the model's own judgment. The fact that the correct answer is produced at all, despite the model's preference for incorrect responses, suggests the model retains some knowledge of the topic, and it underscores the need to critically evaluate model outputs even when they are presented with high confidence.
</details>
(c) The LLM generates many different answers; the correct one appears in only a small fraction of the resamples.
Figure 4: Different error types in free-form generation, exposed when resampled many times.
Figure 4 illustrates three representative error types. In the first (Figure 4(a)), the model usually gives the correct answer but occasionally makes an error, implying the correct information is present but sampling can lead to mistakes. In the second (Figure 4(b)), the model often responds incorrectly even though it is capable of producing the right answer, indicating some retained knowledge despite a consistently repeated error. In the third (Figure 4(c)), the model generates a wide array of mostly incorrect answers, reflecting low confidence in any single answer.
More generally, we categorize the errors by logging three specific features for each example: (a) the number of different answers generated; (b) the frequency of the correct answer; and (c) the frequency of the most common incorrect answer. These features reveal the following error patterns:
- (A) Refuses to answer: The model responds that it cannot answer the question in at least half the cases.
- (B) Consistently correct: Answers correctly in at least half of the cases. This category is divided into: (B1) always correct; and (B2) mostly correct with occasional errors.
- (C) Consistently incorrect: Consistently generates the same incorrect response in at least half of the cases. Similarly to type B, we subdivide this type into (C1) correct answer is never produced; and (C2) correct answer appears at least once.
- (D) Two competing: Generates both correct and incorrect responses at similar rates (the difference in counts is 5 or less, and each response is generated at least 5 times).
- (E) Many answers: Generates over 10 distinct answers. Like types B and C, this type is subdivided into (E1) the correct answer is never generated; and (E2) the correct answer is generated at least once.
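As a concrete illustration, the three logged features and the thresholds above can be turned into a small classifier over a question's resampled answers. This is a minimal sketch with hypothetical inputs; in particular, the refusal check is a simplified stand-in for the paper's actual detection of abstentions:

```python
from collections import Counter

def classify_error_type(answers, correct_answer):
    """Assign a behavioral error type (A-E) to a question given its
    resampled answers, using the three logged features:
    (a) number of distinct answers, (b) frequency of the correct answer,
    (c) frequency of the most common incorrect answer."""
    n = len(answers)
    counts = Counter(answers)
    n_distinct = len(counts)                     # feature (a)
    n_correct = counts.get(correct_answer, 0)    # feature (b)
    wrong = {a: c for a, c in counts.items() if a != correct_answer}
    top_wrong = max(wrong.values(), default=0)   # feature (c)

    # Simplified refusal detection (placeholder heuristic).
    n_refuse = sum(c for a, c in counts.items() if "cannot answer" in a.lower())
    if n_refuse >= n / 2:
        return "A"                               # refuses to answer
    if n_correct >= n / 2:
        return "B1" if n_correct == n else "B2"  # consistently correct
    if top_wrong >= n / 2:
        return "C1" if n_correct == 0 else "C2"  # consistently incorrect
    if abs(n_correct - top_wrong) <= 5 and n_correct >= 5 and top_wrong >= 5:
        return "D"                               # two competing answers
    if n_distinct > 10:
        return "E1" if n_correct == 0 else "E2"  # many answers
    return None                                  # outside the taxonomy
```

The final `None` branch corresponds to the small fraction of errors (about 4% on TriviaQA) that the taxonomy does not cover.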
This taxonomy covers 96% of the errors in TriviaQA for Mistral-7b-instruct. For more qualitative examples of each type of error, see Appendix D.3. Although some overlap exists between types, our goal is to identify general patterns and explore their connection to the models' internal representations. For a discussion of the design choices behind this taxonomy, refer to Appendix D.1. This taxonomy classifies LLM errors based on an extrinsic, behavior-based analysis. Similarly, previous work analyzed repeated samples to assess an LLM's knowledge of the correct answer (Simhi et al., 2024; Gekhman et al., 2024). Our approach is distinct in that it also examines the nature of the errors the LLM makes. Furthermore, as we discuss next, we analyze the connection between these behavioral patterns and the model's internal encoding.
5.2 Predicting Error Types
Our taxonomy offers an external, behavioral analysis of LLMs, which we complement with an intrinsic evaluation. We explore whether LLMs encode information about potential error types within their intermediate activations, offering deeper insight into the underlying mechanisms. To investigate this, we train probes in a one-vs-rest setting, where a single probe distinguishes a specific error type from all others. We use representations extracted from the answers produced via greedy decoding.
Table 2 presents the results. Our findings show that error types can be predicted from the intermediate representations of the greedy decoding generations, suggesting that they may capture not only output correctness but also fine-grained information about potential errors. While detection performance varies between types, the predictability of each type is valuable on its own, as it opens the possibility of tailoring targeted interventions for specific error types. Additionally, although performance on error types C and D is lower, it remains well above random, providing meaningful insights. These results suggest that internal representations encode more than just binary correctness, revealing a nuanced taxonomy of error types and offering deeper insights into how these models process and encode knowledge.
Table 2: AUC scores for error type classification (TriviaQA). Error types are predictable from the inner model representations, indicating the encoding of fine-grained information on errors.
| Error type | Mistral-7b | Mistral-Instr-7b | Llama3-8b | Llama3-Instr-8b |
| --- | --- | --- | --- | --- |
| (A) Refuses to answer | $0.86\scriptscriptstyle{± 0.002}$ | $0.85\scriptscriptstyle{± 0.011}$ | $0.87\scriptscriptstyle{± 0.002}$ | $0.88\scriptscriptstyle{± 0.014}$ |
| (B) Consistently correct | $0.88\scriptscriptstyle{± 0.001}$ | $0.82\scriptscriptstyle{± 0.008}$ | $0.86\scriptscriptstyle{± 0.001}$ | $0.81\scriptscriptstyle{± 0.002}$ |
| (C) Consistently incorrect | $0.59\scriptscriptstyle{± 0.002}$ | $0.67\scriptscriptstyle{± 0.002}$ | $0.59\scriptscriptstyle{± 0.002}$ | $0.64\scriptscriptstyle{± 0.003}$ |
| (D) Two competing | $0.63\scriptscriptstyle{± 0.002}$ | $0.68\scriptscriptstyle{± 0.006}$ | $0.61\scriptscriptstyle{± 0.001}$ | $0.65\scriptscriptstyle{± 0.004}$ |
| (E) Many answers | $0.90\scriptscriptstyle{± 0.001}$ | $0.84\scriptscriptstyle{± 0.003}$ | $0.89\scriptscriptstyle{± 0.001}$ | $0.89\scriptscriptstyle{± 0.001}$ |
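The probing setup behind Table 2 can be sketched as follows. This is an illustration only: synthetic activations stand in for real hidden states, and a logistic-regression probe per error type in a one-vs-rest arrangement is an assumed, common choice rather than necessarily the paper's exact probe architecture:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: one activation vector per greedy-decoded answer,
# plus the behavioral error type assigned to each question.
X = rng.normal(size=(2000, 64))           # e.g. a middle-layer residual stream
types = rng.choice(list("ABCDE"), size=2000)
X[types == "E"] += 0.5                    # inject a weak, detectable signal

# One-vs-rest: a separate linear probe per error type, evaluated by AUC.
split = 1500
aucs = {}
for t in "ABCDE":
    y = (types == t).astype(int)          # 1 = this error type, 0 = any other
    probe = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
    scores = probe.predict_proba(X[split:])[:, 1]
    aucs[t] = roc_auc_score(y[split:], scores)

print({t: round(a, 2) for t, a in aucs.items()})
```

In this synthetic setup only type "E" carries a real signal, so its AUC is high while the others hover near chance (0.5); with real activations, the AUC pattern of Table 2 would emerge instead.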
6 Detecting the Correct Answer
After identifying that models encode diverse truthfulness-related information, we examine how this internal truthfulness aligns with their external behavior during response generation. To this end, we use our error-detection probe (choosing the best-performing probe for each task, trained on the last exact answer token) to select an answer from a pool of 30 generated responses to the same question. We then measure the model's accuracy on the selected answers. If this accuracy does not significantly differ from traditional decoding methods (such as greedy decoding), the LLM's internal representation of truthfulness is consistent with its external behavior; in simpler terms, the model generates answers that it also internally considers correct. Conversely, if using the probe alters performance in either direction, this suggests a misalignment between the LLM's internal representations and its actual behavior.
Experimental Setup
The experiments were conducted on TriviaQA, Winobias, and Math. We resample each model answer using the same strategy described in Section 5.1. The final chosen answer is the one with the highest correctness probability, as assessed by the probe. We compare to three baselines: (1) greedy decoding; (2) random selection from the $K=30$ answer candidates; and (3) majority vote, wherein the most frequently generated answer is chosen.
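The selection procedure and two of the baselines can be sketched as follows. The representations and `toy_probe` below are hypothetical stand-ins for the exact-answer-token hidden states and the trained error-detection probe; greedy decoding is omitted since it operates at generation time rather than over the candidate pool:

```python
import numpy as np
from collections import Counter

def select_answer(answers, hidden_states, probe, rng=None):
    """Pick one of K resampled answers under three strategies.
    `probe` is any callable mapping the (K, d) answer representations
    to a length-K vector of estimated correctness probabilities."""
    rng = rng or np.random.default_rng(0)
    p_correct = np.asarray(probe(hidden_states))
    return {
        "random": answers[int(rng.integers(len(answers)))],
        "majority": Counter(answers).most_common(1)[0][0],
        "probe": answers[int(np.argmax(p_correct))],
    }

# Toy illustration (hypothetical values): the model mostly generates the
# wrong name, but the probe assigns the correct one a higher score.
answers = ["Jackie Oatley", "Gabby Logan", "Gabby Logan"]
reps = np.array([[0.9], [0.2], [0.1]])    # stand-in answer-token states
toy_probe = lambda h: h[:, 0]             # stand-in trained probe
choice = select_answer(answers, reps, toy_probe)
```

In this toy case majority vote returns the frequent but incorrect answer, while probe-based selection recovers the rare correct one, mirroring the discrepancy analyzed in this section.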
Results
The results for Mistral-7b-instruct are summarized in Figure 5, with additional results for other LLMs and datasets, as well as qualitative examples, provided in Appendix E. We only present results for error types that appear 30 times or more in our test dataset. Overall, using the probe to select answers enhances the LLM's accuracy across all examined tasks. However, the extent of improvement varies by error type. For instance, in the TriviaQA dataset, there is minimal gain in the "mostly correct" category (B2). In contrast, substantial gains (30 to 40 points in some cases) are observed in the "mostly incorrect" (C2), "two competing answers" (D), and "many answers" (E2) categories. Interestingly, and perhaps surprisingly, the probe is most effective precisely in cases where the LLM lacks any (external) preference for the correct answer during generation. The fact that the probe can effectively identify the correct answer in these scenarios points to a significant disconnect between the LLM's internal encoding and its external behavior. These results suggest that even when the model encodes information about which answer is correct, it can still generate an incorrect answer in practice.
While using the probe to select the answer proves effective, it is not proposed here as an error mitigation strategy but rather as a diagnostic tool. However, these findings indicate that further research in this area could leverage the existing knowledge within LLMs to significantly reduce errors. We recommend exploring this direction in future investigations.
<details>
<summary>extracted/6450693/figures/choose_answer/probe_choose_answer_triviaqa_mistral_instruct.png Details</summary>

### Visual Description
## Bar Chart: Model Response Analysis
### Overview
This bar chart compares four answer-selection strategies ("greedy", "random", "majority", and "probing") across several categories of response behavior. The categories correspond to the error types defined in the taxonomy, ranging from refusing to answer to providing consistently correct or incorrect answers, and scenarios with competing or multiple answers. The y-axis represents the accuracy, in percent, of the answer selected by each strategy within each category.
### Components/Axes
* **X-axis Title:** Response Type
* **Y-axis Title:** Percentage (%)
* **X-axis Categories:** "All", "Refuses to answer", "Consistently correct (All)", "Consistently correct (Most)", "Consistently incorrect (All)", "Consistently incorrect (Most)", "Two competing", "Many answers (Non correct)", "Many answers (Correct appears)"
* **Legend:** Located at the top-right of the chart.
* "greedy" (Green)
* "random" (Brown)
* "majority" (Gray)
* "probing" (Teal)
### Detailed Analysis
The chart consists of nine groups of bars, one for each response type category. Within each group, there are four bars, one for each strategy.
* **All:**
* greedy: Approximately 63%
* random: Approximately 64%
* majority: Approximately 71%
* probing: Approximately 67%
* **Refuses to answer:**
* greedy: Approximately 6%
* random: Approximately 6%
* majority: Approximately 0%
* probing: Approximately 28%
* **Consistently correct (All):**
* greedy: 100%
* random: 100%
* majority: 100%
* probing: 100%
* **Consistently correct (Most):**
* greedy: Approximately 88%
* random: Approximately 83%
* majority: Approximately 89%
* probing: Approximately 99%
* **Consistently incorrect (All):**
* greedy: 0%
* random: 0%
* majority: 0%
* probing: 0%
* **Consistently incorrect (Most):**
* greedy: Approximately 11%
* random: Approximately 15%
* majority: Approximately 0%
* probing: Approximately 53%
* **Two competing:**
* greedy: Approximately 32%
* random: Approximately 45%
* majority: Approximately 50%
* probing: Approximately 78%
* **Many answers (Non correct):**
* greedy: Approximately 1%
* random: Approximately 0%
* majority: Approximately 0%
* probing: Approximately 0%
* **Many answers (Correct appears):**
* greedy: Approximately 19%
* random: Approximately 23%
* majority: Approximately 38%
* probing: Approximately 56%
### Key Observations
* All strategies achieve 100% on "Consistently correct (All)".
* The "probing" strategy exhibits the highest accuracy in categories like "Refuses to answer", "Consistently incorrect (Most)", and "Two competing".
* The "majority" strategy performs well in "Consistently correct (Most)" and "Many answers (Correct appears)".
* The "greedy" and "random" strategies show relatively similar accuracy across most categories.
* The "majority" strategy scores 0% in "Consistently incorrect (All)" and "Consistently incorrect (Most)".
### Interpretation
The chart measures how often each answer-selection strategy yields a correct answer within each error-type category, not how frequently each behavior occurs. When the model is consistently correct, every strategy trivially succeeds; when the correct answer is never generated ("Consistently incorrect (All)", "Many answers (Non correct)"), no selection strategy can recover it. The informative categories are those where the correct answer is generated but not preferred: there, probe-based selection clearly outperforms greedy decoding, random selection, and majority vote (roughly 53% versus 0-15% on "Consistently incorrect (Most)", and 78% versus 32-50% on "Two competing"). The distinction between "All" and "Most" separates cases where the behavior holds in every resample from cases where it holds in at least half of them. Overall, the chart indicates that the probe can identify correct answers that the model's own generation behavior does not favor.
</details>
(a) TriviaQA
<details>
<summary>extracted/6450693/figures/choose_answer/probe_choose_answer_math_mistral_instruct.png Details</summary>

### Visual Description
## Bar Chart: Accuracy Assessment
### Overview
The image presents a bar chart comparing accuracy levels across different categories: "All", "Consistently correct (All)", "Consistently correct (Most)", "Consistently incorrect (All)", and "Consistently incorrect (Most)". Each category has three bars representing different data series. The Y-axis represents a percentage scale from 0 to 100.
### Components/Axes
* **X-axis:** Categories: "All", "Consistently correct (All)", "Consistently correct (Most)", "Consistently incorrect (All)", "Consistently incorrect (Most)".
* **Y-axis:** Percentage scale, ranging from 0 to 100, with increments of 10.
* **Data Series:** Three distinct data series, represented by blue, green, and red bars.
* **No Legend:** The chart lacks a legend explicitly identifying each data series.
### Detailed Analysis
The chart consists of five groups of three bars each.
* **"All" Category:**
* Blue bar: Approximately 55.
* Green bar: Approximately 52.
* Red bar: Approximately 70.
* **"Consistently correct (All)" Category:**
* Blue bar: 100.
* Green bar: 100.
* Red bar: 100.
* **"Consistently correct (Most)" Category:**
* Blue bar: Approximately 87.
* Green bar: Approximately 84.
* Red bar: Approximately 96.
* **"Consistently incorrect (All)" Category:**
* Blue bar: 0.
* Green bar: 0.
* Red bar: Approximately 5.
* **"Consistently incorrect (Most)" Category:**
* Blue bar: 0.
* Green bar: Approximately 10.
* Red bar: Approximately 82.
### Key Observations
* The "Consistently correct (All)" category shows 100% accuracy across all three data series.
* The "Consistently incorrect (All)" category shows very low accuracy, with the blue and green series at 0%.
* The "All" category shows the lowest overall accuracy, with values ranging from 52 to 70.
* The red data series generally shows higher values than the blue and green series, except in the "Consistently incorrect (All)" category.
* There is a significant difference in accuracy between the "Consistently correct" and "Consistently incorrect" categories.
### Interpretation
The chart reports accuracy under different answer-selection strategies, broken down by how consistent the model's resampled answers are. When the model is consistently correct, every series reaches 100%; when it is consistently incorrect across essentially all resamples, accuracy stays near zero regardless of strategy. The most informative category is "Consistently incorrect (Most)", where the red series reaches roughly 82% while the others remain near 0-10%; this pattern is consistent with probe-based selection as shown in the TriviaQA panel, which recovers correct answers precisely where the model's generations consistently favor an incorrect one. The large gap between the "Consistently correct" and "Consistently incorrect" categories suggests that errors are systematic rather than random.
</details>
(b) Math
Figure 5: Different answer choice strategies, Mistral-7B-Instruct. A notable improvement in accuracy by using the error-detection probe is observed for error types where the LLM shows no preference for the correct answer across repeated generations.
7 Discussion and Conclusions
In this study, we analyzed LLM errors through their internal representations. Our approach depends on access to internal representations, restricting its use to open-source models. We focus on QA tasks with clear gold labels, which are key for benchmarking truthfulness detection and valued by the community. To ensure robustness, we tested 10 datasets across 4 model architectures. Open-ended tasks are left for future research, with our work laying the groundwork for broader applications. For instance, we found that truthfulness-related information is localized in specific tokens within long responses, enabling practical improvements in error detection for production models. This insight could extend to tasks like summarization, by probing the most meaningful entities in an answer.
Truthfulness features showed poor generalization across tasks and datasets, highlighting the need for caution when applying trained error detectors in varied settings. Some unexplained patterns suggest hidden links between unrelated tasks that warrant further research. Improving generalization could involve exploring the effects of layer-token combinations and training on diverse datasets, as demonstrated by Bürger et al. (2024). Deciphering task-specific truthfulness features and their overlaps across tasks might also enhance classifier design. Still, task-specific probes could be highly valuable in critical fields like medicine and law, where reliability matters. These probes can detect errors, predict error types, and guide response selection from resampled outputs, offering significant practical benefits. Guidelines for applying these probes are provided in Appendix F.
Finally, we identified a significant discrepancy between the model's external behavior and internal states: it repeatedly outputs incorrect responses despite internally encoding the correct answer. It is possible that mechanisms favoring likelihood override those promoting truthfulness, as LLMs are trained to predict likely tokens, which does not necessarily align with factual accuracy. Our findings imply that these models already encode valuable information that could be harnessed to reduce errors. Work by Chuang et al. (2024) shows promising results in this area, while subsequent work by Gekhman et al. (2025) focused exclusively on this "hidden knowledge" phenomenon, formally defining it and studying its extent. In conclusion, our findings suggest that LLMs' internal representations provide useful insights into their errors, highlight the complex link between models' internal processes and their external outputs, and hopefully pave the way for further improvements in error detection and mitigation.
8 Reproducibility Statement
To ensure reproducibility of our work, we provide detailed instructions and the necessary code. The source code, including scripts for generating model answers, probing, resampling, and error type analysis, is available in the supplementary material, where we also provide command examples and the specific seeds used for experiment reproducibility. This repository includes documentation on how to set up the environment, download and preprocess datasets, and execute the experiments outlined in Sections 3-6 of the paper. Additionally, all datasets, models, and results generation steps are described in Appendix A.
Acknowledgments
This research was supported by the Israel Science Foundation (grant No. 448/20), an Azrieli Foundation Early Career Faculty Fellowship, an AI Alignment grant from Open Philanthropy, and a Google gift. HO is supported by the Apple AIML PhD fellowship. This research was funded by the European Union (ERC, Control-LM, 101165402). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
References
- Allauzen (2007) Alexandre Allauzen. Error detection in confusion network. In 8th Annual Conference of the International Speech Communication Association, INTERSPEECH 2007, Antwerp, Belgium, August 27-31, 2007, pp. 1749–1752. ISCA, 2007. doi: 10.21437/INTERSPEECH.2007-490. URL https://doi.org/10.21437/Interspeech.2007-490.
- Azaria & Mitchell (2023) Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 967–976, 2023.
- Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
- Belinkov (2021) Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances, 2021. URL https://arxiv.org/abs/2102.12452.
- Bell et al. (2019) Samuel J. Bell, Helen Yannakoudakis, and Marek Rei. Context is key: Grammatical error detection with contextual word representations. In Helen Yannakoudakis, Ekaterina Kochmar, Claudia Leacock, Nitin Madnani, Ildikó Pilán, and Torsten Zesch (eds.), Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2019, Florence, Italy, August 2, 2019, pp. 103–115. Association for Computational Linguistics, 2019. doi: 10.18653/V1/W19-4410. URL https://doi.org/10.18653/v1/w19-4410.
- Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Brunner et al. (2020) Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. On identifiability in transformers. In 8th International Conference on Learning Representations (ICLR 2020)(virtual). International Conference on Learning Representations, 2020.
- Bürger et al. (2024) Lennart Bürger, Fred A Hamprecht, and Boaz Nadler. Truth is universal: Robust detection of lies in LLMs. arXiv preprint arXiv:2407.12831, 2024.
- Burns et al. (2022) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
- Caines et al. (2020) Andrew Caines, Christian Bentz, Kate M. Knill, Marek Rei, and Paula Buttery. Grammatical error detection in transcriptions of spoken English. In Donia Scott, Núria Bel, and Chengqing Zong (eds.), Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pp. 2144–2162. International Committee on Computational Linguistics, 2020. doi: 10.18653/V1/2020.COLING-MAIN.195. URL https://doi.org/10.18653/v1/2020.coling-main.195.
- CH-Wang et al. (2023) Sky CH-Wang, Benjamin Van Durme, Jason Eisner, and Chris Kedzie. Do androids know they're only dreaming of electric sheep?, 2023.
- Chen et al. (2024) Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs' internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Zj12nzlQbz.
- Chen et al. (2013) Wei Chen, Sankaranarayanan Ananthakrishnan, Rohit Kumar, Rohit Prasad, and Prem Natarajan. ASR error detection in a conversational spoken language translation system. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pp. 7418–7422. IEEE, 2013. doi: 10.1109/ICASSP.2013.6639104. URL https://doi.org/10.1109/ICASSP.2013.6639104.
- Cheng & Duan (2020) Yong Cheng and Mofan Duan. Chinese grammatical error detection based on BERT model. In Erhong YANG, Endong XUN, Baolin ZHANG, and Gaoqi RAO (eds.), Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 108–113, Suzhou, China, December 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.nlptea-1.15.
- Chuang et al. (2024) Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Th6NyL07na.
- Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.
- Errattahi et al. (2015) Rahhal Errattahi, Asmaa El Hannani, and Hassan Ouahmane. Automatic speech recognition errors detection and correction: A review. In Mourad Abbas and Ahmed Abdelali (eds.), 1st International Conference on Natural Language and Speech Processing, ICNLSP 2015, Algiers, Algeria, October 18-19, 2015, volume 128 of Procedia Computer Science, pp. 32–37. Elsevier, 2015. doi: 10.1016/J.PROCS.2018.03.005. URL https://doi.org/10.1016/j.procs.2018.03.005.
- Flickinger et al. (2016) Dan Flickinger, Michael Wayne Goodman, and Woodley Packard. UW-Stanford system description for AESW 2016 shared task on grammatical error detection. In Joel R. Tetreault, Jill Burstein, Claudia Leacock, and Helen Yannakoudakis (eds.), Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@NAACL-HLT 2016, June 16, 2016, San Diego, California, USA, pp. 105–111. The Association for Computer Linguistics, 2016. doi: 10.18653/V1/W16-0511. URL https://doi.org/10.18653/v1/w16-0511.
- Gao et al. (2023) Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16477–16508, 2023.
- Gekhman et al. (2020) Zorik Gekhman, Roee Aharoni, Genady Beryozkin, Markus Freitag, and Wolfgang Macherey. KoBE: Knowledge-based machine translation evaluation. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3200–3207, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.287. URL https://aclanthology.org/2020.findings-emnlp.287.
- Gekhman et al. (2022) Zorik Gekhman, Dina Zverinski, Jonathan Mallinson, and Genady Beryozkin. RED-ACE: Robust error detection for ASR using confidence embeddings. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2800–2808, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.180. URL https://aclanthology.org/2022.emnlp-main.180.
- Gekhman et al. (2023) Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. TrueTeacher: Learning factual consistency evaluation with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2053–2070, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.127. URL https://aclanthology.org/2023.emnlp-main.127.
- Gekhman et al. (2024) Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations?, 2024.
- Gekhman et al. (2025) Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. Inside-out: Hidden factual knowledge in LLMs. arXiv preprint arXiv:2503.15299, 2025.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.
- Gottesman & Geva (2024) Daniela Gottesman and Mor Geva. Estimating knowledge in large language models without generating a single token. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, Florida, 2024. Association for Computational Linguistics.
- Guerreiro et al. (2023) Nuno M Guerreiro, Elena Voita, and André FT Martins. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 1059–1075, 2023.
- Harnad (2024) Stevan Harnad. Language writ large: Llms, chatgpt, grounding, meaning and understanding. arXiv preprint arXiv:2402.02243, 2024.
- Honovich et al. (2021) Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. $q^{2}$: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7856–7870, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.619. URL https://aclanthology.org/2021.emnlp-main.619.
- Honovich et al. (2022) Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. TRUE: Re-evaluating factual consistency evaluation. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3905–3920, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.287. URL https://aclanthology.org/2022.naacl-main.287.
- Huang et al. (2023a) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023a.
- Huang et al. (2023b) Yuheng Huang, Jiayang Song, Zhijie Wang, Huaming Chen, and Lei Ma. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236, 2023b.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxiv.org/abs/2310.06825.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, 2017.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- Kasewa et al. (2018) Sudhanshu Kasewa, Pontus Stenetorp, and Sebastian Riedel. Wronging a right: Generating better errors to improve grammatical error detection. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 4977–4983. Association for Computational Linguistics, 2018. URL https://aclanthology.org/D18-1541/.
- Kotek et al. (2023) Hadas Kotek, Rikker Dockum, and David Sun. Gender bias and stereotypes in large language models. In Proceedings of the ACM Collective Intelligence Conference, pp. 12–24, 2023.
- Kryscinski et al. (2020) Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the factual consistency of abstractive text summarization. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9332–9346, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.750. URL https://aclanthology.org/2020.emnlp-main.750.
- Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.
- Laban et al. (2022) Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177, 2022. doi: 10.1162/tacl_a_00453. URL https://aclanthology.org/2022.tacl-1.10.
- Levinstein & Herrmann (2024) Benjamin A Levinstein and Daniel A Herrmann. Still no lie detector for language models: Probing empirical and conceptual roadblocks. Philosophical Studies, pp. 1–27, 2024.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- Li et al. (2024) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024.
- Li & Wang (2024) Wei Li and Houfeng Wang. Detection-correction structure via general language model for grammatical error correction. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pp. 1748–1763. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.96.
- Liang et al. (2024) Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li, Yi Wang, Zhonghao Wang, Feiyu Xiong, et al. Internal consistency and self-feedback in large language models: A survey. arXiv preprint arXiv:2407.14507, 2024.
- Lin et al. (2021) Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Liu et al. (2023) Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4791–4797, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.291. URL https://aclanthology.org/2023.emnlp-main.291.
- Liu et al. (2022) Tianyu Liu, Yizhe Zhang, Chris Brockett, Yi Mao, Zhifang Sui, Weizhu Chen, and Bill Dolan. A token-level reference-free hallucination detection benchmark for free-form text generation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6723–6737, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.464. URL https://aclanthology.org/2022.acl-long.464.
- Lo (2019) Chi-kiu Lo. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, and Karin Verspoor (eds.), Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 507–513, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5358. URL https://aclanthology.org/W19-5358.
- Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
- Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9004–9017, 2023.
- Marks & Tegmark (2023) Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
- McGowan et al. (2023) Alessia McGowan, Yunlai Gui, Matthew Dobbs, Sophia Shuster, Matthew Cotter, Alexandria Selloni, Marianne Goodman, Agrima Srivastava, Guillermo A Cecchi, and Cheryl M Corcoran. Chatgpt and bard exhibit spontaneous citation fabrication during psychiatry literature search. Psychiatry Research, 326:115334, 2023.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022. arXiv:2202.05262.
- Millidge (2023) Beren Millidge. LLMs confabulate not hallucinate. Beren's Blog, March 2023. URL https://www.beren.io/2023-03-19-LLMs-confabulate-not-hallucinate/.
- Mishra & Kaur (2013) Ritika Mishra and Navjot Kaur. A survey of spelling error detection and correction techniques. International Journal of Computer Trends and Technology, 4(3):372–374, 2013.
- nostalgebraist (2020) nostalgebraist. Interpreting gpt: The logit lens. LessWrong blog post, 2020. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2024-11-18.
- Olah et al. (2023) Chris Olah, Nelson Elhage, Neel Nanda, Catherine Schubert, Daniel Filan, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Pellegrini & Trancoso (2009) Thomas Pellegrini and Isabel Trancoso. Error detection in broadcast news ASR using markov chains. In Zygmunt Vetulani (ed.), Human Language Technology. Challenges for Computer Science and Linguistics - 4th Language and Technology Conference, LTC 2009, Poznan, Poland, November 6-8, 2009, Revised Selected Papers, volume 6562 of Lecture Notes in Computer Science, pp. 59–69. Springer, 2009. doi: 10.1007/978-3-642-20095-3_6. URL https://doi.org/10.1007/978-3-642-20095-3_6.
- Pu et al. (2021) Amy Pu, Hyung Won Chung, Ankur Parikh, Sebastian Gehrmann, and Thibault Sellam. Learning compact metrics for MT. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 751–762, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.58. URL https://aclanthology.org/2021.emnlp-main.58.
- Rao et al. (2020) Gaoqi Rao, Erhong Yang, and Baolin Zhang. Overview of NLPTEA-2020 shared task for Chinese grammatical error diagnosis. In Erhong YANG, Endong XUN, Baolin ZHANG, and Gaoqi RAO (eds.), Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 25–35, Suzhou, China, December 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.nlptea-1.4.
- Rateike et al. (2023) Miriam Rateike, Celia Cintas, John Wamburu, Tanya Akumu, and Skyler Speakman. Weakly supervised detection of hallucinations in llm activations. arXiv preprint arXiv:2312.02798, 2023.
- Rawte et al. (2023) Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, SM Tonmoy, Aman Chadha, Amit P Sheth, and Amitava Das. The troubling emergence of hallucination in large language models – an extensive definition, quantification, and prescriptive remediations. arXiv preprint arXiv:2310.04988, 2023.
- Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A neural framework for MT evaluation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.213. URL https://aclanthology.org/2020.emnlp-main.213.
- Rei et al. (2022a) Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, and Marcos Zampieri (eds.), Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 578–585, Abu Dhabi, United Arab Emirates (Hybrid), December 2022a. Association for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.52.
- Rei et al. (2022b) Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task. In Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, and Marcos Zampieri (eds.), Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 634–645, Abu Dhabi, United Arab Emirates (Hybrid), December 2022b. Association for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.60.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Salles et al. (2020) Arleen Salles, Kathinka Evers, and Michele Farisco. Anthropomorphism in AI. AJOB Neuroscience, 11(2):88–95, 2020.
- Scialom et al. (2021) Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. QuestEval: Summarization asks for fact-based evaluation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6594–6604, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.529. URL https://aclanthology.org/2021.emnlp-main.529.
- Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7881–7892, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.704. URL https://aclanthology.org/2020.acl-main.704.
- Serapio-GarcĂa et al. (2023) Greg Serapio-GarcĂa, Mustafa Safdari, ClĂ©ment Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja MatariÄ. Personality traits in large language models. arXiv preprint arXiv:2307.00184, 2023.
- Simhi et al. (2024) Adi Simhi, Jonathan Herzig, Idan Szpektor, and Yonatan Belinkov. Constructing benchmarks and interventions for combating hallucinations in llms, 2024.
- Slobodkin et al. (2023) Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, and Shauli Ravfogel. The curious case of hallucinatory (un)answerability: Finding truths in the hidden states of over-confident large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3607–3625, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.220. URL https://aclanthology.org/2023.emnlp-main.220.
- Snyder et al. (2023) Ben Snyder, Marius Moisescu, and Muhammad Bilal Zafar. On early detection of hallucinations in factual question answering, 2023. URL https://arxiv.org/abs/2312.14183.
- Sun et al. (2024) Yuhong Sun, Zhangyue Yin, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Hui Zhao. Benchmarking hallucination in large language models based on unanswerable math word problem. CoRR, 2024.
- Taubenfeld et al. (2025) Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. arXiv preprint arXiv:2502.06233, 2025.
- Tian et al. (2023a) Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. arXiv preprint arXiv:2311.08401, 2023a.
- Tian et al. (2023b) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975, 2023b.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste RoziÚre, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Varshney et al. (2023) Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation, 2023.
- Venkit et al. (2024) Pranav Narayanan Venkit, Tatiana Chakravorti, Vipul Gupta, Heidi Biggs, Mukund Srinath, Koustava Goswami, Sarah Rajtmajer, and Shomir Wilson. “Confidently nonsensical?”: A critical survey on the perspectives and challenges of “hallucinations” in NLP. arXiv preprint arXiv:2404.07461, 2024.
- Wang & Sennrich (2020) Chaojun Wang and Rico Sennrich. On exposure bias, hallucination and domain shift in neural machine translation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3544–3552, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.326. URL https://aclanthology.org/2020.acl-main.326.
- Wang & Tan (2020) Quanbin Wang and Ying Tan. Grammatical error detection with self attention by pairwise training. In 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020, pp. 1–7. IEEE, 2020. doi: 10.1109/IJCNN48605.2020.9206715. URL https://doi.org/10.1109/IJCNN48605.2020.9206715.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, 2018.
- Yin et al. (2024) Fan Yin, Jayanth Srinivasa, and Kai-Wei Chang. Characterizing truthfulness in large language model generations with local intrinsic dimension. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024.
- Yona et al. (2024) Gal Yona, Roee Aharoni, and Mor Geva. Can large language models faithfully express their intrinsic uncertainty in words?, 2024. URL https://arxiv.org/abs/2405.16908.
- Yuksekgonul et al. (2023) Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, and Besmira Nushi. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. In The Twelfth International Conference on Learning Representations, 2023.
- Zhang et al. (2019) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1441–1451, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1139. URL https://aclanthology.org/P19-1139.
- Zhao et al. (2018) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018.
- Zhou et al. (2005) Lina Zhou, Yongmei Shi, Jinjuan Feng, and Andrew Sears. Data mining for detecting errors in dictation speech recognition. IEEE Trans. Speech Audio Process., 13(5-1):681–688, 2005. doi: 10.1109/TSA.2005.851874. URL https://doi.org/10.1109/TSA.2005.851874.
- Zou et al. (2023) Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to ai transparency, 2023. URL https://arxiv.org/abs/2310.01405.
Appendix A Implementation Details
A.1 Task Specific Error Detection
In this work, we specifically address errors produced by modern large language models (LLMs). Given the diverse range of tasks these models are applied to, our focus is on general error detection across all categories, rather than isolating specific types. Prior to the emergence of LLMs, much research targeted error detection for specific tasks, with common examples including grammatical errors (Kasewa et al., 2018; Bell et al., 2019; Cheng & Duan, 2020; Wang & Tan, 2020; Flickinger et al., 2016), spelling mistakes (Mishra & Kaur, 2013), machine translation inaccuracies (Lo, 2019; Pu et al., 2021; Sellam et al., 2020; Gekhman et al., 2020; Rei et al., 2020; 2022a; 2022b), speech recognition faults (Caines et al., 2020; Rao et al., 2020; Li & Wang, 2024; Zhou et al., 2005; Allauzen, 2007; Gekhman et al., 2022; Errattahi et al., 2015; Pellegrini & Trancoso, 2009; Chen et al., 2013), and factual consistency failures (Honovich et al., 2022; Laban et al., 2022; Honovich et al., 2021; Gekhman et al., 2023; Scialom et al., 2021; Kryscinski et al., 2020).
A.2 Probing: Implementation Details
We examine the intermediate representations of the exact answer tokens generated by a large language model (LLM) during the answer generation process. The intermediate representation selected for this analysis is derived from the output of the final multi-layer perceptron (MLP). This choice is based on preliminary experiments comparing the MLP output, the residual stream, and the attention heads, which showed no significant differences. We leave the in-depth analysis for future work.
For the probing classifier, we employ a logistic regression model from the scikit-learn library (Pedregosa et al., 2011). We used the default hyperparameters, which include an L2 norm penalty and the LBFGS solver. We initially experimented with other hyperparameters and did not find a significant difference. For each random seed, the dataset was split into training and validation sets in an 80-20 ratio, and the test dataset was bootstrap sampled.
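Under these settings, training the probe is only a few lines of scikit-learn. The sketch below uses random vectors as stand-ins for the extracted hidden states; the array shapes, seed, and raised `max_iter` (added only so the synthetic demo converges) are illustrative, not the paper's values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for real probing data: one hidden-state vector per answer
# (in practice, the MLP output at an exact-answer token) and a binary
# correctness label (1 = correct answer, 0 = error).
X = rng.normal(size=(1000, 256))
y = rng.integers(0, 2, size=1000)

# 80-20 train/validation split; otherwise scikit-learn defaults
# (L2 penalty, LBFGS solver), as described above.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_acc = probe.score(X_val, y_val)
```

On real activations, `val_acc` would measure how linearly decodable truthfulness is at the probed token position.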
Obtaining correctness labels for the probing dataset.
An answer is generally considered correct if it contains the gold answer label and that label appears before any incorrect alternative labels. We manually analyzed the results of this heuristic and confirmed that it is accurate in almost all cases. One exception is the Natural Questions with Context (NQ_WC) dataset, where we identified false negatives and therefore deployed a more precise validation using an instruct LLM, as demonstrated below:
{mdframed}
[backgroundcolor=blue!5, skipabove=0.5] Evaluate the following answers to questions. For each question you would be given an LLM answer and the correct answer. You would have to determine if the LLM answer is correct or not. If the LLM answer is correct, write “1” and if it is not correct, write “0”. For example:
Question: [Question 1]
Ground Truth: [Gold label 1]
LLM Answer: [LLM long answer 1]
Correctness: 0
Question: [Question 2]
Ground Truth: [Gold label 2]
LLM Answer: [LLM long answer 2]
Correctness: 1
Question: [Question]
Ground Truth: [Label]
LLM Answer: [LLM long answer]
Correctness:
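As a rough illustration of the string-matching heuristic described above (the function and its arguments are hypothetical, not the paper's code), an answer can be labeled by comparing the position of the first gold label against the position of the first incorrect alternative:

```python
def heuristic_correct(answer: str,
                      gold_labels: list[str],
                      wrong_labels: list[str]) -> bool:
    """Label an answer correct if some gold label occurs in it
    before any incorrect alternative label (hypothetical helper)."""
    text = answer.lower()
    # Positions of each label variant that actually occurs in the answer.
    gold_pos = [p for g in gold_labels if (p := text.find(g.lower())) != -1]
    wrong_pos = [p for w in wrong_labels if (p := text.find(w.lower())) != -1]
    if not gold_pos:
        return False          # no gold label mentioned at all
    if not wrong_pos:
        return True           # only gold labels mentioned
    return min(gold_pos) < min(wrong_pos)

print(heuristic_correct("The capital is Paris, not Lyon.",
                        ["Paris"], ["Lyon"]))  # True
```

The false negatives mentioned for NQ_WC arise exactly where such substring matching is too strict, which motivates the LLM-based validation prompt above.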
Detecting and using exact answer tokens.
Exact answers are identified from a lengthy generated answer using an external algorithm, which processes the question and the LLMâs response, $A(q_{i},\hat{y_{i}})$ , to extract the exact answer. After extraction, we identify the exact answer tokens via a simple search process, focusing on four key tokens: the one before the first exact answer token, the first and last exact answer tokens, and the one after the last.
For the implementation of $A$ that detects the exact locations of answer tokens, we use a combination of heuristic methods and an instruction-tuned LLM. Specifically, when the set of possible answers is finite, we rely on heuristics. For more open-ended scenarios, such as factual questions, we automatically locate the answer if it matches the gold label. Otherwise, we prompt an instruction-tuned LLM, specifically Mistral-7b-Instruct (Jiang et al., 2023), to identify and extract the exact answer substring using the following prompt:
{mdframed}
[backgroundcolor=blue!5, skipabove=0.5] Extract from the following long answer the short answer, only the relevant tokens. If the long answer does not answer the question, output NO ANSWER.
Q: [Question 1]
A: [LLM long answer 1]
Exact answer: [Short exact answer 1]
Q: [Question 2]
A: [LLM long answer that does not answer the question]
Exact answer: NO ANSWER
Q: [Question]
A: [LLM long answer]
Exact answer:
To extract a valid exact answer from a long response, we prompt the instruct LLM up to five times. This process involves verifying that the exact answer is a substring of the long answer unless the instruct LLM indicates that there is no answer. To avoid bias in our probing task, we only retain questions for which a valid exact answer was successfully extracted. This ensures there is no unfair correlation between invalid answers and incorrect answers in the experiments.
We note the following: (a) While it is possible to use an instruct LLM to extract every answer regardless of its correctness, we chose the aforementioned strategy to improve the efficiency of our experiments; (b) This is just one possible implementation. For each LLM, one could use the same LLM to extract its own exact answer token, as demonstrated in a proof-of-concept over 1000 samples of TriviaQA in Table 3. Alternatively, it may be more effective to train a smaller system specifically designed for detecting exact answer tokens, which would be more suitable for real-world scenarios. We choose to keep the extraction process as abstract as possible, as our primary focus is not on the specific implementation, but on analyzing the potential gains from probing these locations.
Additionally, if the exact answer token is not among the first generated tokens, we examine the token immediately preceding it (âbefore exact answer tokenâ). If the exact answer token is not the last one, we also examine the following token. When the exact answer spans multiple tokens, the first and last exact answer tokens are probed separately.
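A minimal sketch of this search, assuming the generated answer and the extracted exact answer are already tokenized into matching string tokens (the function name and example tokens are illustrative, not the paper's implementation):

```python
def find_probe_positions(answer_tokens: list[str], exact_tokens: list[str]):
    """Locate the exact-answer span in the generated answer and return
    the four probed positions: the token before the span, the first and
    last exact-answer tokens, and the token after the span.
    Positions that fall outside the sequence are returned as None."""
    n, m = len(answer_tokens), len(exact_tokens)
    for i in range(n - m + 1):
        if answer_tokens[i:i + m] == exact_tokens:
            first, last = i, i + m - 1
            before = first - 1 if first > 0 else None
            after = last + 1 if last < n - 1 else None
            return before, first, last, after
    return None  # exact answer not found as a token sub-sequence

toks = ["The", "capital", "of", "France", "is", "Paris", "."]
print(find_probe_positions(toks, ["Paris"]))  # (4, 5, 5, 6)
```

For multi-token answers the first and last positions differ, matching the separate probing of first and last exact-answer tokens described above.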
Table 3: Success rate of extracting exact answer from a long model answer. Each model is used to extract answers from its own output.
| Mistral-7b | Mistral-Instruct-7b | Llama3-8b | Llama3-Instruct-8b |
| --- | --- | --- | --- |
| 0.99 | 0.96 | 0.99 | 0.95 |
A.3 Datasets
We outline here all ten datasets that we investigate in our work. In our analysis, we aimed to cover a wide range of tasks, the skills required to solve them, and a diversity of datasets, and as a result also different LLM limitations, such as factual inaccuracies (often referred to as “hallucinations”), biases, arithmetic mistakes, and more. For each dataset, we explain how it covers something different from all the previous datasets. For all datasets, we present the LLM with no instruction or a short one, plus a context (if one exists for the task), and let it generate free text. We follow this paradigm as it better mimics real-world usage of LLMs by humans, as opposed to using few-shot prompting to force a short answer that is generated at the first token (Yuksekgonul et al., 2023; Chen et al., 2024; Simhi et al., 2024). One exception is sentiment analysis (IMDB), for which we apply 1-shot prompting so that the LLM uses the allowed labels, as it did not follow the instruction alone and we could not determine whether the answer was correct even with manual analysis. Additionally, we implemented different prompting strategies for the instruct and non-instruct LLMs. For the exact formats we used to prompt each dataset and LLM, refer to our code implementation at https://github.com/technion-cs-nlp/LLMsKnow.
For each dataset we used a split of 10K training samples and 10K test samples, unless the dataset is too small, in which case we mention the size.
- TriviaQA (Joshi et al., 2017): a collection of trivia question-answer pairs. The questions are presented to the LLM without any context, allowing it to generate responses based solely on its internal, parametric knowledge. The dataset includes various acceptable variations of the correct answer, which are used to automatically evaluate the accuracy of the generated responses.
- HotpotQA (Yang et al., 2018): a dataset designed for diverse multi-hop question answering. Each entry includes Wikipedia documents that help answer the questions. We use two different settings: (1) without context, where questions are asked directly, which covers slightly different skills from TriviaQA as it requires reasoning in addition to factual knowledge; and (2) with context (HotpotQA_WC), where the additional context is provided, emphasizing the ability to adhere to and utilize contextual information to solve the task.
- Movies: to further investigate generalization, we focused on a case of classic âhallucinationsâ, involving factual knowledge, within a non-diverse dataset. This approach allowed us to test whether generalization to other types of errors is influenced by the type of error (factual versus others) or by the datasetâs diversity. For this purpose, we created the movies dataset consisting of prompts in the form: âWho acted as [figure name] in the movie [movie name]?â The figures, movies, and correct answers were sourced from âThe Movies Datasetâ in Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset, which is based on the MovieLens website.
- Winogrande (Sakaguchi et al., 2021): we use this dataset to explore errors in common-sense reasoning. It consists of Winograd-style coreference challenges, where each example presents a sentence containing two entities and a pronoun. The objective is to determine which entity the pronoun refers to, relying on common-sense reasoning. For example, in the sentence: “The trophy doesn’t fit into the suitcase because it’s too large,” the pronoun “it” refers to the trophy, not the suitcase.
- Winobias (Zhao et al., 2018): this benchmark focuses on coreference resolution in the context of gender bias, revealing a different type of limitation in LLMs. Each example consists of two professions: one stereotypically male and one stereotypically female, along with a gendered pronoun. The task requires the LLM to determine which profession the pronoun refers to. The sentences are unambiguous, with one correct answer. In some cases, the correct answer aligns with the stereotype, while in others, it is anti-stereotypical. For example, in the sentence “The developer argued with the designer because she did not like the design,” “she” refers to the developer, which is an anti-stereotypical case since “developer” is considered a stereotypically male profession. Research has shown that LLMs often perform poorly on anti-stereotypical sentences (Zhao et al., 2018) and tend to base their decisions on stereotypes rather than on common-sense reasoning or linguistic rules (Kotek et al., 2023). Each split contains around 1500 samples.
- NLI (Natural Language Inference): NLI involves determining whether a given "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) based on a provided "premise." For this purpose, we use the MNLI dataset (Williams et al., 2018). NLI tasks address a distinct aspect of common-sense reasoning and are generally considered complex. This complexity allows us to investigate whether a model's generalization ability is related to the difficulty of the task it was trained on, or to other factors, such as the limited diversity of labels (NLI has only three valid labels) or the type of task.
- Math (Sun et al., 2024): this dataset includes both unanswerable and answerable math problems. In our study, we focus exclusively on the answerable problems, as our aim is to assess the correctness of the LLM's outputs, which requires a known correct answer (gold standard). This task introduces an additional, previously unexplored skill of arithmetic reasoning. The train-test split consists of approximately 2,000 and 650 samples, respectively.
- IMDB (Maas et al., 2011): contains movie reviews used for the task of sentiment classification.
- Natural Questions With Context (Kwiatkowski et al., 2019): the Natural Questions (NQ) dataset is designed to evaluate and train automatic question-answering systems. It consists of real, anonymized queries submitted by users to Google, with answers extracted from Wikipedia, as well as the relevant Wikipedia pages which can be given in context. We included this dataset to introduce an additional challenge that requires adherence to context, complementing the HotpotQA with context dataset.
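For illustration, a Movies-style prompt (described above) can be constructed from a figure/movie pair; `movies_prompt` is a hypothetical helper, not the paper's actual data-generation code:

```python
def movies_prompt(figure_name: str, movie_name: str) -> str:
    """Build a Movies-dataset prompt in the template used above."""
    return f"Who acted as {figure_name} in the movie {movie_name}?"


# Example with a made-up (figure, movie) pair:
prompt = movies_prompt("Jack Dawson", "Titanic")
```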
A.4 Baselines: Implementation Details
Aggregated probabilities / logits.
Inspired by prior work (Kadavath et al., 2022; Guerreiro et al., 2023), we compute an aggregated score using the log-probabilities or raw probabilities of the generated text tokens $y_{1},y_{2},...,y_{N}$ produced by the generative large language model (LLM). For instance, the following formulation is used to compute the Logits-mean baseline on the entire generated answer:
$$
\frac{1}{N}\sum_{i=1}^{N}\mathbb{P}(y_{i}\mid Q,y_{1},\dots,y_{i-1}) \tag{1}
$$
We also explore aggregation strategies that focus solely on the exact answer tokens (PE-Exact). Following Varshney et al. (2023), we further experiment with aggregating the minimum and maximum values (PE-[Min/Max]-[Exact]), alongside the mean aggregation described in Equation 1.
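The mean, min, and max aggregations above can be sketched as follows. This is a minimal illustration operating on per-token log-probabilities; `aggregate_token_scores` and `aggregate_probas` are hypothetical helper names, not the paper's implementation, and restricting the input list to the exact-answer token positions yields the "-exact" variants:

```python
import math

def aggregate_token_scores(scores, method="mean"):
    """Aggregate per-token scores (log-probs or probs) into a single
    confidence value, as in Equation 1 (mean) and its min/max variants."""
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "min":
        return min(scores)
    if method == "max":
        return max(scores)
    raise ValueError(f"unknown method: {method}")

def aggregate_probas(token_logprobs, method="mean"):
    """Same aggregation applied to raw probabilities instead of log-probs."""
    probas = [math.exp(lp) for lp in token_logprobs]
    return aggregate_token_scores(probas, method)
```

For example, two generated tokens with probabilities 0.5 and 0.25 give a mean-probability score of 0.375 and a min-probability score of 0.25.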
P(True):
We follow Kadavath et al. (2022) and prompt the LLM to judge whether its own answer is correct, using their prompt template:
Question: [Question]
Proposed Answer: [LLM long answer]
Is the proposed answer:
(A) True
(B) False
The proposed answer is:
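Filling this template can be sketched as below; `build_p_true_prompt` is a hypothetical helper, and the caller is assumed to score the model's probability of continuing with "(A)" to obtain P(True):

```python
P_TRUE_TEMPLATE = """Question: {question}
Proposed Answer: {answer}
Is the proposed answer:
(A) True
(B) False
The proposed answer is:"""

def build_p_true_prompt(question: str, answer: str) -> str:
    """Fill the P(True) template of Kadavath et al. (2022).
    The model's probability of generating "(A)" after this prompt
    is then used as its self-assessed correctness score."""
    return P_TRUE_TEMPLATE.format(question=question, answer=answer)
```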
Appendix B Full Error Detection Results
Figure 6 presents the AUC values of a trained probe across layers and tokens for Mistral-7b-instruct, showing a similar pattern across all datasets. We also observe similar patterns across other models; see our repository https://github.com/technion-cs-nlp/LLMsKnow for the figures.
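The probing setup behind these heatmaps can be illustrated with a self-contained sketch: a logistic-regression probe is fit on hidden states from one (layer, token) position, with labels indicating answer correctness, and evaluated by AUC. The helpers `train_linear_probe` and `auc` are illustrative stand-ins (plain gradient descent and the rank-statistic AUC), not the paper's code, and the data here is synthetic:

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe on hidden states X
    (n_samples x hidden_dim) with binary correctness labels y."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on weights
        b -= lr * np.mean(p - y)                # gradient step on bias
    return w, b

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic
    (assumes no tied scores, which suffices for this sketch)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic 1-D "hidden states": positive values for correct answers.
X = np.array([[1.0], [2.0], [3.0], [-1.0], [-2.0]])
y = np.array([1, 1, 1, 0, 0])
w, b = train_linear_probe(X, y)
probe_auc = auc(X @ w + b, y)  # perfectly separable toy data
```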
[Figure: probe AUC heatmap over layers (y-axis, 2-30) and tokens (x-axis: last_q, answer tokens, exact-answer tokens, positions -8 to -1); color scale 0.5-1.0.]
(a) HotpotQA
[Figure: probe AUC heatmap over layers (y-axis, 0-30) and tokens (x-axis: last_q, answer tokens, exact-answer tokens, positions -8 to -1); color scale 0.5-1.0.]
(b) HotpotQA with context
[Figure: probe AUC heatmap over layers (y-axis, 2-30) and tokens (x-axis: last_q, answer tokens, exact-answer tokens, positions -8 to -1); color scale 0.5-1.0.]
(c) Movies
[Figure: probe AUC heatmap over layers (y-axis, 0-30) and tokens (x-axis: last_q, answer tokens, exact-answer tokens, positions -8 to -1); color scale 0.5-1.0.]
(d) Winogrande
[Figure: probe AUC heatmap over layers (y-axis, 0-30) and tokens (x-axis: last_q, answer tokens, exact-answer tokens, positions -8 to -1); color scale 0.5-1.0.]
(e) NLI
[Figure: probe AUC heatmap over layers (y-axis, 2-30) and tokens (x-axis: last_q, exact-answer tokens, positions -8 to -1); color scale 0.5-1.0.]
(f) IMDB
Figure 6: AUC values of a probe error detector across layers and tokens, Mistral-7b-instruct. The detection performance spikes at the exact answer tokens.
Tables 4, 5, 6, and 7 present the full error detection results across all baselines and datasets, which are consistent with the main paper results.
Table 4: Comparison of error detection performance (AUC) on Mistral-7B.
| | TriviaQA | Winobias | Math | Movies | IMDB |
| --- | --- | --- | --- | --- | --- |
| Logits-mean | $0.67$ $± 0.004$ | $0.49$ $± 0.010$ | $0.41$ $± 0.015$ | $0.67$ $± 0.007$ | $0.88$ $± 0.064$ |
| Logits-mean-exact | $0.67$ $± 0.004$ | $0.50$ $± 0.010$ | $0.56$ $± 0.026$ | $0.68$ $± 0.008$ | $0.57$ $± 0.080$ |
| Logits-min | $0.80$ $± 0.003$ | $0.45$ $± 0.014$ | $0.48$ $± 0.021$ | $0.73$ $± 0.006$ | $0.78$ $± 0.056$ |
| Logits-min-exact | $0.80$ $± 0.005$ | $0.53$ $± 0.014$ | $0.78$ $± 0.032$ | $0.72$ $± 0.005$ | $0.57$ $± 0.080$ |
| Logits-max | $0.53$ $± 0.008$ | $0.49$ $± 0.010$ | $0.42$ $± 0.023$ | $0.54$ $± 0.005$ | $0.83$ $± 0.076$ |
| Logits-max-exact | $0.54$ $± 0.009$ | $0.50$ $± 0.010$ | $0.40$ $± 0.024$ | $0.58$ $± 0.007$ | $0.57$ $± 0.080$ |
| Probas-mean | $0.76$ $± 0.003$ | $0.53$ $± 0.018$ | $0.66$ $± 0.016$ | $0.72$ $± 0.007$ | $0.87$ $± 0.041$ |
| Probas-mean-exact | $0.78$ $± 0.002$ | $0.55$ $± 0.014$ | $0.62$ $± 0.016$ | $0.74$ $± 0.007$ | $0.83$ $± 0.057$ |
| Probas-min | $0.82$ $± 0.003$ | $0.52$ $± 0.013$ | $0.82$ $± 0.020$ | $0.73$ $± 0.006$ | $0.86$ $± 0.032$ |
| Probas-min-exact | **0.85** $± 0.003$ | $0.58$ $± 0.011$ | $0.84$ $± 0.015$ | $0.74$ $± 0.006$ | $0.83$ $± 0.057$ |
| Probas-max | $0.53$ $± 0.008$ | $0.50$ $± 0.016$ | $0.43$ $± 0.025$ | $0.55$ $± 0.008$ | $0.80$ $± 0.074$ |
| Probas-max-exact | $0.55$ $± 0.009$ | $0.51$ $± 0.013$ | $0.39$ $± 0.019$ | $0.59$ $± 0.009$ | $0.83$ $± 0.057$ |
| p(True) | $0.57$ $± 0.007$ | $0.53$ $± 0.019$ | $0.56$ $± 0.027$ | $0.51$ $± 0.003$ | $0.65$ $± 0.004$ |
| p(True)-exact | $0.56$ $± 0.006$ | $0.55$ $± 0.026$ | $0.57$ $± 0.036$ | $0.52$ $± 0.003$ | $0.65$ $± 0.003$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.83$ $± 0.002$ | $0.65$ $± 0.008$ | $0.82$ $± 0.023$ | $0.79$ $± 0.002$ | $0.85$ $± 0.007$ |
| Before last generated [-2] | $0.82$ $± 0.003$ | $0.84$ $± 0.012$ | $0.83$ $± 0.019$ | $0.78$ $± 0.003$ | $0.95$ $± 0.004$ |
| End of question | $0.74$ $± 0.005$ | $0.78$ $± 0.012$ | $0.83$ $± 0.016$ | $0.77$ $± 0.002$ | $0.81$ $± 0.009$ |
| Exact answer last | $0.84$ $± 0.005$ | **0.89** $± 0.007$ | **0.96** $± 0.008$ | $0.78$ $± 0.003$ | **0.95** $± 0.004$ |
| Exact answer last+1 | $0.84$ $± 0.004$ | $0.84$ $± 0.012$ | $0.95$ $± 0.010$ | **0.80** $± 0.002$ | $0.85$ $± 0.007$ |
| | HotpotQA | HotpotQA-WC | Winogrande | NLI | NQ-WC |
| Logits-mean | $0.63$ $± 0.005$ | $0.52$ $± 0.009$ | $0.49$ $± 0.004$ | $0.51$ $± 0.004$ | $0.69$ $± 0.006$ |
| Logits-mean-exact | $0.57$ $± 0.008$ | $0.52$ $± 0.007$ | $0.50$ $± 0.003$ | **0.93** $± 0.004$ | $0.72$ $± 0.005$ |
| Logits-min | $0.72$ $± 0.008$ | $0.59$ $± 0.006$ | $0.50$ $± 0.007$ | $0.53$ $± 0.005$ | $0.65$ $± 0.009$ |
| Logits-min-exact | $0.72$ $± 0.007$ | $0.65$ $± 0.004$ | $0.51$ $± 0.007$ | $0.49$ $± 0.006$ | $0.70$ $± 0.005$ |
| Logits-max | $0.54$ $± 0.007$ | $0.49$ $± 0.010$ | $0.48$ $± 0.005$ | $0.48$ $± 0.005$ | $0.59$ $± 0.012$ |
| Logits-max-exact | $0.48$ $± 0.010$ | $0.44$ $± 0.007$ | $0.50$ $± 0.003$ | $0.48$ $± 0.005$ | $0.58$ $± 0.009$ |
| Probas-mean | $0.65$ $± 0.004$ | $0.55$ $± 0.006$ | $0.51$ $± 0.007$ | $0.49$ $± 0.003$ | $0.63$ $± 0.008$ |
| Probas-mean-exact | $0.62$ $± 0.006$ | $0.56$ $± 0.007$ | $0.51$ $± 0.005$ | $0.02$ $± 0.001$ | $0.66$ $± 0.007$ |
| Probas-min | $0.73$ $± 0.005$ | $0.58$ $± 0.007$ | $0.52$ $± 0.009$ | $0.53$ $± 0.004$ | $0.63$ $± 0.011$ |
| Probas-min-exact | $0.78$ $± 0.005$ | $0.66$ $± 0.004$ | $0.52$ $± 0.008$ | $0.49$ $± 0.005$ | $0.69$ $± 0.006$ |
| Probas-max | $0.54$ $± 0.008$ | $0.49$ $± 0.007$ | $0.50$ $± 0.005$ | $0.47$ $± 0.004$ | $0.52$ $± 0.004$ |
| Probas-max-exact | $0.48$ $± 0.010$ | $0.44$ $± 0.005$ | $0.50$ $± 0.004$ | $0.48$ $± 0.003$ | $0.53$ $± 0.012$ |
| p(True) | $0.55$ $± 0.007$ | $0.54$ $± 0.006$ | $0.51$ $± 0.005$ | $0.51$ $± 0.003$ | $0.52$ $± 0.008$ |
| p(True)-exact | $0.61$ $± 0.005$ | $0.54$ $± 0.006$ | $0.61$ $± 0.006$ | $0.51$ $± 0.006$ | $0.53$ $± 0.014$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.78$ $± 0.006$ | $0.67$ $± 0.004$ | $0.51$ $± 0.007$ | $0.77$ $± 0.004$ | $0.78$ $± 0.003$ |
| Before last generated [-2] | $0.79$ $± 0.007$ | $0.69$ $± 0.007$ | $0.66$ $± 0.004$ | $0.81$ $± 0.002$ | $0.75$ $± 0.006$ |
| End of question | $0.72$ $± 0.007$ | $0.56$ $± 0.003$ | $0.51$ $± 0.007$ | $0.88$ $± 0.004$ | $0.70$ $± 0.005$ |
| Exact answer last | $0.80$ $± 0.008$ | **0.74** $± 0.007$ | **0.69** $± 0.006$ | $0.84$ $± 0.004$ | $0.81$ $± 0.009$ |
| Exact answer last+1 | **0.81** $± 0.008$ | $0.72$ $± 0.005$ | $0.59$ $± 0.005$ | $0.75$ $± 0.006$ | **0.84** $± 0.007$ |
Table 5: Comparison of error detection performance (AUC) on Mistral-7B-Instruct.
| | Mistral-7B-Instruct | | | | |
| --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | Movies | IMDB |
| Logits-mean | $0.60$ $± 0.009$ | $0.56$ $± 0.017$ | $0.55$ $± 0.029$ | $0.63$ $± 0.005$ | $0.57$ $± 0.006$ |
| Logits-mean-exact | $0.68$ $± 0.007$ | $0.54$ $± 0.012$ | $0.51$ $± 0.005$ | $0.70$ $± 0.004$ | $0.87$ $± 0.007$ |
| Logits-min | $0.63$ $± 0.008$ | $0.59$ $± 0.012$ | $0.51$ $± 0.017$ | $0.66$ $± 0.008$ | $0.52$ $± 0.007$ |
| Logits-min-exact | $0.75$ $± 0.006$ | $0.53$ $± 0.013$ | $0.71$ $± 0.009$ | $0.74$ $± 0.005$ | $0.87$ $± 0.007$ |
| Logits-max | $0.54$ $± 0.005$ | $0.53$ $± 0.012$ | $0.54$ $± 0.039$ | $0.54$ $± 0.004$ | $0.47$ $± 0.004$ |
| Logits-max-exact | $0.55$ $± 0.004$ | $0.54$ $± 0.011$ | $0.32$ $± 0.015$ | $0.61$ $± 0.006$ | $0.87$ $± 0.007$ |
| Probas-mean | $0.60$ $± 0.007$ | $0.58$ $± 0.018$ | $0.56$ $± 0.028$ | $0.61$ $± 0.002$ | $0.54$ $± 0.008$ |
| Probas-mean-exact | $0.71$ $± 0.003$ | $0.57$ $± 0.015$ | $0.71$ $± 0.014$ | $0.74$ $± 0.006$ | $0.84$ $± 0.007$ |
| Probas-min | $0.59$ $± 0.008$ | $0.58$ $± 0.014$ | $0.50$ $± 0.025$ | $0.60$ $± 0.008$ | $0.51$ $± 0.010$ |
| Probas-min-exact | $0.74$ $± 0.004$ | $0.57$ $± 0.016$ | $0.75$ $± 0.011$ | $0.73$ $± 0.006$ | $0.84$ $± 0.007$ |
| Probas-max | $0.50$ $± 0.006$ | $0.41$ $± 0.010$ | $0.53$ $± 0.009$ | $0.51$ $± 0.005$ | $0.48$ $± 0.004$ |
| Probas-max-exact | $0.51$ $± 0.007$ | $0.54$ $± 0.010$ | $0.45$ $± 0.015$ | $0.60$ $± 0.003$ | $0.84$ $± 0.007$ |
| p(True) | $0.68$ $± 0.005$ | $0.45$ $± 0.021$ | $0.48$ $± 0.026$ | $0.62$ $± 0.005$ | $0.62$ $± 0.009$ |
| p(True)-exact | $0.74$ $± 0.003$ | $0.40$ $± 0.021$ | $0.60$ $± 0.025$ | $0.69$ $± 0.008$ | $0.60$ $± 0.009$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.71$ $± 0.006$ | $0.82$ $± 0.004$ | $0.74$ $± 0.008$ | $0.72$ $± 0.005$ | $0.92$ $± 0.010$ |
| Before last generated [-2] | $0.73$ $± 0.004$ | $0.85$ $± 0.004$ | $0.74$ $± 0.007$ | $0.72$ $± 0.006$ | $0.94$ $± 0.006$ |
| End of question | $0.76$ $± 0.008$ | $0.82$ $± 0.011$ | $0.72$ $± 0.007$ | $0.74$ $± 0.003$ | $0.96$ $± 0.006$ |
| Exact answer last | $0.85$ $± 0.004$ | **0.92** $± 0.005$ | **0.92** $± 0.008$ | $0.81$ $± 0.003$ | **0.97** $± 0.005$ |
| Exact answer last+1 | **0.86** $± 0.006$ | $0.88$ $± 0.006$ | $0.90$ $± 0.010$ | **0.82** $± 0.003$ | $0.96$ $± 0.006$ |
| | HotpotQA | HotpotQA-WC | Winogrande | NLI | NQ-WC |
| Logits-mean | $0.61$ $± 0.002$ | $0.55$ $± 0.009$ | $0.59$ $± 0.004$ | $0.64$ $± 0.006$ | $0.71$ $± 0.008$ |
| Logits-mean-exact | $0.66$ $± 0.009$ | $0.55$ $± 0.004$ | $0.49$ $± 0.004$ | $0.57$ $± 0.004$ | $0.69$ $± 0.009$ |
| Logits-min | $0.61$ $± 0.003$ | $0.53$ $± 0.013$ | $0.61$ $± 0.003$ | $0.62$ $± 0.002$ | $0.67$ $± 0.008$ |
| Logits-min-exact | $0.77$ $± 0.004$ | $0.67$ $± 0.013$ | $0.48$ $± 0.004$ | $0.54$ $± 0.005$ | $0.69$ $± 0.006$ |
| Logits-max | $0.53$ $± 0.008$ | $0.51$ $± 0.011$ | $0.52$ $± 0.006$ | $0.59$ $± 0.008$ | $0.63$ $± 0.011$ |
| Logits-max-exact | $0.51$ $± 0.011$ | $0.41$ $± 0.010$ | $0.49$ $± 0.007$ | $0.64$ $± 0.003$ | $0.63$ $± 0.013$ |
| Probas-mean | $0.63$ $± 0.003$ | $0.56$ $± 0.010$ | $0.58$ $± 0.005$ | $0.62$ $± 0.005$ | $0.68$ $± 0.010$ |
| Probas-mean-exact | $0.72$ $± 0.006$ | $0.66$ $± 0.010$ | $0.46$ $± 0.004$ | $0.57$ $± 0.003$ | $0.65$ $± 0.008$ |
| Probas-min | $0.58$ $± 0.003$ | $0.52$ $± 0.008$ | $0.59$ $± 0.002$ | $0.58$ $± 0.008$ | $0.65$ $± 0.014$ |
| Probas-min-exact | $0.76$ $± 0.004$ | $0.68$ $± 0.010$ | $0.46$ $± 0.005$ | $0.57$ $± 0.003$ | $0.66$ $± 0.008$ |
| Probas-max | $0.50$ $± 0.005$ | $0.53$ $± 0.003$ | $0.48$ $± 0.007$ | $0.52$ $± 0.007$ | $0.51$ $± 0.005$ |
| Probas-max-exact | $0.46$ $± 0.010$ | $0.46$ $± 0.010$ | $0.48$ $± 0.004$ | $0.53$ $± 0.004$ | $0.52$ $± 0.018$ |
| p(True) | $0.54$ $± 0.006$ | $0.54$ $± 0.004$ | $0.53$ $± 0.003$ | $0.58$ $± 0.003$ | $0.57$ $± 0.006$ |
| p(True)-exact | $0.60$ $± 0.008$ | $0.48$ $± 0.005$ | $0.57$ $± 0.011$ | $0.65$ $± 0.004$ | $0.57$ $± 0.009$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.72$ $± 0.005$ | $0.64$ $± 0.005$ | $0.74$ $± 0.005$ | $0.85$ $± 0.004$ | $0.82$ $± 0.006$ |
| Before last generated [-2] | $0.73$ $± 0.006$ | $0.64$ $± 0.004$ | $0.76$ $± 0.004$ | $0.87$ $± 0.002$ | $0.84$ $± 0.009$ |
| End of question | $0.80$ $± 0.003$ | $0.63$ $± 0.003$ | $0.71$ $± 0.007$ | $0.79$ $± 0.004$ | $0.85$ $± 0.010$ |
| Exact answer last | $0.85$ $± 0.003$ | $0.75$ $± 0.006$ | **0.84** $± 0.005$ | **0.93** $± 0.003$ | $0.86$ $± 0.003$ |
| Exact answer last+1 | **0.85** $± 0.002$ | **0.76** $± 0.004$ | $0.80$ $± 0.004$ | $0.92$ $± 0.004$ | **0.87** $± 0.006$ |
Table 6: Comparison of error detection performance (AUC) on Llama-8b.
| | Llama-8b | | | | |
| --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | Movies | IMDB |
| Logits-mean | $0.58$ $± 0.006$ | $0.44$ $± 0.015$ | $0.43$ $± 0.026$ | $0.64$ $± 0.008$ | $0.77$ $± 0.007$ |
| Logits-mean-exact | $0.63$ $± 0.007$ | $0.50$ $± 0.015$ | $0.50$ $± 0.028$ | $0.64$ $± 0.008$ | $0.77$ $± 0.007$ |
| Logits-min | $0.75$ $± 0.007$ | $0.50$ $± 0.022$ | $0.45$ $± 0.042$ | $0.73$ $± 0.005$ | $0.73$ $± 0.007$ |
| Logits-min-exact | $0.76$ $± 0.003$ | $0.53$ $± 0.009$ | $0.75$ $± 0.022$ | $0.73$ $± 0.005$ | $0.77$ $± 0.007$ |
| Logits-max | $0.48$ $± 0.006$ | $0.48$ $± 0.009$ | $0.42$ $± 0.027$ | $0.53$ $± 0.005$ | $0.72$ $± 0.007$ |
| Logits-max-exact | $0.52$ $± 0.007$ | $0.49$ $± 0.014$ | $0.35$ $± 0.026$ | $0.53$ $± 0.005$ | $0.77$ $± 0.007$ |
| Probas-mean | $0.64$ $± 0.006$ | $0.41$ $± 0.008$ | $0.61$ $± 0.029$ | $0.71$ $± 0.007$ | $0.70$ $± 0.008$ |
| Probas-mean-exact | $0.72$ $± 0.005$ | $0.50$ $± 0.018$ | $0.54$ $± 0.026$ | $0.72$ $± 0.006$ | $0.88$ $± 0.003$ |
| Probas-min | $0.79$ $± 0.008$ | $0.43$ $± 0.004$ | $0.75$ $± 0.044$ | $0.74$ $± 0.005$ | $0.68$ $± 0.005$ |
| Probas-min-exact | $0.82$ $± 0.003$ | $0.53$ $± 0.014$ | $0.78$ $± 0.022$ | $0.74$ $± 0.005$ | $0.88$ $± 0.003$ |
| Probas-max | $0.49$ $± 0.006$ | $0.50$ $± 0.009$ | $0.46$ $± 0.032$ | $0.53$ $± 0.007$ | $0.60$ $± 0.009$ |
| Probas-max-exact | $0.53$ $± 0.008$ | $0.50$ $± 0.018$ | $0.36$ $± 0.032$ | $0.54$ $± 0.007$ | $0.88$ $± 0.003$ |
| p(True) | $0.62$ $± 0.005$ | $0.48$ $± 0.011$ | $0.53$ $± 0.027$ | $0.61$ $± 0.005$ | $0.51$ $± 0.010$ |
| p(True)-exact | $0.67$ $± 0.002$ | $0.53$ $± 0.017$ | $0.63$ $± 0.028$ | $0.58$ $± 0.005$ | $0.52$ $± 0.008$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.77$ $± 0.005$ | $0.59$ $± 0.024$ | $0.83$ $± 0.013$ | $0.82$ $± 0.005$ | $0.94$ $± 0.002$ |
| Before last generated [-2] | $0.76$ $± 0.012$ | $0.58$ $± 0.021$ | $0.82$ $± 0.032$ | $0.79$ $± 0.004$ | $0.96$ $± 0.002$ |
| End of question | $0.73$ $± 0.005$ | $0.77$ $± 0.012$ | $0.80$ $± 0.027$ | $0.78$ $± 0.005$ | $0.68$ $± 0.009$ |
| Exact answer last | **0.82** $± 0.006$ | **0.91** $± 0.007$ | **0.96** $± 0.010$ | $0.80$ $± 0.005$ | **0.97** $± 0.001$ |
| Exact answer last+1 | $0.82$ $± 0.006$ | $0.86$ $± 0.008$ | $0.95$ $± 0.007$ | **0.82** $± 0.006$ | $0.95$ $± 0.003$ |
| | HotpotQA | HotpotQA-WC | Winogrande | NLI | NQ-WC |
| Logits-mean | $0.65$ $± 0.004$ | $0.62$ $± 0.006$ | $0.48$ $± 0.003$ | $0.47$ $± 0.002$ | $0.53$ $± 0.010$ |
| Logits-mean-exact | $0.55$ $± 0.003$ | $0.54$ $± 0.006$ | $0.49$ $± 0.004$ | $0.48$ $± 0.002$ | $0.58$ $± 0.009$ |
| Logits-min | $0.57$ $± 0.004$ | $0.49$ $± 0.003$ | $0.48$ $± 0.003$ | $0.48$ $± 0.007$ | $0.58$ $± 0.009$ |
| Logits-min-exact | $0.69$ $± 0.002$ | $0.68$ $± 0.006$ | $0.49$ $± 0.003$ | $0.48$ $± 0.007$ | $0.61$ $± 0.010$ |
| Logits-max | $0.61$ $± 0.005$ | $0.60$ $± 0.004$ | $0.48$ $± 0.003$ | $0.52$ $± 0.003$ | $0.51$ $± 0.008$ |
| Logits-max-exact | $0.47$ $± 0.003$ | $0.46$ $± 0.005$ | $0.49$ $± 0.004$ | $0.51$ $± 0.002$ | $0.54$ $± 0.005$ |
| Probas-mean | $0.67$ $± 0.002$ | $0.62$ $± 0.006$ | $0.49$ $± 0.002$ | $0.48$ $± 0.004$ | $0.57$ $± 0.003$ |
| Probas-mean-exact | $0.62$ $± 0.005$ | $0.56$ $± 0.005$ | $0.51$ $± 0.002$ | $0.46$ $± 0.006$ | $0.64$ $± 0.007$ |
| Probas-min | $0.62$ $± 0.006$ | $0.51$ $± 0.002$ | $0.49$ $± 0.003$ | $0.50$ $± 0.010$ | $0.62$ $± 0.005$ |
| Probas-min-exact | $0.76$ $± 0.005$ | $0.67$ $± 0.004$ | $0.51$ $± 0.002$ | $0.50$ $± 0.010$ | $0.69$ $± 0.008$ |
| Probas-max | $0.61$ $± 0.004$ | $0.58$ $± 0.004$ | $0.48$ $± 0.002$ | $0.48$ $± 0.003$ | $0.51$ $± 0.012$ |
| Probas-max-exact | $0.49$ $± 0.003$ | $0.44$ $± 0.004$ | $0.51$ $± 0.003$ | $0.47$ $± 0.002$ | $0.56$ $± 0.005$ |
| p(True) | $0.52$ $± 0.007$ | $0.45$ $± 0.005$ | $0.54$ $± 0.004$ | $0.54$ $± 0.007$ | $0.56$ $± 0.006$ |
| p(True)-exact | $0.58$ $± 0.005$ | $0.50$ $± 0.007$ | $0.64$ $± 0.004$ | $0.62$ $± 0.005$ | $0.61$ $± 0.002$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.76$ $± 0.007$ | $0.57$ $± 0.006$ | $0.59$ $± 0.006$ | $0.89$ $± 0.002$ | $0.66$ $± 0.010$ |
| Before last generated [-2] | $0.74$ $± 0.007$ | $0.58$ $± 0.005$ | $0.59$ $± 0.005$ | $0.94$ $± 0.002$ | $0.63$ $± 0.008$ |
| End of question | $0.71$ $± 0.006$ | $0.53$ $± 0.004$ | $0.48$ $± 0.003$ | $0.91$ $± 0.001$ | $0.66$ $± 0.004$ |
| Exact answer last | $0.81$ $± 0.006$ | $0.77$ $± 0.004$ | **0.65** $± 0.004$ | **0.94** $± 0.002$ | **0.75** $± 0.008$ |
| Exact answer last+1 | **0.82** $± 0.004$ | **0.79** $± 0.001$ | $0.57$ $± 0.004$ | $0.90$ $± 0.002$ | $0.75$ $± 0.007$ |
Table 7: Comparison of error detection performance (AUC) on Llama-8b-Instruct.
| | Llama-8b-Instruct | | | | |
| --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | Movies | IMDB |
| Logits-mean | $0.66$ $± 0.005$ | $0.60$ $± 0.026$ | $0.75$ $± 0.018$ | $0.75$ $± 0.005$ | $0.59$ $± 0.017$ |
| Logits-mean-exact | $0.71$ $± 0.006$ | $0.55$ $± 0.019$ | $0.80$ $± 0.021$ | $0.72$ $± 0.004$ | $0.88$ $± 0.012$ |
| Logits-min | $0.74$ $± 0.007$ | $0.61$ $± 0.024$ | $0.75$ $± 0.016$ | $0.71$ $± 0.005$ | $0.55$ $± 0.016$ |
| Logits-min-exact | $0.79$ $± 0.006$ | $0.61$ $± 0.019$ | $0.89$ $± 0.018$ | $0.77$ $± 0.006$ | $0.88$ $± 0.012$ |
| Logits-max | $0.54$ $± 0.007$ | $0.55$ $± 0.013$ | $0.73$ $± 0.027$ | $0.67$ $± 0.003$ | $0.51$ $± 0.009$ |
| Logits-max-exact | $0.58$ $± 0.005$ | $0.54$ $± 0.019$ | $0.64$ $± 0.014$ | $0.61$ $± 0.003$ | $0.88$ $± 0.012$ |
| Probas-mean | $0.67$ $± 0.006$ | $0.63$ $± 0.024$ | $0.66$ $± 0.033$ | $0.73$ $± 0.006$ | $0.73$ $± 0.015$ |
| Probas-mean-exact | $0.75$ $± 0.009$ | $0.61$ $± 0.014$ | $0.83$ $± 0.022$ | $0.74$ $± 0.005$ | $0.74$ $± 0.021$ |
| Probas-min | $0.67$ $± 0.009$ | $0.65$ $± 0.019$ | $0.64$ $± 0.036$ | $0.65$ $± 0.004$ | $0.57$ $± 0.016$ |
| Probas-min-exact | $0.79$ $± 0.008$ | $0.62$ $± 0.014$ | $0.86$ $± 0.024$ | $0.74$ $± 0.005$ | $0.74$ $± 0.021$ |
| Probas-max | $0.54$ $± 0.003$ | $0.49$ $± 0.020$ | $0.57$ $± 0.022$ | $0.64$ $± 0.006$ | $0.49$ $± 0.008$ |
| Probas-max-exact | $0.56$ $± 0.007$ | $0.55$ $± 0.016$ | $0.57$ $± 0.018$ | $0.61$ $± 0.003$ | $0.74$ $± 0.021$ |
| p(True) | $0.73$ $± 0.008$ | $0.59$ $± 0.020$ | $0.62$ $± 0.017$ | $0.66$ $± 0.004$ | $0.60$ $± 0.006$ |
| p(True)-exact | $0.73$ $± 0.005$ | $0.63$ $± 0.014$ | $0.59$ $± 0.018$ | $0.63$ $± 0.006$ | $0.76$ $± 0.004$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.81$ $± 0.005$ | $0.86$ $± 0.007$ | $0.82$ $± 0.016$ | $0.78$ $± 0.004$ | $0.81$ $± 0.014$ |
| Before last generated [-2] | $0.75$ $± 0.005$ | $0.88$ $± 0.005$ | $0.79$ $± 0.020$ | $0.82$ $± 0.005$ | $0.83$ $± 0.006$ |
| End of question | $0.77$ $± 0.007$ | $0.80$ $± 0.018$ | $0.72$ $± 0.023$ | $0.76$ $± 0.005$ | $0.87$ $± 0.006$ |
| Exact answer last | **0.83** $± 0.002$ | **0.93** $± 0.004$ | **0.95** $± 0.027$ | $0.85$ $± 0.005$ | **0.96** $± 0.003$ |
| Exact answer last+1 | $0.83$ $± 0.006$ | $0.90$ $± 0.005$ | $0.94$ $± 0.023$ | **0.86** $± 0.004$ | $0.95$ $± 0.004$ |
| | HotpotQA | HotpotQA-WC | Winogrande | NLI | NQ-WC |
| Logits-mean | $0.65$ $± 0.002$ | $0.56$ $± 0.004$ | $0.58$ $± 0.007$ | $0.59$ $± 0.009$ | $0.65$ $± 0.006$ |
| Logits-mean-exact | $0.66$ $± 0.008$ | $0.57$ $± 0.005$ | $0.48$ $± 0.003$ | $0.49$ $± 0.010$ | $0.67$ $± 0.005$ |
| Logits-min | $0.67$ $± 0.008$ | $0.55$ $± 0.007$ | $0.60$ $± 0.008$ | $0.53$ $± 0.009$ | $0.68$ $± 0.004$ |
| Logits-min-exact | $0.76$ $± 0.010$ | $0.65$ $± 0.010$ | $0.48$ $± 0.004$ | $0.50$ $± 0.009$ | $0.68$ $± 0.004$ |
| Logits-max | $0.59$ $± 0.005$ | $0.56$ $± 0.005$ | $0.46$ $± 0.004$ | $0.55$ $± 0.013$ | $0.56$ $± 0.006$ |
| Logits-max-exact | $0.52$ $± 0.006$ | $0.48$ $± 0.002$ | $0.48$ $± 0.003$ | $0.49$ $± 0.009$ | $0.63$ $± 0.008$ |
| Probas-mean | $0.61$ $± 0.002$ | $0.56$ $± 0.010$ | $0.57$ $± 0.007$ | $0.58$ $± 0.007$ | $0.65$ $± 0.007$ |
| Probas-mean-exact | $0.68$ $± 0.008$ | $0.65$ $± 0.006$ | $0.51$ $± 0.006$ | $0.57$ $± 0.009$ | $0.67$ $± 0.003$ |
| Probas-min | $0.60$ $± 0.004$ | $0.51$ $± 0.007$ | $0.59$ $± 0.007$ | $0.55$ $± 0.005$ | $0.64$ $± 0.008$ |
| Probas-min-exact | $0.74$ $± 0.007$ | $0.67$ $± 0.007$ | $0.51$ $± 0.006$ | $0.59$ $± 0.008$ | $0.66$ $± 0.004$ |
| Probas-max | $0.56$ $± 0.005$ | $0.53$ $± 0.005$ | $0.46$ $± 0.003$ | $0.51$ $± 0.004$ | $0.55$ $± 0.004$ |
| Probas-max-exact | $0.49$ $± 0.007$ | $0.47$ $± 0.002$ | $0.51$ $± 0.005$ | $0.50$ $± 0.009$ | $0.62$ $± 0.006$ |
| p(True) | $0.55$ $± 0.005$ | $0.55$ $± 0.008$ | $0.47$ $± 0.002$ | $0.54$ $± 0.006$ | $0.71$ $± 0.003$ |
| p(True)-exact | $0.55$ $± 0.004$ | $0.50$ $± 0.005$ | $0.50$ $± 0.008$ | $0.50$ $± 0.003$ | $0.67$ $± 0.007$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.77$ $± 0.005$ | $0.68$ $± 0.006$ | $0.69$ $± 0.006$ | $0.78$ $± 0.005$ | $0.77$ $± 0.009$ |
| Before last generated [-2] | $0.76$ $± 0.002$ | $0.69$ $± 0.005$ | $0.67$ $± 0.008$ | $0.79$ $± 0.004$ | $0.75$ $± 0.007$ |
| End of question | $0.78$ $± 0.004$ | $0.60$ $± 0.003$ | $0.65$ $± 0.004$ | $0.74$ $± 0.002$ | $0.75$ $± 0.011$ |
| Exact answer last | **0.83** $± 0.005$ | **0.76** $± 0.003$ | **0.78** $± 0.007$ | **0.91** $± 0.005$ | **0.78** $± 0.006$ |
| Exact answer last+1 | $0.83$ $± 0.002$ | $0.76$ $± 0.006$ | $0.70$ $± 0.006$ | $0.90$ $± 0.004$ | $0.78$ $± 0.007$ |
Appendix C Full Generalization Results
Figures 7, 8 and 9 present the generalization results for the remaining models. While these results exhibit similar high-level patterns to those found in the main paper on Mistral-7b-instruct, notable differences suggest that these models may possess different mechanisms for encoding truthfulness.
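The protocol behind these heatmaps is: fit a probe on (representation, correctness) pairs from one dataset, then compute its AUC on every other dataset, filling a train-by-test matrix. A toy sketch with synthetic features and a simple mean-difference probe (the dataset names and probe choice here are illustrative assumptions, not the paper's implementation):

```python
import random

def fit_direction(X, y):
    """Mean-difference probe: score examples along (mu_pos - mu_neg)."""
    d = len(X[0])
    mu = {0: [0.0] * d, 1: [0.0] * d}
    n = {0: 0, 1: 0}
    for x, t in zip(X, y):
        n[t] += 1
        mu[t] = [m + xi for m, xi in zip(mu[t], x)]
    return [mu[1][j] / n[1] - mu[0][j] / n[0] for j in range(d)]

def auc(scores, y):
    """AUC via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def make_dataset(direction, rng, n=200):
    """Synthetic dataset whose correct/incorrect split lies along `direction`."""
    X, y = [], []
    for _ in range(n):
        t = int(rng.random() < 0.5)
        X.append([rng.gauss(dj if t else -dj, 1.0) for dj in direction])
        y.append(t)
    return X, y

rng = random.Random(0)
# Two datasets encoding "truthfulness" along orthogonal directions.
datasets = {"A": make_dataset([2.0, 0.0], rng), "B": make_dataset([0.0, 2.0], rng)}
for train_name, (Xtr, ytr) in datasets.items():
    w = fit_direction(Xtr, ytr)
    for test_name, (Xte, yte) in datasets.items():
        scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in Xte]
        print(f"{train_name} -> {test_name}: AUC = {auc(scores, yte):.2f}")
```

Two datasets whose truthfulness directions are orthogonal reproduce the qualitative pattern in the figures: high diagonal AUC, near-chance transfer off the diagonal.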
<details>
<summary>extracted/6450693/figures/generalization/mistral.png Details</summary>

### Visual Description
Heatmap of generalization performance for Mistral-7b. The y-axis is the train dataset, the x-axis the test dataset, and the color scale runs from 0.0 (blue) to 1.0 (dark red). Cell values (AUC):

| Train \ Test | TriviaQA | HotpotQA | Movies | Winobias | Winogrande | NLI | IMDB | Math | HotpotQA_WC | NQ_WC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TriviaQA | 0.84 | 0.64 | 0.73 | 0.50 | 0.54 | 0.51 | 0.80 | 0.72 | 0.54 | 0.66 |
| HotpotQA | 0.77 | 0.80 | 0.72 | 0.53 | 0.53 | 0.52 | 0.66 | 0.56 | 0.61 | 0.69 |
| Movies | 0.68 | 0.57 | 0.80 | 0.51 | 0.54 | 0.53 | 0.78 | 0.55 | 0.56 | 0.64 |
| Winobias | 0.57 | 0.63 | 0.65 | 0.89 | 0.53 | 0.52 | 0.80 | 0.60 | 0.52 | 0.56 |
| Winogrande | 0.52 | 0.51 | 0.55 | 0.55 | 0.66 | 0.52 | 0.89 | 0.54 | 0.53 | 0.52 |
| NLI | 0.58 | 0.58 | 0.58 | 0.51 | 0.50 | 0.88 | 0.56 | 0.75 | 0.53 | 0.51 |
| IMDB | 0.60 | 0.50 | 0.57 | 0.63 | 0.54 | 0.52 | 0.95 | 0.78 | 0.55 | 0.50 |
| Math | 0.58 | 0.64 | 0.56 | 0.57 | 0.52 | 0.55 | 0.61 | 0.96 | 0.55 | 0.60 |
| HotpotQA_WC | 0.65 | 0.69 | 0.62 | 0.53 | 0.53 | 0.55 | 0.81 | 0.54 | 0.74 | 0.64 |
| NQ_WC | 0.62 | 0.67 | 0.54 | 0.52 | 0.56 | 0.56 | 0.68 | 0.51 | 0.56 | 0.84 |

In most rows the diagonal cell (train and test on the same dataset) is the largest (e.g. Winobias 0.89, IMDB 0.95, Math 0.96). The factual-QA datasets (TriviaQA, HotpotQA, Movies) transfer moderately to one another, while many other off-diagonal cells sit close to 0.5.
</details>
(a) Raw AUC values. Values above $0.5$ indicate some generalization.
<details>
<summary>extracted/6450693/figures/generalization/mistral_reduced.png Details</summary>

### Visual Description
Heatmap of the performance (AUC) difference between the probe and the logit-based method for Mistral-7b. The y-axis is the train dataset, the x-axis the test dataset (TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA_WC, NQ_WC), and the color scale runs from -0.3 (blue) through 0 (white) to 0.3 (red). Diagonal values: Winobias 0.36, NLI 0.35, NQ_WC 0.33, HotpotQA 0.17, TriviaQA 0.15, Movies 0.13, HotpotQA_WC 0.09, Math 0.08, Winogrande 0.01, IMDB -0.01. Most off-diagonal cells are near zero and several are slightly negative, indicating that the probe's advantage over the logit-based method largely does not transfer across datasets.
</details>
(b) Performance (AUC) difference of the probe and the logit-based method. Values above $0$ indicate generalization beyond the logit-based method.
Figure 7: Generalization between datasets, Mistral-7b.
<details>
<summary>extracted/6450693/figures/generalization/llama.png Details</summary>

### Visual Description
## Heatmap: Train-Test Dataset Performance Correlation
### Overview
This image presents a heatmap visualizing the correlation between different training datasets and test datasets. The color intensity represents the correlation coefficient, ranging from 0.0 to 1.0. The heatmap displays the performance of various models trained on one dataset and evaluated on another.
### Components/Axes
* **X-axis:** Test dataset. Categories are: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA\_WC, NQ\_WC.
* **Y-axis:** Train dataset. Categories are: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA\_WC, NQ\_WC.
* **Color Scale (Legend):** Located on the right side of the heatmap. Ranges from blue (0.0) to red (1.0), representing low to high correlation.
* 0.0 is represented by a light blue color.
* 1.0 is represented by a dark red color.
* **Labels:** Each cell in the heatmap contains a numerical value representing the correlation coefficient.
### Detailed Analysis
The heatmap shows the correlation coefficients between each pair of train and test datasets. Here's a breakdown of the values, reading row by row (Train dataset vs. Test datasets):
* **TriviaQA (Train):**
* TriviaQA (Test): 0.82
* HotpotQA (Test): 0.69
* Movies (Test): 0.69
* Winobias (Test): 0.53
* Winogrande (Test): 0.52
* NLI (Test): 0.52
* IMDB (Test): 0.59
* Math (Test): 0.82
* HotpotQA\_WC (Test): 0.50
* NQ\_WC (Test): 0.55
* **HotpotQA (Train):**
* TriviaQA (Test): 0.76
* HotpotQA (Test): 0.82
* Movies (Test): 0.70
* Winobias (Test): 0.54
* Winogrande (Test): 0.53
* NLI (Test): 0.51
* IMDB (Test): 0.59
* Math (Test): 0.79
* HotpotQA\_WC (Test): 0.63
* NQ\_WC (Test): 0.55
* **Movies (Train):**
* TriviaQA (Test): 0.70
* HotpotQA (Test): 0.58
* Movies (Test): 0.82
* Winobias (Test): 0.60
* Winogrande (Test): 0.51
* NLI (Test): 0.56
* IMDB (Test): 0.54
* Math (Test): 0.54
* HotpotQA\_WC (Test): 0.52
* NQ\_WC (Test): 0.56
* **Winobias (Train):**
* TriviaQA (Test): 0.63
* HotpotQA (Test): 0.60
* Movies (Test): 0.60
* Winobias (Test): 0.91
* Winogrande (Test): 0.53
* NLI (Test): 0.52
* IMDB (Test): 0.77
* Math (Test): 0.74
* HotpotQA\_WC (Test): 0.56
* NQ\_WC (Test): 0.51
* **Winogrande (Train):**
* TriviaQA (Test): 0.61
* HotpotQA (Test): 0.55
* Movies (Test): 0.60
* Winobias (Test): 0.65
* Winogrande (Test): 0.62
* NLI (Test): 0.86
* IMDB (Test): 0.54
* Math (Test): 0.50
* HotpotQA\_WC (Test): 0.53
* NQ\_WC (Test): 0.53
* **NLI (Train):**
* TriviaQA (Test): 0.57
* HotpotQA (Test): 0.53
* Movies (Test): 0.59
* Winobias (Test): 0.57
* Winogrande (Test): 0.52
* NLI (Test): 0.94
* IMDB (Test): 0.70
* Math (Test): 0.56
* HotpotQA\_WC (Test): 0.51
* NQ\_WC (Test): 0.53
* **IMDB (Train):**
* TriviaQA (Test): 0.60
* HotpotQA (Test): 0.53
* Movies (Test): 0.62
* Winobias (Test): 0.66
* Winogrande (Test): 0.52
* NLI (Test): 0.67
* IMDB (Test): 0.97
* Math (Test): 0.57
* HotpotQA\_WC (Test): 0.58
* NQ\_WC (Test): 0.52
* **Math (Train):**
* TriviaQA (Test): 0.62
* HotpotQA (Test): 0.53
* Movies (Test): 0.57
* Winobias (Test): 0.51
* Winogrande (Test): 0.51
* NLI (Test): 0.51
* IMDB (Test): 0.74
* Math (Test): 0.96
* HotpotQA\_WC (Test): 0.54
* NQ\_WC (Test): 0.56
* **HotpotQA\_WC (Train):**
* TriviaQA (Test): 0.67
* HotpotQA (Test): 0.68
* Movies (Test): 0.55
* Winobias (Test): 0.51
* Winogrande (Test): 0.53
* NLI (Test): 0.58
* IMDB (Test): 0.78
* Math (Test): 0.75
* HotpotQA\_WC (Test): 0.77
* NQ\_WC (Test): 0.50
* **NQ\_WC (Train):**
* TriviaQA (Test): 0.66
* HotpotQA (Test): 0.56
* Movies (Test): 0.68
* Winobias (Test): 0.58
* Winogrande (Test): 0.55
* NLI (Test): 0.53
* IMDB (Test): 0.53
* Math (Test): 0.56
* HotpotQA\_WC (Test): 0.54
* NQ\_WC (Test): 0.75
### Key Observations
* The highest AUC scores are observed when a dataset is used for both training and testing (diagonal elements), e.g., Winobias-Winobias: 0.91, NLI-NLI: 0.94, IMDB-IMDB: 0.97, Math-Math: 0.96.
* The AUC for a probe trained on TriviaQA and tested on Math is relatively high (0.82), suggesting some shared characteristics or transferability between these datasets.
* AUC values are generally lower for probes trained on datasets like Winogrande, NLI, and Math when tested on datasets like HotpotQA or Movies.
* The "WC" datasets (HotpotQA\_WC and NQ\_WC) show moderate AUC values with other datasets, generally lower than the original datasets.
### Interpretation
This heatmap demonstrates the degree to which probes trained on one dataset generalize to other datasets. High AUC values indicate that a probe trained on one dataset is likely to perform well on another. The diagonal dominance shows that probes perform best when tested on data similar to their training data, while the lower off-diagonal values highlight the challenges of cross-dataset transfer and the importance of dataset selection. The "WC" variants, representing a different data collection or processing setup, exhibit lower values, suggesting they have different characteristics than the original datasets. The heatmap thus provides a quantitative assessment of dataset similarity and transferability that can guide probe training and evaluation.
</details>
(a) Raw AUC values. Values above $0.5$ indicate some generalization.
<details>
<summary>extracted/6450693/figures/generalization/llama_reduced.png Details</summary>

### Visual Description
## Heatmap: Probe vs. Logit-Based AUC Difference
### Overview
The image presents a heatmap of the difference between the probe's AUC and the logit-based method's AUC for each train/test dataset pair. The rows represent the training datasets, and the columns represent the test datasets. The color intensity indicates the AUC difference, with red representing a positive difference (the probe outperforms the logit-based method) and blue a negative one. The value within each cell gives the AUC difference.
### Components/Axes
* **X-axis (Horizontal):** Test dataset. Categories are: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA\_WC, NQ\_WC.
* **Y-axis (Vertical):** Train dataset. Categories are: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA\_WC, NQ\_WC.
* **Color Scale (Right):** Represents the AUC difference, ranging from approximately -0.4 to 0.4.
* Dark Red: ~0.4
* Light Red: ~0.2
* White: ~0.0
* Light Blue: ~-0.1
* Dark Blue: ~-0.2
* **Labels:** Each cell contains a numerical value representing the AUC difference between the corresponding train and test datasets.
### Detailed Analysis
The heatmap displays the AUC difference for each pair of training and testing datasets. Here's a breakdown of the values, row by row:
* **TriviaQA:** -0.06, -0.01, -0.04, -0.01, 0.04, 0.04, -0.19, 0.07, -0.18, -0.05
* **HotpotQA:** 0.00, 0.12, -0.03, 0.01, 0.05, 0.00, -0.19, 0.04, -0.05, -0.06
* **Movies:** -0.06, -0.11, 0.08, 0.06, 0.02, -0.09, -0.24, -0.20, -0.16, -0.04
* **Winobias:** -0.14, -0.09, -0.11, 0.37, 0.04, 0.03, -0.12, -0.09, -0.12, -0.08
* **Winogrande:** -0.16, -0.14, 0.12, 0.16, 0.14, 0.09, -0.21, -0.18, -0.08
* **NLI:** -0.19, -0.17, -0.15, 0.04, 0.36, -0.08, -0.19, -0.17, -0.07
* **IMDB:** -0.17, -0.17, -0.12, 0.13, 0.03, 0.19, 0.20, -0.18, -0.10, -0.08
* **Math:** -0.15, -0.16, -0.16, -0.03, 0.02, 0.03, -0.21, -0.14, -0.05
* **HotpotQA\_WC:** -0.09, -0.01, -0.18, -0.12, -0.08, 0.01, 0.01, 0.09, -0.09
* **NQ\_WC:** -0.11, -0.13, -0.06, -0.08, 0.06, -0.24, -0.19, -0.14, -0.04
**Trends:**
* The diagonal elements (where train and test datasets are the same) are mostly close to zero, indicating that in-domain the probe offers only a modest gain over the logit-based method.
* Winobias and NLI are exceptions, showing relatively large positive values (0.37 and 0.36, respectively) when used as training data.
* Probes trained on Movies or Math consistently exhibit negative differences on several test datasets.
* The differences are generally small, with most values falling between -0.2 and 0.2.
### Key Observations
* The largest positive difference occurs when Winobias is both the train and test dataset (0.37).
* The most negative differences occur for probes trained on Movies, e.g., tested on IMDB (-0.24) and on Math (-0.20).
* The differences are generally small, suggesting that the probe's advantage over the logit-based method on one dataset does not transfer strongly to others.
### Interpretation
This heatmap shows where the trained probe adds value over the logit-based baseline. The mostly small differences indicate that, out of domain, the probe rarely extracts much more truthfulness information than the output logits alone.
The clear in-domain gains for Winobias and NLI suggest the probe captures dataset-specific truthfulness features that the logit-based method misses. The negative entries for probes trained on Movies or Math indicate that these probes transfer poorly, sometimes underperforming the baseline on other datasets.
Taken together, the figure supports the view that truthfulness encoding is multifaceted: a probe's advantage is largely tied to the skill set of its training dataset rather than reflecting a universal truthfulness signal.
</details>
(b) Performance (AUC) difference between the probe and the logit-based method. Values above $0$ indicate generalization beyond the logit-based method.
Figure 8: Generalization between datasets, Llama-3-8b.
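The generalization grids above can be reproduced in outline by training a linear probe on one dataset's hidden states and scoring it on another's. The sketch below uses synthetic features and labels in place of real LLM hidden states and correctness annotations; all names and data are illustrative, not taken from the released code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_dataset(n=200, d=32):
    # Stand-in for (hidden-state, correctness-label) pairs of one dataset.
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

datasets = {name: make_dataset() for name in ["TriviaQA", "HotpotQA", "Math"]}

auc = {}
for train_name, (X_tr, y_tr) in datasets.items():
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    for test_name, (X_te, y_te) in datasets.items():
        scores = probe.predict_proba(X_te)[:, 1]
        auc[(train_name, test_name)] = roc_auc_score(y_te, scores)

# Diagonal (in-domain) entries should sit well above 0.5; off-diagonal
# entries measure how far the learned truthfulness direction transfers.
```

Plotting `auc` as a matrix (e.g., with `matplotlib`'s `imshow`) yields a heatmap of the kind shown in Figures 8 and 9(a).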
<details>
<summary>extracted/6450693/figures/generalization/llama_instruct.png Details</summary>

### Visual Description
## Heatmap: Cross-Dataset Generalization Performance
### Overview
This image presents a heatmap visualizing the cross-dataset generalization performance of the error-detection probe. Each cell shows the AUC obtained when training on one dataset (rows) and testing on another (columns). The color intensity represents the AUC, ranging from 0.0 to 1.0, with darker shades of red indicating higher values and blue indicating lower values.
### Components/Axes
* **X-axis (Horizontal):** Test dataset. Categories are: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA\_WC, NQ\_WC.
* **Y-axis (Vertical):** Train dataset. Categories are: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA\_WC, NQ\_WC.
* **Color Scale (Legend):** Located on the right side of the heatmap. Ranges from 0.0 (blue) to 1.0 (red). The scale is marked with values 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Labels:** Each cell in the heatmap contains a numerical value representing the AUC.
### Detailed Analysis
The heatmap shows the AUC scores for training and testing on different datasets, analyzed row by row below, noting trends and specific values.
* **TriviaQA (Row 1):** The diagonal value is 0.84. Performance is relatively high when trained and tested on TriviaQA. AUC decreases on other test datasets, ranging from approximately 0.55 (Winogrande) to 0.74 (HotpotQA).
* **HotpotQA (Row 2):** Diagonal value is 0.78. Similar to TriviaQA, performance is best when trained and tested on HotpotQA. AUC values range from 0.53 (Winogrande) to 0.83 (TriviaQA).
* **Movies (Row 3):** Diagonal value is 0.69. Performance is moderate when trained and tested on Movies. AUC values range from 0.52 (Winogrande) to 0.82 (HotpotQA).
* **Winobias (Row 4):** Diagonal value is 0.93, a very high in-domain AUC. Performance is relatively low on other test datasets, ranging from 0.52 (Winogrande) to 0.57 (TriviaQA).
* **Winogrande (Row 5):** Diagonal value is 0.78. Performance is moderate when trained and tested on Winogrande. AUC values range from 0.52 (HotpotQA\_WC) to 0.69 (NLI).
* **NLI (Row 6):** Diagonal value is 0.91, a high in-domain AUC. Performance is relatively low on other test datasets, ranging from 0.52 (HotpotQA\_WC) to 0.57 (TriviaQA).
* **IMDB (Row 7):** Diagonal value is 0.96, the highest in-domain AUC. Performance is relatively low on other test datasets, ranging from 0.54 (Math) to 0.61 (Movies).
* **Math (Row 8):** Diagonal value is 0.95, a high in-domain AUC. Performance is relatively low on other test datasets, ranging from 0.52 (NQ\_WC) to 0.67 (HotpotQA).
* **HotpotQA\_WC (Row 9):** Diagonal value is 0.76. Performance is moderate when trained and tested on HotpotQA\_WC. AUC values range from 0.53 (NLI) to 0.72 (HotpotQA).
* **NQ\_WC (Row 10):** Diagonal value is 0.78. Performance is moderate when trained and tested on NQ\_WC. AUC values range from 0.55 (Math) to 0.73 (TriviaQA).
### Key Observations
* **High in-domain AUC:** Winobias, NLI, and IMDB exhibit very high AUC when trained and tested on themselves (diagonal values close to 1.0). This suggests the probe performs well when evaluated on data similar to its training data.
* **Low cross-dataset AUC:** Generally, the AUC scores are lower when the test dataset differs from the training dataset, indicating that probes trained on one dataset do not generalize well to other datasets.
* **Winobias and NLI generalization:** Probes trained on Winobias or NLI transfer particularly poorly to most other datasets, suggesting these tasks are very specific.
* **TriviaQA and HotpotQA:** These datasets show relatively high mutual transfer, suggesting some overlap in the types of questions or knowledge required.
### Interpretation
The heatmap demonstrates the challenge of cross-dataset generalization in question answering and natural language inference. Probes trained on one dataset tend to perform worse on others, even within the same general domain, suggesting that datasets differ significantly in characteristics such as question types, knowledge requirements, and data distribution. The high diagonal values indicate that probes achieve strong performance on data similar to their training data, but their ability to generalize to new datasets is limited.
The weak transfer from Winobias and NLI suggests these datasets focus on very specific linguistic phenomena or require specialized reasoning skills. The relatively strong transfer between TriviaQA and HotpotQA suggests they share common characteristics, making it easier for a probe trained on one to generalize to the other.
These results highlight the need for detectors that generalize across datasets and the importance of carefully considering the characteristics of the training data when evaluating detection performance. Further investigation into the specific differences between these datasets could help develop strategies for improving cross-dataset generalization.
</details>
(a) Raw AUC values. Values above $0.5$ indicate some generalization.
<details>
<summary>extracted/6450693/figures/generalization/llama_instruct_reduced.png Details</summary>

### Visual Description
## Heatmap: Probe vs. Logit-Based AUC Difference
### Overview
The image presents a heatmap of the difference between the probe's AUC and the logit-based method's AUC for each train/test dataset pair. The color intensity represents the size and direction of the difference, with red indicating that the probe outperforms the logit-based method and blue indicating that it underperforms. The heatmap is labeled with dataset names along both the x and y axes.
### Components/Axes
* **X-axis:** "Test dataset" - Lists the following datasets: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA_WC, NQ_WC.
* **Y-axis:** "Train dataset" - Lists the following datasets: TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA_WC, NQ_WC.
* **Color Scale (Legend):** Located in the top-right corner. Ranges from -0.3 (dark blue) to 0.3 (dark red), with white representing 0.
* **Cells:** Each cell represents the AUC difference between a specific training dataset (row) and a specific test dataset (column).
### Detailed Analysis
The heatmap displays AUC differences, numerical values ranging from approximately -0.4 to 0.4. Each row (train dataset) is analyzed below with its corresponding values for each column (test dataset).
* **TriviaQA (Train):**
* TriviaQA (Test): -0.05
* HotpotQA (Test): -0.03
* Movies (Test): 0.07
* Winobias (Test): 0.09
* Winogrande (Test): -0.05
* NLI (Test): 0.00
* IMDB (Test): -0.32
* Math (Test): -0.06
* HotpotQA_WC (Test): -0.09
* NQ_WC (Test): 0.02
* **HotpotQA (Train):**
* TriviaQA (Test): 0.07
* HotpotQA (Test): -0.04
* Movies (Test): -0.12
* Winobias (Test): -0.12
* Winogrande (Test): 0.04
* NLI (Test): 0.37
* IMDB (Test): -0.17
* Math (Test): -0.03
* HotpotQA_WC (Test): 0.08
* NQ_WC (Test): -0.03
* **Movies (Train):**
* TriviaQA (Test): -0.10
* HotpotQA (Test): -0.07
* Movies (Test): 0.04
* Winobias (Test): 0.07
* Winogrande (Test): -0.05
* NLI (Test): -0.07
* IMDB (Test): -0.16
* Math (Test): -0.37
* HotpotQA_WC (Test): -0.15
* NQ_WC (Test): -0.08
* **Winobias (Train):**
* TriviaQA (Test): -0.22
* HotpotQA (Test): -0.21
* Movies (Test): 0.28
* Winobias (Test): -0.08
* Winogrande (Test): 0.18
* NLI (Test): -0.18
* IMDB (Test): -0.38
* Math (Test): -0.16
* HotpotQA_WC (Test): -0.12
* NQ_WC (Test): -0.14
* **Winogrande (Train):**
* TriviaQA (Test): -0.25
* HotpotQA (Test): -0.20
* Movies (Test): -0.10
* Winobias (Test): 0.02
* Winogrande (Test): 0.11
* NLI (Test): -0.07
* IMDB (Test): -0.39
* Math (Test): -0.15
* HotpotQA_WC (Test): -0.14
* NQ_WC (Test): -0.09
* **NLI (Train):**
* TriviaQA (Test): -0.24
* HotpotQA (Test): -0.13
* Movies (Test): -0.17
* Winobias (Test): -0.02
* Winogrande (Test): 0.03
* NLI (Test): 0.32
* IMDB (Test): -0.07
* Math (Test): -0.30
* HotpotQA_WC (Test): -0.15
* NQ_WC (Test): -0.10
* **IMDB (Train):**
* TriviaQA (Test): -0.24
* HotpotQA (Test): -0.16
* Movies (Test): -0.12
* Winobias (Test): 0.05
* Winogrande (Test): -0.03
* NLI (Test): -0.04
* IMDB (Test): 0.08
* Math (Test): -0.35
* HotpotQA_WC (Test): -0.06
* NQ_WC (Test): -0.07
* **Math (Train):**
* TriviaQA (Test): -0.21
* HotpotQA (Test): -0.09
* Movies (Test): -0.22
* Winobias (Test): -0.07
* Winogrande (Test): -0.01
* NLI (Test): -0.34
* IMDB (Test): -0.06
* Math (Test): 0.04
* HotpotQA_WC (Test): -0.16
* NQ_WC (Test): -0.04
* **HotpotQA_WC (Train):**
* TriviaQA (Test): -0.19
* HotpotQA (Test): -0.05
* Movies (Test): -0.16
* Winobias (Test): -0.03
* Winogrande (Test): -0.05
* NLI (Test): -0.21
* IMDB (Test): -0.06
* Math (Test): 0.08
* HotpotQA_WC (Test): -0.12
* NQ_WC (Test): -0.02
* **NQ_WC (Train):**
* TriviaQA (Test): -0.06
* HotpotQA (Test): -0.05
* Movies (Test): -0.10
* Winobias (Test): 0.08
* Winogrande (Test): -0.08
* NLI (Test): -0.36
* IMDB (Test): -0.13
* Math (Test): 0.01
* HotpotQA_WC (Test): -0.03
* NQ_WC (Test): -0.03
### Key Observations
* **Strong negative differences:** Cells involving IMDB are often strongly negative (e.g., -0.39 for Winogrande as train with IMDB as test, -0.35 for IMDB as train with Math as test), meaning the probe underperforms the logit-based baseline there.
* **NLI in-domain gain:** NLI exhibits the strongest positive difference with itself (0.32), as expected.
* **Winobias and Movies:** Training on Winobias and testing on Movies yields a relatively large positive difference (0.28).
* **Movies and Math:** Training on Movies and testing on Math yields a strongly negative difference (-0.37).
* **WC datasets:** The "WC" (presumably "With Context") datasets (HotpotQA_WC and NQ_WC) generally show smaller differences than their non-WC counterparts.
### Interpretation
This heatmap shows where the trained probe adds value over the logit-based baseline across train/test pairs. A positive difference means the probe generalizes beyond the logits for that pair; a negative difference means the logit-based method is the stronger detector there.
The strongly negative entries involving IMDB suggest that probes transfer poorly to and from it, so the logit-based method is often preferable out of domain. The positive in-domain value for NLI confirms that the probe captures dataset-specific truthfulness features there.
The smaller differences for the "WC" datasets suggest that adding context alters the relationship between the probe and the logit-based method, potentially making generalization more challenging.
Overall, the mostly negative off-diagonal values reinforce the conclusion that probe-based detectors are dataset-specific: their advantage over the logits rarely transfers, so training data should be matched to the target task.
</details>
(b) Performance (AUC) difference between the probe and the logit-based method. Values above $0$ indicate generalization beyond the logit-based method.
Figure 9: Generalization between datasets, Llama-3-8b-instruct.
Appendix D Taxonomy of Errors
<details>
<summary>extracted/6450693/figures/correctness_across_resamples.png Details</summary>

### Visual Description
## Chart: Correctness vs. Number of Retries
### Overview
The image presents a line chart illustrating the relationship between the number of retries and the resulting correctness. The chart shows a clear upward trend, indicating that correctness increases with the number of retries, but with diminishing returns as the number of retries increases.
### Components/Axes
* **X-axis Label:** "# retries" - Represents the number of attempts or retries. The scale ranges from approximately 1 to 31.
* **Y-axis Label:** "Correctness" - Represents the accuracy or correctness of a process or result. The scale ranges from approximately 0.69 to 0.87.
* **Data Series:** A single blue line with circular data points.
* **No Legend:** There is no explicit legend, but the single line represents the relationship between the two variables.
### Detailed Analysis
The line representing correctness starts at a value of approximately 0.695 when the number of retries is 1. The line then exhibits a steep upward slope, rapidly increasing correctness as the number of retries increases.
Here's a breakdown of approximate data points:
* **Retry = 1:** Correctness ≈ 0.695
* **Retry = 2:** Correctness ≈ 0.725
* **Retry = 3:** Correctness ≈ 0.755
* **Retry = 4:** Correctness ≈ 0.775
* **Retry = 5:** Correctness ≈ 0.790
* **Retry = 6:** Correctness ≈ 0.800
* **Retry = 8:** Correctness ≈ 0.815
* **Retry = 10:** Correctness ≈ 0.825
* **Retry = 12:** Correctness ≈ 0.835
* **Retry = 14:** Correctness ≈ 0.840
* **Retry = 16:** Correctness ≈ 0.845
* **Retry = 18:** Correctness ≈ 0.850
* **Retry = 20:** Correctness ≈ 0.853
* **Retry = 22:** Correctness ≈ 0.855
* **Retry = 24:** Correctness ≈ 0.857
* **Retry = 26:** Correctness ≈ 0.858
* **Retry = 28:** Correctness ≈ 0.860
* **Retry = 31:** Correctness ≈ 0.862
The slope of the line decreases as the number of retries increases. The increase in correctness becomes smaller with each additional retry, suggesting diminishing returns. The line appears to be approaching an asymptote, indicating that further retries will yield only marginal improvements in correctness.
### Key Observations
* The initial increase in correctness with retries is substantial.
* The rate of improvement in correctness decreases significantly after approximately 16 retries.
* The correctness appears to plateau around a value of 0.86.
* There are no apparent outliers or anomalies in the data.
### Interpretation
The chart demonstrates the concept of diminishing returns in a process that involves retries. Initially, each retry significantly improves the correctness of the outcome. However, as the number of retries increases, the marginal benefit of each additional retry decreases. This suggests that there is an optimal number of retries beyond which further attempts are unlikely to yield substantial improvements.
This type of data could represent a variety of scenarios, such as:
* **Error Correction:** The number of attempts to correct an error in a system.
* **Machine Learning Training:** The number of training iterations to improve model accuracy.
* **Iterative Problem Solving:** The number of attempts to solve a problem through trial and error.
The chart highlights the importance of balancing the cost of retries with the potential benefits in terms of improved correctness. It suggests that a strategy of limiting the number of retries to a point where the marginal benefit is low could be more efficient than continuing to retry indefinitely.
</details>
Figure 10: The percentage of answers for which at least one generated answer was correct. The first step is greedy decoding.
Figure 10 presents, for each number of resamples, the percentage of questions for which at least one generated answer was correct. The experiment was conducted on Mistral-7b-instruct with the TriviaQA dataset. For many questions where greedy decoding fails to produce a correct answer, the LLM is still able to generate the correct answer in at least one resample. The plot plateaus around 30 resamples.
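The curve in Figure 10 can be computed from a per-question matrix of correctness flags, where column 0 holds the greedy decoding and the remaining columns hold the resamples. A minimal sketch with synthetic data (the success probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_samples = 1000, 30

# Hypothetical per-question success probabilities (synthetic).
p = rng.beta(2.0, 1.0, size=n_questions)
# correct[i, k] == True iff sample k of question i was judged correct;
# column 0 stands in for the greedy decoding.
correct = rng.random((n_questions, n_samples)) < p[:, None]

# Fraction of questions with >= 1 correct answer among the first k samples.
curve = [np.any(correct[:, :k], axis=1).mean() for k in range(1, n_samples + 1)]

# The curve is monotonically non-decreasing and flattens out with k,
# mirroring the plateau seen in Figure 10.
```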
D.1 Error Taxonomy Design Choices
The error taxonomy proposed in this paper is intentionally non-orthogonal, as some errors may simultaneously belong to multiple categories. For instance, an error might fall under both "consistently incorrect" (e.g., the same incorrect answer appears at least 15 times) and "many different answers" (e.g., the remaining answers show over 10 distinct variants).
Our taxonomy is designed to capture such nuanced cases, as restricting classification to a single category would hinder the generalizability of insights. Instead, we aim to learn general properties across different error types, providing LLM providers with actionable insights into questions exhibiting overlapping error patterns.
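As a concrete illustration of this overlap, the assignment of (possibly multiple) error-type labels from a question's resampled answers can be sketched as follows; the function name and exact thresholds are illustrative, not taken from the released code:

```python
from collections import Counter

def error_labels(answers, correct_answer, repeat_thr=15, variant_thr=10):
    """Assign a (possibly overlapping) set of error-type labels."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    labels = set()
    if top_answer == correct_answer and top_count == len(answers):
        labels.add("consistently correct")
    if top_answer != correct_answer and top_count >= repeat_thr:
        labels.add("consistently incorrect")
    if len(counts) > variant_thr:
        labels.add("many different answers")
    return labels

# A question whose 30 resamples repeat one wrong answer 15 times *and*
# scatter across many variants receives both labels at once.
answers = ["Caerleon"] * 15 + [f"variant {i}" for i in range(11)] + ["Blaenavon"] * 4
print(error_labels(answers, "Blaenavon"))
```

Because the labels form a set rather than a single class, a downstream probe can be trained per label, matching the non-orthogonal design.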
To support this non-orthogonal framework, our probes function as one-to-many classifiers, enabling precise error analysis and tailored solutions.
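A one-to-many setup of this kind can be sketched as one independent binary classifier per error type over the same representations, e.g. with scikit-learn (synthetic features and labels; this is an illustration of the setup, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))                 # stand-in hidden states
Y = (rng.random((300, 4)) < 0.3).astype(int)   # 4 non-exclusive error types

# One independent binary probe per error type over the same features.
probe = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = probe.predict(X)

# Each column is an independent yes/no decision, so a single question can
# receive several error-type labels simultaneously.
```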
D.2 Results on Additional Datasets
Table 8 presents the results of error type classification on the Winobias dataset and Table 9 on the Math dataset.
Table 8: AUC scores for error type classification (Winobias).
| Error type | Mistral-7b | Mistral-Instr-7b | Llama3-8b | Llama3-Instr-8b |
| --- | --- | --- | --- | --- |
| (A) Refuses to answer | - | - | - | - |
| (B) Consistently correct | $0.83\scriptscriptstyle{± 0.004}$ | $0.88\scriptscriptstyle{± 0.002}$ | $0.84\scriptscriptstyle{± 0.003}$ | $0.89\scriptscriptstyle{± 0.003}$ |
| (C) Consistently incorrect | $0.83\scriptscriptstyle{± 0.004}$ | $0.88\scriptscriptstyle{± 0.002}$ | $0.79\scriptscriptstyle{± 0.004}$ | $0.90\scriptscriptstyle{± 0.003}$ |
| (D) Two competing | $0.68\scriptscriptstyle{± 0.004}$ | $0.58\scriptscriptstyle{± 0.015}$ | $0.74\scriptscriptstyle{± 0.005}$ | $0.88\scriptscriptstyle{± 0.004}$ |
| (E) Many answers | - | - | - | - |
Table 9: AUC scores for error type classification (Math). Error types are predictable from the inner model representations, indicating the encoding of fine-grained information on errors.
| Error type | Mistral-7b | Mistral-Instr-7b | Llama3-8b | Llama3-Instr-8b |
| --- | --- | --- | --- | --- |
| (A) Refuses to answer | - | - | - | - |
| (B) Consistently correct | $0.85\scriptscriptstyle{± 0.017}$ | $0.84\scriptscriptstyle{± 0.007}$ | $0.83\scriptscriptstyle{± 0.020}$ | $0.87\scriptscriptstyle{± 0.006}$ |
| (C) Consistently incorrect | $0.85\scriptscriptstyle{± 0.026}$ | $0.85\scriptscriptstyle{± 0.003}$ | $0.69\scriptscriptstyle{± 0.032}$ | $0.91\scriptscriptstyle{± 0.007}$ |
| (D) Two competing | - | $0.76\scriptscriptstyle{± 0.020}$ | $0.57\scriptscriptstyle{± 0.001}$ | $0.79\scriptscriptstyle{± 0.006}$ |
| (E) Many answers | $0.74\scriptscriptstyle{± 0.010}$ | $0.79\scriptscriptstyle{± 0.015}$ | $0.69\scriptscriptstyle{± 0.041}$ | $0.90\scriptscriptstyle{± 0.008}$ |
D.3 Qualitative Examples
Tables 10 and 11 present qualitative examples of the error types in the TriviaQA and Math datasets.
Table 10: Examples of error types in TriviaQA, Mistral-7B-Instruct. Correct answer is in bold.
| Type of error | Question | Answers |
| --- | --- | --- |
| Consistently correct | What clothing-part metaphorically classifies workers/jobs according to white or blue? | **"collar"**: 30 |
| Consistently incorrect | Which town in southeast Wales became a UNESCO World Heritage Site in 2000? | **"Blaenavon"**: 1, "Caerleon": 29 |
| Many different answers | Published in 2013, who wrote the novel "The Kill List"? | **"Frederick Forsyth"**: 1, "Jerry Patterson": 1, "Edward Lee": 1, "Barry Lancet": 4, "Jeremy Holiday": 1, "Barry Lincoff": 1, "Jim Marrs": 1, "John Marrs": 1, "Anthony Lacy": 1, "Daniel Kraus": 1, "Ron Bass": 1, "David Martiniello": 2, "Eric Lustbader": 1, "Barbie Latza Nadeau": 1, "James Swallow": 1, "Mark Sullivan": 1, "Alex Binotto": 1, "David Baldacci": 1, "Bill Cosores": 1, "Frederic J. Brown": 1, "Ron Capps and Tate Foley": 1, "Barbie Wilde": 1, "NO ANSWER": 3 |
| Two competing answers | What is the only letter of the alphabet which does not appear in any of the names of the 50 American states? | **"The letter q"**: 15, "The letter X": 15 |
Table 11: Examples of error types in Math, Mistral-7B-Instruct. Correct answer is in bold.
| Type of error | Question | Answers |
| --- | --- | --- |
| Consistently correct | If John travels 15 miles on a bike ride, and Jill travels 5 miles less, how many miles does Jim travel if he travels only 20% as far as Jill? | **"2"**: 30 |
| Consistently incorrect | Joy has 30 pencils, and Colleen has 50 pencils. If they bought the pencils at $4 each at the store, how much more money did Colleen pay than Joy for her pencils? | **"80$"**: 1, "16$": 29 |
| Many different answers | If the first skyscraper was built 100 years ago, how many years in the future will it be 5 years before its 200th anniversary of being built? | **"95"**: 14, "91": 1, "87": 1, "15": 2, "96": 1, "Six": 1, "202": 1, "2035": 1, "195": 1, "49": 1, "101": 1, "199": 1, "3 years before the 200th anniversary": 1, "203 years after it was built": 1, "196": 1, "2043": 1 |
| Two competing answers | David did 27 more push-ups but 7 less crunches than Zachary in gym class today. If Zachary did 5 push-ups and 17 crunches, how many more crunches than push-ups did Zachary do? | **"12"**: 5, "1": 5 |
Appendix E Detecting the Correct Answer Full Results
In Table 12 we present qualitative samples from Mistral-7b-instruct for the phenomenon we observe in error type (C2): consistently incorrect, but the correct answer is generated at least once. The samples in the table represent cases where the probe chose the correct answer. Table 13 compares different decoding mechanisms, including the choice via probe, on the non-instruct models, and Table 14 on the instruct models. For all datasets and models, we observe conclusions similar to those in the main paper: significant improvement is observed for error types where the LLM shows no preference for the correct answer.
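The four strategies compared in Tables 13 and 14 can be sketched as follows, given a question's resampled answers and a per-answer probe score; all data and names below are illustrative, and greedy decoding is approximated by taking the first sample:

```python
import random
from collections import Counter

def choose(answers, probe_scores, strategy, rng=random.Random(0)):
    if strategy == "greedy":
        # Approximation: the first sample stands in for the greedy decoding.
        return answers[0]
    if strategy == "random":
        return rng.choice(answers)
    if strategy == "majority":
        return Counter(answers).most_common(1)[0][0]
    if strategy == "probing":
        # Pick the answer whose representation the probe scores highest.
        return max(zip(answers, probe_scores), key=lambda t: t[1])[0]
    raise ValueError(f"unknown strategy: {strategy}")

# A (C2)-style question: the wrong answer dominates the 30 samples, but the
# probe assigns its highest score to the lone correct generation.
answers = ["16$"] * 2 + ["80$"] + ["16$"] * 27
scores = [0.2] * 2 + [0.9] + [0.2] * 27
```

Here greedy and majority both return the dominant wrong answer, while the probe-based choice recovers the correct one, mirroring the large probing-column gains for error type (C2).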
Table 12: Examples of questions where Mistral-7b-Instruct consistently provided incorrect answers but occasionally generated the correct one. In these instances, the probe successfully identified the right answer. For each question, the model was sampled 30 times.
Table 13: Various answer choice strategies, non-instruct models.
| | Mistral-7b | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TriviaQA | Math | Winobias | | | | | | | | | | |
| Error type | Greedy | Random | Majority | Probing | Greedy | Random | Majority | Probing | Greedy | Random | Majority | Probing |
| All | $0.63$ $± 0.003$ | $0.54$ $± 0.004$ | $0.65$ $± 0.002$ | $0.62$ $± 0.003$ | $0.25$ $± 0.018$ | $0.36$ $± 0.022$ | $0.49$ $± 0.019$ | $0.60$ $± 0.017$ | $0.69$ $± 0.016$ | $0.58$ $± 0.009$ | $0.62$ $± 0.009$ | $0.83$ $± 0.006$ |
| (A) Refuses to answer | $0.08$ $± 0.015$ | $0.04$ $± 0.009$ | $0.00$ $± 0.000$ | $0.13$ $± 0.007$ | $0.01$ $± 0.009$ | $0.04$ $± 0.019$ | $0.00$ $± 0.000$ | $0.22$ $± 0.033$ | - | - | - | - |
| (B1) All | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | - | - | - | - | - | - | - | - |
| (B2) Most | $0.98$ $± 0.001$ | $0.84$ $± 0.009$ | $1.00$ $± 0.000$ | $0.91$ $± 0.002$ | $0.96$ $± 0.024$ | $0.84$ $± 0.031$ | $1.00$ $± 0.000$ | $0.86$ $± 0.041$ | $0.96$ $± 0.004$ | $0.73$ $± 0.009$ | $0.95$ $± 0.003$ | $0.91$ $± 0.009$ |
| (C) Consistently incorrect | | | | | | | | | | | | |
| (C1) All | $0.00$ $± 0.003$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | - | - | - | - | - | - | - | - |
| (C2) Most | $0.03$ $± 0.014$ | $0.20$ $± 0.008$ | $0.00$ $± 0.000$ | $0.27$ $± 0.036$ | - | - | - | - | $0.19$ $± 0.010$ | $0.30$ $± 0.026$ | $0.00$ $± 0.000$ | $0.70$ $± 0.007$ |
| (D) Two competing | $0.48$ $± 0.006$ | $0.36$ $± 0.008$ | $0.52$ $± 0.015$ | $0.54$ $± 0.016$ | - | - | - | - | $0.73$ $± 0.018$ | $0.54$ $± 0.022$ | $0.47$ $± 0.030$ | $0.85$ $± 0.019$ |
| (E) Many answers | | | | | | | | | | | | |
| (E1) Non correct | $0.01$ $± 0.004$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.01$ $± 0.010$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | - | - | - | - |
| (E2) Correct appears | $0.38$ $± 0.009$ | $0.21$ $± 0.006$ | $0.42$ $± 0.015$ | $0.38$ $± 0.009$ | $0.09$ $± 0.010$ | $0.17$ $± 0.034$ | $0.36$ $± 0.020$ | $0.62$ $± 0.035$ | - | - | - | - |
| Llama-8b | | | | | | | | | | | | |
| TriviaQA | Math | Winobias | | | | | | | | | | |
| Error type | Greedy | Sampling | Majority | Probing | Greedy | Sampling | Majority | Probing | Greedy | Sampling | Majority | Probing |
| All | $0.66$ $± 0.002$ | $0.58$ $± 0.003$ | 0.68 $± 0.003$ | 0.68 $± 0.002$ | $0.30$ $± 0.023$ | $0.47$ $± 0.022$ | $0.62$ $± 0.014$ | $0.70$ $± 0.021$ | $0.73$ $± 0.011$ | $0.61$ $± 0.005$ | $0.66$ $± 0.016$ | 0.84 $± 0.006$ |
| (A) Refuses to answer | $0.08$ $± 0.005$ | $0.07$ $± 0.011$ | $0.00$ $± 0.000$ | 0.16 $± 0.011$ | $0.00$ $± 0.007$ | $0.04$ $± 0.015$ | $0.00$ $± 0.000$ | $0.25$ $± 0.025$ | - | - | - | - |
| (B) Consistently correct | | | | | | | | | | | | |
| (B1) All | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | - | - | - | - | - | - | - | - |
| (B2) Most | $0.98$ $± 0.001$ | $0.87$ $± 0.002$ | 1.00 $± 0.000$ | $0.95$ $± 0.002$ | $0.77$ $± 0.024$ | $0.88$ $± 0.025$ | $1.00$ $± 0.000$ | $0.97$ $± 0.014$ | $0.98$ $± 0.005$ | $0.75$ $± 0.004$ | 1.00 $± 0.000$ | $0.94$ $± 0.003$ |
| (C) Consistently incorrect | | | | | | | | | | | | |
| (C1) All | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | - | - | - | - | - | - | - | - |
| (C2) Most | $0.06$ $± 0.013$ | $0.18$ $± 0.009$ | $0.00$ $± 0.000$ | 0.35 $± 0.043$ | - | - | - | - | $0.25$ $± 0.026$ | $0.29$ $± 0.023$ | $0.00$ $± 0.000$ | 0.65 $± 0.022$ |
| (D) Two competing | $0.44$ $± 0.029$ | $0.42$ $± 0.035$ | $0.53$ $± 0.020$ | 0.66 $± 0.030$ | - | - | - | - | $0.73$ $± 0.025$ | $0.47$ $± 0.019$ | $0.41$ $± 0.037$ | 0.86 $± 0.014$ |
| (E) Many answers | | | | | | | | | | | | |
| (E1) Non correct | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | - | - | - | - |
| (E2) Correct appears | $0.46$ $± 0.009$ | $0.34$ $± 0.009$ | $0.53$ $± 0.007$ | 0.54 $± 0.005$ | $0.14$ $± 0.015$ | $0.17$ $± 0.025$ | $0.44$ $± 0.047$ | $0.65$ $± 0.031$ | - | - | - | - |
Table 14: Various answer choice strategies, instruct models.
| | Mistral-7b-Instruct | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TriviaQA | Math | Winobias | | | | | | | | | | |
| Error type | Greedy | Random | Majority | Probing | Greedy | Random | Majority | Probing | Greedy | Random | Majority | Probing |
| All | $0.63$ $± 0.003$ | $0.64$ $± 0.002$ | $0.67$ $± 0.004$ | $0.71$ $± 0.003$ | $0.55$ $± 0.021$ | $0.52$ $± 0.019$ | $0.57$ $± 0.025$ | $0.70$ $± 0.014$ | $0.77$ $± 0.012$ | $0.77$ $± 0.008$ | $0.77$ $± 0.010$ | $0.79$ $± 0.008$ |
| (A) Refuses to answer | $0.06$ $± 0.005$ | $0.06$ $± 0.011$ | $0.00$ $± 0.000$ | $0.28$ $± 0.009$ | - | - | - | - | - | - | - | - |
| (B) Consistently correct | | | | | | | | | | | | |
| (B1) All | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ |
| (B2) Most | $0.88$ $± 0.007$ | $0.83$ $± 0.009$ | $0.99$ $± 0.002$ | $0.89$ $± 0.010$ | $0.87$ $± 0.013$ | $0.84$ $± 0.024$ | $1.00$ $± 0.000$ | $0.96$ $± 0.007$ | $0.91$ $± 0.031$ | $0.87$ $± 0.029$ | $0.96$ $± 0.017$ | $0.89$ $± 0.032$ |
| (C) Consistently incorrect | | | | | | | | | | | | |
| (C1) All | $0.00$ $± 0.003$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.05$ $± 0.020$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ |
| (C2) Most | $0.11$ $± 0.009$ | $0.15$ $± 0.012$ | $0.00$ $± 0.000$ | $0.53$ $± 0.005$ | $0.10$ $± 0.040$ | $0.20$ $± 0.050$ | $0.00$ $± 0.000$ | $0.82$ $± 0.037$ | $0.18$ $± 0.057$ | $0.20$ $± 0.039$ | $0.00$ $± 0.000$ | $0.54$ $± 0.067$ |
| (D) Two competing | $0.32$ $± 0.010$ | $0.45$ $± 0.023$ | $0.50$ $± 0.024$ | $0.78$ $± 0.017$ | - | - | - | - | - | - | - | - |
| (E) Many answers | | | | | | | | | | | | |
| (E1) Non correct | $0.01$ $± 0.003$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | - | - | - | - | - | - | - | - |
| (E2) Correct appears | $0.23$ $± 0.020$ | $0.19$ $± 0.022$ | $0.38$ $± 0.009$ | $0.56$ $± 0.025$ | - | - | - | - | - | - | - | - |
| Llama-8b-Instruct | | | | | | | | | | | | |
| TriviaQA | Math | Winobias | | | | | | | | | | |
| Error type | Greedy | Sampling | Majority | Probing | Greedy | Sampling | Majority | Probing | Greedy | Sampling | Majority | Probing |
| All | $0.69$ $± 0.003$ | $0.67$ $± 0.001$ | $0.71$ $± 0.002$ | 0.73 $± 0.004$ | $0.89$ $± 0.010$ | $0.87$ $± 0.012$ | 0.91 $± 0.013$ | 0.91 $± 0.010$ | $0.75$ $± 0.009$ | $0.74$ $± 0.009$ | $0.76$ $± 0.012$ | 0.83 $± 0.009$ |
| (A) Refuses to answer | $0.06$ $± 0.011$ | $0.05$ $± 0.011$ | $0.00$ $± 0.000$ | 0.27 $± 0.025$ | - | - | - | - | - | - | - | - |
| (B) Consistently correct | | | | | | | | | | | | |
| (B1) All | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ | $1.00$ $± 0.000$ |
| (B2) Most | $0.93$ $± 0.002$ | $0.86$ $± 0.009$ | 1.00 $± 0.001$ | $0.92$ $± 0.004$ | $0.94$ $± 0.014$ | $0.92$ $± 0.014$ | 1.00 $± 0.000$ | $0.95$ $± 0.013$ | $0.94$ $± 0.006$ | $0.88$ $± 0.010$ | $1.00$ $± 0.000$ | $0.93$ $± 0.011$ |
| (C) Consistently incorrect | | | | | | | | | | | | |
| (C1) All | $0.00$ $± 0.001$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | - | - | - | - | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ |
| (C2) Most | $0.12$ $± 0.018$ | $0.22$ $± 0.010$ | $0.00$ $± 0.000$ | 0.43 $± 0.010$ | - | - | - | - | $0.11$ $± 0.018$ | $0.15$ $± 0.025$ | $0.00$ $± 0.000$ | 0.67 $± 0.016$ |
| (D) Two competing | $0.43$ $± 0.017$ | $0.42$ $± 0.014$ | $0.46$ $± 0.016$ | 0.60 $± 0.010$ | - | - | - | - | $0.39$ $± 0.068$ | $0.39$ $± 0.047$ | $0.38$ $± 0.042$ | 0.83 $± 0.050$ |
| (E) Many answers | | | | | | | | | | | | |
| (E1) Non correct | $0.00$ $± 0.002$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | $0.00$ $± 0.000$ | - | - | - | - | - | - | - | - |
| (E2) Correct appears | $0.28$ $± 0.006$ | $0.28$ $± 0.008$ | $0.40$ $± 0.009$ | 0.52 $± 0.009$ | - | - | - | - | - | - | - | - |
Appendix F Practical Guidance on Integrating Insights from this Paper into Model Development Workflows
The findings of this study reveal critical insights into the internal mechanisms of Large Language Models (LLMs) and their implications for truthfulness and error handling. To effectively incorporate these insights into model development, consider the following strategies:
Error Detection.
Focus on the representations of the exact answer tokens when training the error detection probe. These tokens encode strong truthfulness signals and improve the reliability of error detection mechanisms. The trained probe can then be integrated into the pipeline for a specific task, e.g., math calculations. The probe provides a confidence score that can be used to warn the user about unreliable outputs, or to trigger an intervention that corrects the answer.
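As a minimal sketch of this workflow, the snippet below trains a logistic-regression probe on hidden states and turns its output into a warn/pass decision. The hidden states here are synthetic stand-ins (random vectors separated along a hypothetical "truthfulness direction"); in practice they would be the model's representations of the exact answer tokens, and the probe architecture and threshold are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def train_probe(X, y, lr=0.1, epochs=500):
    """Plain-numpy logistic-regression probe (illustrative sketch)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y                       # gradient of cross-entropy w.r.t. logits
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def truthfulness_score(w, b, h):
    """Probe confidence that the answer behind hidden state h is correct."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

# Synthetic stand-in for exact-answer-token hidden states (d = 32).
rng = np.random.default_rng(0)
d = 32
direction = rng.normal(size=d)             # hypothetical truthfulness direction
H_correct = rng.normal(size=(200, d)) + direction
H_wrong = rng.normal(size=(200, d)) - direction
X = np.vstack([H_correct, H_wrong])
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = train_probe(X, y)
train_acc = float(((truthfulness_score(w, b, X) > 0.5) == y).mean())

def should_warn(h, threshold=0.5):
    """Flag an output whose probe confidence falls below the threshold."""
    return bool(truthfulness_score(w, b, h) < threshold)
```

In deployment, `should_warn` would gate each generated answer: low-confidence outputs are surfaced to the user or routed to a correction step.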
Error-Specific Interventions.
The taxonomy of errors outlined in this study can be used to classify and analyze the types of errors an LLM is likely to produce. Identifying these error types makes it possible to customize mitigation strategies. Probes that detect error types can be deployed as part of the LLM pipeline and trigger interventions based on their predictions. For example, Retrieval Augmented Generation (RAG) (Lewis et al., 2020) can help with "consistently incorrect" errors; alternatives include resampling and choosing the answer ranked highest by the error detection probe, or, if feasible, a weight update as a more permanent solution. For "consistently correct" error types, an intervention on the LLM's internal representations can increase the confidence in generating a correct answer (Simhi et al., 2024).
Cross-Task Generalization.
Universal generalization of probing classifiers across unrelated tasks should be approached with caution. The results in this work show that probes are mainly useful for task-specific error detection.
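Before deploying a probe on a new task, it is worth measuring the generalization gap directly. The sketch below does this on synthetic data: two tasks encode truthfulness along independent random directions, so a mean-difference probe fit on task A performs well in-task but degrades on task B. All names and the toy data-generating process are illustrative assumptions.

```python
import numpy as np

def fit_direction(X, y):
    """Mean-difference probe: direction from incorrect to correct centroid."""
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

def accuracy(direction, X, y):
    scores = X @ direction
    return float(((scores > scores.mean()) == y).mean())

rng = np.random.default_rng(0)
d = 32
dir_a, dir_b = rng.normal(size=d), rng.normal(size=d)  # task-specific truthfulness directions

def make_task(direction, n=300):
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, d)) + np.outer(2 * y - 1, direction)
    return X, y

Xa, ya = make_task(dir_a)
Xb, yb = make_task(dir_b)

probe = fit_direction(Xa, ya)
in_task = accuracy(probe, Xa, ya)      # high: probe matches task A's encoding
cross_task = accuracy(probe, Xb, yb)   # degraded: task B encodes truthfulness differently
```

A cross-task accuracy well below the in-task figure is the signal to train a fresh probe per task rather than reuse one universally.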