2410.02707v4
# LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
> Corresponding author; Work partially done during internship at Apple.
## Abstract
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that, contrary to prior claims, truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation. Our code is available at https://github.com/technion-cs-nlp/LLMsKnow.
## 1 Introduction
The ever-growing popularity of large language models (LLMs) across many domains has brought a significant limitation to center stage: their tendency to "hallucinate", a term often used to describe the generation of inaccurate information. But what are hallucinations, and what causes them? A considerable body of research has sought to define, taxonomize, and understand hallucinations through extrinsic, behavioral analysis, primarily examining how users perceive such errors (Bang et al., 2023; Ji et al., 2023; Huang et al., 2023a; Rawte et al., 2023). However, this approach does not adequately address how these errors are encoded within the LLMs. Alternatively, another line of work has explored the internal representations of LLMs, suggesting that LLMs encode signals of truthfulness (Kadavath et al., 2022; Li et al., 2024; Chen et al., 2024, inter alia). However, these analyses were typically restricted to detecting errors (determining whether a generated output contains inaccuracies) without delving deeper into how such signals are represented and could be leveraged to understand or mitigate hallucinations.
In this work, we reveal that the internal representations of LLMs encode much more information about truthfulness than previously recognized. Through a series of experiments, we train classifiers on these internal representations to predict various features related to the truthfulness of generated outputs. Our findings reveal the patterns and types of information encoded in model representations, linking this intrinsic data to extrinsic LLM behavior. This enhances our ability to detect errors (while understanding the limitations of error detection), and may guide the development of more nuanced strategies based on error types and mitigation methods that make use of the model's internal knowledge. Our experiments are designed to be general, covering a broad array of LLM limitations. While the term "hallucinations" is widely used, it lacks a universally accepted definition (Venkit et al., 2024). Our framework adopts a broad interpretation, considering hallucinations to encompass all errors produced by an LLM, including factual inaccuracies, biases, common-sense reasoning failures, and other real-world errors. This approach enables us to draw general conclusions about model errors from a broad perspective.
Our first step is identifying where truthfulness signals are encoded in LLMs. Previous studies have suggested methods for detecting errors in LLM outputs using intermediate representations, logits, or probabilities, implying that LLMs may encode signals of truthfulness (Kadavath et al., 2022; Li et al., 2024; Chen et al., 2024). Focusing on long-form generations, which reflect real-world usage of LLMs, our analysis uncovers a key oversight: the choice of token used to extract these signals (Section 3). We find that truthfulness information is concentrated in the exact answer tokens, e.g., "Hartford" in "The capital of Connecticut is Hartford, an iconic city…". Recognizing this nuance significantly improves error detection strategies across the board, revealing that truthfulness encoding is stronger than previously observed.
From this point forward, we concentrate on our most effective strategy: a classifier trained on intermediate LLM representations at the exact answer tokens, known as a "probing classifier" (Belinkov, 2021). This approach helps us explore what these representations reveal about LLMs. Our demonstration that a trained probing classifier can predict errors suggests that LLMs encode information related to their own truthfulness. However, we find that probing classifiers do not generalize across different tasks (Section 4). Generalization occurs only within tasks requiring similar skills (e.g., factual retrieval), indicating the truthfulness information is "skill-specific" and varies across different tasks. For tasks involving different skills, e.g., sentiment analysis, these classifiers are no better, or even worse, than logit-based uncertainty predictors, challenging the idea of a "universal truthfulness" encoding proposed in previous work (Marks & Tegmark, 2023; Slobodkin et al., 2023). Instead, our results indicate that LLMs encode multiple, distinct notions of truth. Thus, deploying trainable error detectors in practical applications should be undertaken with caution.
We next find evidence that LLMs encode not only error detection signals but also more nuanced information about error types. Delving deeper into a single task, we taxonomize its errors based on responses across repeated samples (Section 5). For example, the same error being consistently generated is different from an error that is generated occasionally among many other distinct errors. Using a different set of probing classifiers, we find that error types are predictable from the LLM representations, drawing a connection between the model's internal representations and its external behavior. This classification offers a more nuanced understanding of errors, enabling developers to predict error patterns and implement more targeted mitigation strategies.
Finally, we find that the truthfulness signals encoded in LLMs can also differentiate between correct and incorrect answers for the same question (Section 6). Results highlight a significant misalignment between an LLM's internal representations and its external behavior in some cases. The model's internal encoding may identify the correct answer, yet it frequently generates an incorrect response. This discrepancy reveals that the LLM's external behavior may misrepresent its abilities, potentially pointing to new strategies for reducing errors by utilizing its existing strengths. Overall, our model-centric framework provides a deeper understanding of LLM errors, suggesting potential directions for improvements in error analysis and mitigation.
## 2 Background
Defining and characterizing LLM errors.
The term "hallucinations" is widely used across various subfields such as conversational AI (Liu et al., 2022), abstractive summarization (Zhang et al., 2019), and machine translation (Wang & Sennrich, 2020), each interpreting the term differently. Yet, no consensus exists on defining hallucinations: Venkit et al. (2024) identified 31 distinct frameworks for conceptualizing hallucinations, revealing the diversity of perspectives. Research efforts aim to define and taxonomize hallucinations, distinguishing them from other error types (Liu et al., 2022; Ji et al., 2023; Huang et al., 2023a; Rawte et al., 2023). On the other hand, recent scholarly conversations introduce terms like "confabulations" (Millidge, 2023) and "fabrications" (McGowan et al., 2023), attributing a possible "intention" to LLMs, although the notions of LLM "intention" and other human-like traits are still debated (Salles et al., 2020; Serapio-García et al., 2023; Harnad, 2024). These categorizations, however, adopt a human-centric view by focusing on the subjective interpretations of LLM hallucinations, which does not necessarily reflect how these errors are encoded within the models themselves. This gap limits our ability to address the root causes of hallucinations, or to reason about their nature. For example, it is unclear whether conclusions about hallucinations defined in one framework can be applied to another framework. Liang et al. (2024) defined hallucinations as inconsistencies with the training data. While this approach engages with the possible root causes of hallucinations, our study focuses on insights from the model itself, without requiring training data access. Instead, we adopt a broad interpretation of hallucinations. Here, we define hallucinations as any type of error generated by an LLM, including factual inaccuracies, biases, failures in common-sense reasoning, and others.
Another line of research suggests that LLMs either encode information about their own errors (Kadavath et al., 2022; Azaria & Mitchell, 2023) or exhibit discrepancies between their outputs and internal representations (Liu et al., 2023; Gottesman & Geva, 2024), indicating the presence of underlying mechanisms not reflected in their final outputs. Moreover, Yona et al. (2024) found that current LLMs fail to effectively convey their uncertainty through their generated outputs. Hence, we propose shifting the focus from human-centric interpretations of hallucinations to a model-centric perspective, examining the model's intermediate activations.
Error detection in LLMs.
Error detection is a longstanding task in NLP, crucial for maintaining high standards in various practical applications and for constructing more reliable systems that ensure user trust (Bommasani et al., 2021). Over the years, many studies have proposed task-specific solutions (see Section A.1). However, the recent shift towards general-purpose LLMs necessitates a holistic approach capable of addressing any error type, rather than focusing on specific ones, making it suitable for the diverse errors generated by these models.
A line of work has addressed this challenge by leveraging external knowledge sources (Lewis et al., 2020; Gao et al., 2023) or an external LLM judge (Lin et al., 2021; Rawte et al., 2023) to identify erroneous outputs. In contrast, our work focuses on detection methods that rely solely on the computations of the LLM itself: output logits, probabilities after softmax, and hidden states.
Error detection in LLMs is also closely linked to uncertainty estimation, where low certainty signals potential inaccuracies and possible errors. Popular methods to derive calibrated confidence include inspecting the model logit output values (Varshney et al., 2023; Taubenfeld et al., 2025), agreement across multiple sampled answers (Kuhn et al., 2023; Manakul et al., 2023; Tian et al., 2023a), verbalized probability (Tian et al., 2023b), and direct prompting (Kadavath et al., 2022).
Another line of work trains probing classifiers to discover and utilize truthfulness features. This approach has shown some success by probing the final token of an answer, either generated (Kadavath et al., 2022; Snyder et al., 2023; Yuksekgonul et al., 2023; Zou et al., 2023; Yin et al., 2024; Chen et al., 2024; Simhi et al., 2024; Gekhman et al., 2025) or not (Li et al., 2024; Marks & Tegmark, 2023; Burns et al., 2022; Azaria & Mitchell, 2023; Rateike et al., 2023). Others probe the final token of the prompt, before the response is generated (Slobodkin et al., 2023; Snyder et al., 2023; Simhi et al., 2024; Gottesman & Geva, 2024). Many previous studies simplify the analysis by generating answers in a few-shot setting or limiting generation to a single token. In contrast, we simulate real-world usage of LLMs by allowing unrestricted answer generation. By probing exact answer tokens, we achieve significant improvements in error detection.
## 3 Better Error Detection
This section presents our experiments on detecting LLM errors through their own computations, focusing on the impact of token selection and introducing a method that outperforms existing approaches.
### 3.1 Task Definition
Given an LLM $M$ , an input prompt $p$ and the LLM-generated response $\hat{y}$ , the task is to predict whether $\hat{y}$ is correct or wrong. We assume that there is access to the LLMâs internal states (i.e., white-box setting), but no access to any external resources (e.g., search engine or additional LLMs).
We use a dataset $D=\{(q_{i},y_{i})\}_{i=1}^{N}$, consisting of $N$ question-label pairs, where $\{q_{i}\}_{i=1}^{N}$ represents a series of questions (e.g., "What is the capital of Connecticut?") and $\{y_{i}\}_{i=1}^{N}$ the corresponding ground-truth answers ("Hartford"). For each question $q_{i}$, we prompt the model $M$ to generate a response $\hat{y}_{i}$, resulting in the set of predicted answers $\{\hat{y}_{i}\}_{i=1}^{N}$ ("The capital of Connecticut is Hartford…"). Next, to build our error-detection dataset, we evaluate the correctness of each generated response $\hat{y}_{i}$ by comparing it to the ground-truth label $y_{i}$. This comparison yields a correctness label $z_{i}\in\{0,1\}$ ($1$ for correct, $0$ for wrong). The comparison can be done either via automatic heuristics or with the assistance of an instruct-LLM; for most datasets, we use heuristics to predict correctness, except for one case (see Appendix A.2). Our error detection dataset is thus $\{(q_{i},\hat{y}_{i},z_{i})\}_{i=1}^{N}$. Note that this dataset is defined based on the analyzed LLM and its generated answers. Any instances where the LLM refuses to answer are excluded, as these can easily be classified as incorrect.
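The dataset construction above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `generate` is a hypothetical stand-in for the LLM call, and the substring check and refusal filter are simplified versions of the heuristics described in Appendix A.2.

```python
def is_correct(generated: str, gold: str) -> int:
    """Heuristic correctness label z_i: 1 if the gold answer string
    occurs in the generated response (case-insensitive), else 0."""
    return int(gold.lower() in generated.lower())

def build_error_detection_dataset(qa_pairs, generate):
    """Map (q_i, y_i) pairs to (q_i, y_hat_i, z_i) triples, excluding
    refusals, which are trivially classified as incorrect."""
    data = []
    for q, y in qa_pairs:
        y_hat = generate(q)  # unrestricted greedy generation
        if "i don't know" in y_hat.lower():
            continue  # refusal: excluded from the dataset
        data.append((q, y_hat, is_correct(y_hat, y)))
    return data

# Toy usage with a stub in place of the LLM:
qa = [("What is the capital of Connecticut?", "Hartford")]
stub = lambda q: "The capital of Connecticut is Hartford, an iconic city."
print(build_error_detection_dataset(qa, stub))
```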
### 3.2 Experimental Setup
Datasets and models.
We perform all experiments on four LLMs: Mistral-7b (Jiang et al., 2023), Mistral-7b-instruct-v0.2 (denoted Mistral-7b-instruct), Llama3-8b (Touvron et al., 2023), and Llama3-8b-instruct. We consider 10 different datasets spanning various domains and tasks: TriviaQA (Joshi et al., 2017), HotpotQA with/without context (Yang et al., 2018), Natural Questions (Kwiatkowski et al., 2019), Winobias (Zhao et al., 2018), Winogrande (Sakaguchi et al., 2021), MNLI (Williams et al., 2018), Math (Sun et al., 2024), IMDB review sentiment analysis (Maas et al., 2011), and a dataset of movie roles (movies) that we curate. We allow unrestricted response generation to mimic real-world LLM usage, with answers decoded greedily. For more details on the datasets and the prompts used to generate answers, refer to Appendix A.3.
Performance metric.
We measure the area under the ROC curve (AUC) to evaluate error detectors. This provides a single metric that reflects their ability to distinguish between positive and negative cases across all decision thresholds, trading off the true positive rate against the false positive rate.
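Concretely, the AUC equals the probability that a randomly chosen correct answer receives a higher detector score than a randomly chosen wrong one. A minimal pairwise sketch of this view (in practice a library routine such as scikit-learn's `roc_auc_score` would be used):

```python
def auc(scores, labels):
    """AUC via the pairwise (Mann-Whitney) formulation:
    P(score of a correct answer > score of a wrong one), ties count 0.5."""
    pos = [s for s, z in zip(scores, labels) if z == 1]
    neg = [s for s, z in zip(scores, labels) if z == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.85], [1, 1, 0, 0]))  # -> 0.75
```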
Error detection methods. We compare methods from both uncertainty and hallucinations literature.
- Aggregated probabilities / logits: Previous studies (Guerreiro et al., 2023; Kadavath et al., 2022; Varshney et al., 2023; Huang et al., 2023b) aggregate output token probabilities or logits to score LLM confidence for error detection. We implement several methods from the literature, calculating the minimum, maximum, or mean of these values. The main paper reports results for the most common approach, Logits-mean, and the best-performing one, Logits-min, with additional baselines in Appendix B.
- P(True): Kadavath et al. (2022) showed that LLMs are relatively calibrated when asked to evaluate the correctness of their generation via prompting. We implement this evaluation using the same prompt.
- Probing: Probing classifiers involve training a small classifier on a model's intermediate activations to predict features of the processed text (Belinkov, 2021). Recent studies show their effectiveness for error detection in generated text (Kadavath et al., 2022, inter alia). An intermediate activation is a vector $h_{l,t}$ taken from a specific LLM layer $l$ at a (read or generated) token $t$; each LLM generation thus produces multiple such activations. Following prior work, we use a linear probing classifier for error detection (Li et al., 2024, inter alia) on static tokens: the last generated token ($h_{l,-1}$), the one before it ($h_{l,-2}$), and the final prompt token ($h_{l,k}$). The layer $l$ is selected per token based on validation-set performance.
For further details on the implementation of each method, refer to Appendix A.4.
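As a concrete reference point, the logit/probability aggregation baselines reduce the per-token scores of the generated answer to a single confidence value. A minimal sketch under the assumption that per-token log-probabilities have already been read from the model's output:

```python
import math

def aggregate_confidence(token_logprobs, mode):
    """Reduce per-token log-probabilities of a generated answer to one
    confidence score: 'min' flags the single least-confident token
    (Logits-min), 'mean' averages over the answer (Logits-mean)."""
    if mode == "min":
        return min(token_logprobs)
    if mode == "mean":
        return sum(token_logprobs) / len(token_logprobs)
    raise ValueError(f"unknown mode: {mode}")

# A uniformly confident answer vs. one with a single shaky token;
# the min-aggregation separates them more sharply than the mean.
steady = [math.log(0.9)] * 4
shaky = [math.log(0.9)] * 3 + [math.log(0.1)]
print(aggregate_confidence(steady, "min") > aggregate_confidence(shaky, "min"))  # -> True
```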
<details>
<summary>x1.png Details</summary>

Diagram of a prompt (`<s> [INST] What is the capital of the U.S. state of Connecticut? [/INST]`) and the model's long-form response ("The capital city of the U.S. state of Connecticut is Hartford. It's one of the oldest cities..."), with the probeable token positions marked: the last question token, the first and last exact answer tokens (both on "Hartford"), the end-of-sequence token `</s>`, and the final generated positions (-2, -1).

</details>
Figure 1: Example for the input and LLM output from the TriviaQA dataset, and the names of the tokens that can be probed.
Exact Answer Tokens.
Existing methods often overlook a critical nuance: the token selection for error detection, typically focusing on the last generated token or taking a mean over all tokens. However, since LLMs typically generate long-form responses, this practice may miss crucial details (Brunner et al., 2020). Other approaches use the last token of the prompt (Slobodkin et al., 2023, inter alia), but this is inherently inaccurate due to LLMs' unidirectional nature: it fails to account for the generated response and misses cases where different sampled answers from the same model vary in correctness. We investigate a previously unexamined token location: the exact answer tokens, which represent the most meaningful parts of the generated response. We define exact answer tokens as those whose modification alters the answer's correctness, disregarding subsequent generated content. In practice, we do not use this definition for extracting the exact answer, but rather an instruct model in a few-shot setting; still, the definition is useful to manually verify that automatic extractions work as expected. Figure 1 illustrates the different token locations. In the following experiments, we implement each error detection method with an "exact answer" version, demonstrating that it often improves performance, especially in probing. Implementation details for detecting the exact answer token are given in Appendix A.2.
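To make the token-location distinction concrete, here is a minimal sketch of mapping an already-extracted answer string back to token indices via character offsets. This is an illustrative simplification: the paper extracts the answer with a few-shot instruct model, and real tokenizers require care with special tokens and whitespace.

```python
def exact_answer_token_indices(tokens, answer):
    """Return (first, last) indices of the tokens that overlap the
    answer string in the detokenized text, or None if absent."""
    text, offsets = "", []
    for tok in tokens:
        offsets.append((len(text), len(text) + len(tok)))
        text += tok
    start = text.lower().find(answer.lower())
    if start == -1:
        return None
    end = start + len(answer)
    # Tokens whose character span overlaps [start, end)
    span = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    return span[0], span[-1]

toks = ["The", " capital", " of", " Connecticut", " is", " Hart", "ford", "."]
print(exact_answer_token_indices(toks, "Hartford"))  # -> (5, 6)
```

The hidden states $h_{l,t}$ at these indices are then the ones fed to the probe.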
### 3.3 Results
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/triviaqa_auc.png Details</summary>

Heatmap of probe AUC over layers (0-30, y-axis) and token positions (x-axis: last question token, first generated tokens, exact answer tokens, final generated positions) for TriviaQA. AUC is highest (around 0.9-1.0) at the exact answer tokens in the middle layers (roughly 10-20), marked by a highlighted rectangle.

</details>
(a) TriviaQA
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/winobias_auc.png Details</summary>

Heatmap of probe AUC over layers (0-30) and token positions for Winobias. As in the other panels, AUC peaks at the exact answer tokens in the middle layers (roughly 10-20, highlighted), while surrounding token positions stay in the 0.5-0.8 range.

</details>
(b) Winobias
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/answerable_math_auc.png Details</summary>

Heatmap of probe AUC over layers (0-30) and token positions for Math. A dark block of high AUC (around 0.9-1.0) spans the exact answer tokens in the middle layers (roughly 10-20); the final generated positions (-8 through -1) show intermediate values, and the last question token is weakest.

</details>
(c) Math
Figure 2: AUC values of a probe error detector across layers and tokens, Mistral-7b-instruct. Generation proceeds from left to right, with detection performance peaking at the exact answer tokens.
Patterns of truthfulness encoding.
We first focus on probing classifiers to gain insights into the internal representations of LLMs. Specifically, we analyze the effects of layer and token selection on the error detection performance of these probing classifiers. By systematically probing all model layers, starting from the last question token to the final generated token, we observe consistent truthfulness encoding patterns. Figure 2 shows AUC metrics of probes across Mistral-7b-instruct layers and tokens. Middle to later layers often yield the most effective probing results (see Appendix B for more datasets and models), aligning with previous studies on truthfulness encoding (Burns et al., 2022; CH-Wang et al., 2023) and transformer representations (nostalgebraist, 2020; Meng et al., 2022; Geva et al., 2023). Regarding tokens, a strong truthfulness signal appears immediately after the prompt, suggesting that this representation encodes information on the model's general ability to answer the question correctly. This signal weakens as text generation progresses but peaks again at the exact answer tokens. Towards the end of the generation process, signal strength rises again, though it remains weaker than at the exact answer tokens. These patterns are consistent across nearly all datasets and models (see Appendix B), suggesting a general mechanism by which LLMs encode and process truthfulness during text generation.
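The layer/token sweep behind Figure 2 can be sketched on synthetic activations. This is a toy stand-in: a least-squares linear probe replaces the paper's trained classifier, the activation cache is random, and a truthfulness direction is planted at one (layer, token) location so the sweep has something to find.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_auc(train_X, train_z, val_X, val_z):
    # Fit a linear probe by least squares (a lightweight stand-in for
    # logistic regression) and score it with pairwise AUC on held-out data.
    w, *_ = np.linalg.lstsq(train_X, 2.0 * train_z - 1.0, rcond=None)
    s = val_X @ w
    pos, neg = s[val_z == 1], s[val_z == 0]
    return (pos[:, None] > neg[None, :]).mean()

# Synthetic cache: acts[layer, token] holds (n_examples, d) hidden states,
# z the correctness labels; a signal direction is planted at (layer=2, token=1).
n, d, n_layers, n_tokens = 200, 16, 4, 3
z = rng.integers(0, 2, n)
acts = rng.normal(size=(n_layers, n_tokens, n, d))
direction = rng.normal(size=d)
acts[2, 1] += 2.0 * (2 * z - 1)[:, None] * direction

# Sweep all (layer, token) pairs, selecting by validation AUC.
best = max(
    ((l, t, probe_auc(acts[l, t, :100], z[:100], acts[l, t, 100:], z[100:]))
     for l in range(n_layers) for t in range(n_tokens)),
    key=lambda item: item[2],
)
print(best)  # the planted location (layer=2, token=1) should score highest
```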
Error Detection Results.
Table 1: Comparison of error detection techniques using AUC metric, across different models and datasets. The best-performing method is bolded. Using exact answer tokens is useful for many cases, especially probing.
| | Mistral-7b-Instruct | | | Llama3-8b-Instruct | | |
| --- | --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | TriviaQA | Winobias | Math |
| Logits-mean | $0.60$ $\pm 0.009$ | $0.56$ $\pm 0.017$ | $0.55$ $\pm 0.029$ | $0.66$ $\pm 0.005$ | $0.60$ $\pm 0.026$ | $0.75$ $\pm 0.018$ |
| Logits-mean-exact | $0.68$ $\pm 0.007$ | $0.54$ $\pm 0.012$ | $0.51$ $\pm 0.005$ | $0.71$ $\pm 0.006$ | $0.55$ $\pm 0.019$ | $0.80$ $\pm 0.021$ |
| Logits-min | $0.63$ $\pm 0.008$ | $0.59$ $\pm 0.012$ | $0.51$ $\pm 0.017$ | $0.74$ $\pm 0.007$ | $0.61$ $\pm 0.024$ | $0.75$ $\pm 0.016$ |
| Logits-min-exact | $0.75$ $\pm 0.006$ | $0.53$ $\pm 0.013$ | $0.71$ $\pm 0.009$ | $0.79$ $\pm 0.006$ | $0.61$ $\pm 0.019$ | $0.89$ $\pm 0.018$ |
| p(True) | $0.66$ $\pm 0.006$ | $0.45$ $\pm 0.021$ | $0.48$ $\pm 0.022$ | $0.73$ $\pm 0.008$ | $0.59$ $\pm 0.020$ | $0.62$ $\pm 0.017$ |
| p(True)-exact | $0.74$ $\pm 0.003$ | $0.40$ $\pm 0.021$ | $0.60$ $\pm 0.025$ | $0.73$ $\pm 0.005$ | $0.63$ $\pm 0.014$ | $0.59$ $\pm 0.018$ |
| Probe @ token | | | | | | |
| Last generated [-1] | $0.71$ $\pm 0.006$ | $0.82$ $\pm 0.004$ | $0.74$ $\pm 0.008$ | $0.81$ $\pm 0.005$ | $0.86$ $\pm 0.007$ | $0.82$ $\pm 0.016$ |
| Before last generated [-2] | $0.73$ $\pm 0.004$ | $0.85$ $\pm 0.004$ | $0.74$ $\pm 0.007$ | $0.75$ $\pm 0.005$ | $0.88$ $\pm 0.005$ | $0.79$ $\pm 0.020$ |
| End of question | $0.76$ $\pm 0.008$ | $0.82$ $\pm 0.011$ | $0.72$ $\pm 0.007$ | $0.77$ $\pm 0.007$ | $0.80$ $\pm 0.018$ | $0.72$ $\pm 0.023$ |
| Exact | **0.85** $\pm 0.004$ | **0.92** $\pm 0.005$ | **0.92** $\pm 0.008$ | **0.83** $\pm 0.002$ | **0.93** $\pm 0.004$ | **0.95** $\pm 0.027$ |
Next, we evaluate various error detection methods by comparing their performance with and without the use of exact answer tokens. Table 1 compares the AUC across three representative datasets (additional datasets and models, showing consistent patterns, appear in Appendix B). Here we present results for the last exact answer token, which outperformed both the first exact answer token and the one preceding it, while the token following the last performed similarly. Incorporating the exact answer token improves the different error detection methods on almost all datasets. Notably, our probing technique (bottom row of Table 1) consistently outperforms all other baselines across the board. While we did not compare all existing error detection methods, the primary conclusion is that information about truthfulness is highly localized in specific generated tokens, and that focusing on exact answer tokens leads to significant improvements in error detection.
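For reference, a logit-based baseline such as Logits-min-exact reduces to taking the minimum token probability over the exact-answer span. A minimal sketch (the function and variable names are ours, not the paper's code):

```python
import numpy as np

def logits_min_exact(token_probs, answer_span):
    """Truthfulness score = minimum model probability over the
    exact-answer tokens (higher = more likely correct).

    token_probs: per-token probabilities of the generated sequence
    answer_span: (start, end) indices of the exact answer tokens
    """
    start, end = answer_span
    return float(np.min(token_probs[start:end]))

# Toy generation: the model is confident everywhere except one
# answer token, which drags the score down.
probs = np.array([0.95, 0.90, 0.30, 0.88, 0.97])
print(logits_min_exact(probs, (2, 4)))  # min over tokens 2..3 -> 0.3
```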
## 4 Generalization Between Tasks
The effectiveness of a probing classifier in detecting errors suggests that LLMs encode information about the truthfulness of their outputs. This supports using probing classifiers for error detection in production, but their generalizability across tasks remains unclear. While some studies argue for a universal mechanism of truthfulness encoding in LLMs (Marks & Tegmark, 2023; Slobodkin et al., 2023), results on probe generalization across datasets are mixed (Kadavath et al., 2022; Marks & Tegmark, 2023; CH-Wang et al., 2023; Slobodkin et al., 2023; Levinstein & Herrmann, 2024), typically observing a decline in performance that nonetheless remains significantly above random chance. Understanding this is essential for real-world applications, where the error detector may encounter examples that differ significantly from those it was trained on. Therefore, we explore whether a probe trained on one dataset can detect errors in others.
Our generalization experiments are conducted between all ten datasets discussed in Section 3, covering a broader range of realistic task settings than previous work. This breadth of experiments has not been previously explored and is crucial given the mixed findings in prior work. We select the optimal token and layer combination for each dataset, train probes with this combination on every other dataset, and then test them on the original dataset. We evaluate generalization performance using the absolute AUC score, defined as $\max(\text{auc}, 1-\text{auc})$, to also account for cases where the learned signal in one dataset is reversed in another.
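The absolute AUC score mirrors the definition above directly:

```python
def absolute_auc(auc: float) -> float:
    """Absolute AUC: max(auc, 1 - auc), crediting a probe whose
    learned direction is reversed on the target dataset."""
    return max(auc, 1.0 - auc)

# A probe whose signal flips on another dataset (raw AUC 0.30)
# still carries usable information:
print(absolute_auc(0.30))
print(absolute_auc(0.80))
```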
Results.
<details>
<summary>extracted/6450693/figures/generalization/mistral_instruct.png Details</summary>

### Visual Description
Heatmap of absolute AUC for probe generalization: rows are training datasets and columns are test datasets (TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA_WC, NQ_WC), colored from blue (0.0) to red (1.0). Diagonal (same-dataset) entries are generally the highest in each column, e.g., TriviaQA 0.86, HotpotQA 0.80, Movies 0.74, Winobias 0.92. Most off-diagonal values exceed 0.5, and transfer is strongest between related tasks (e.g., training on HotpotQA_WC and testing on NQ_WC yields 0.87), while transfer between dissimilar tasks is markedly weaker.
</details>
(a) Raw AUC values. Values above $0.5$ indicate some generalization.
<details>
<summary>extracted/6450693/figures/generalization/mistral_instruct_reduced.png Details</summary>

### Visual Description
Heatmap of the difference between probe AUC and the logit-based baseline, over the same train/test dataset grid, colored from blue (about -0.3) to red (about +0.3). The largest positive values sit on or near the diagonal (e.g., Winobias 0.33, NLI 0.32, Math 0.22) and between closely related tasks, while most off-diagonal entries are near zero or negative (e.g., training on Winobias and testing on TriviaQA: -0.21), indicating that cross-dataset generalization of the probe rarely exceeds the logit-based baseline.
</details>
(b) Performance (AUC) difference between the probe and the logit-based method. Values above $0$ indicate generalization beyond the logit-based method.
Figure 3: Generalization between datasets, Mistral-7b-instruct. After subtracting the logit-based method's performance, we observe that most datasets show limited or no meaningful generalization.
Figure 3(a) shows the generalization results for Mistral-7b-instruct, with similar patterns observed for other LLMs in Appendix C. In this context, values above $0.5$ indicate successful generalization. At first glance, the results appear consistent with previous research: most heatmap values exceed $0.5$, implying some degree of generalization across tasks. This observation would support the existence of a universal mechanism for decoding truthfulness, since the same linear directions, captured by the probe, would encode truthfulness information across many datasets. However, upon closer inspection, it turns out that most of this performance can be achieved by logit-based truthfulness detection, which only observes the output logits. Figure 3(b) presents the same heatmap after subtracting the results of our strongest logit-based baseline (Logits-min-exact). This adjusted heatmap reveals that the probe's generalization rarely exceeds what can be achieved by examining logits alone. This suggests that the observed generalization is not due to a universal internal encoding of truthfulness; instead, it likely arises from information already available through external features, such as logits. Past evidence for generalization may therefore have been overstated.
Nonetheless, we do observe some successful generalization in tasks requiring similar skills, such as parametric factual retrieval (TriviaQA, HotpotQA, Movies) and common-sense reasoning (Winobias, Winogrande, NLI). This suggests that, although the overall pattern of truthfulness signals across tokens appeared consistent across tasks (as observed in Section 3.3), LLMs have many "skill-specific" truthfulness mechanisms rather than a universal one. However, some patterns remain unexplained, such as the asymmetric generalization from TriviaQA to Math tasks. Overall, our findings indicate that models have a multifaceted representation of truthfulness. The internal mechanisms responsible for solving distinct problems are implemented as different components (e.g., circuits) within models (Elhage et al., 2021; Olah et al., 2023). Similarly, LLMs do not encode truthfulness through a single unified mechanism but rather through multiple mechanisms, each corresponding to a different notion of truth. Further investigation is required to disentangle these mechanisms.
## 5 Investigating Error Types
Having established the limitations of error detection, we now shift to error analysis. Previously, we explored types of LLM limitations across different tasks, noting both commonalities and distinctions in their error representations. In this section, we focus on the types of errors LLMs make in a specific task, TriviaQA, which represents factual errors, a commonly studied issue in LLMs (Kadavath et al., 2022; Snyder et al., 2023; Li et al., 2024; Chen et al., 2024; Simhi et al., 2024).
### 5.1 Taxonomy of Errors
Intuitively, not all mistakes are identical. In one case, an LLM may consistently generate the same incorrect answer, considering it correct, while in another it may merely issue a best guess. To analyze errors from the LLM's perspective, we sample $K=30$ responses at temperature $T=1$ for each example in the dataset and then analyze the resulting distribution of answers. We chose $K=30$ as overall correctness appeared to plateau around this point (see Appendix D); we also found that lower temperatures generally produced less truthful answers across repeated trials.
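The resampling analysis can be illustrated on a toy answer distribution; the answer strings and probabilities below are invented for illustration, and a real run would sample completions from the LLM at $T=1$:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy stand-in for an LLM's answer distribution to one question;
# the answers and base probabilities are invented.
answers = ["Maine", "Missouri", "Texas"]
logits = np.log([0.13, 0.80, 0.07])

T = 1.0                                # sampling temperature
probs = np.exp(logits / T)
probs /= probs.sum()

K = 30                                 # resamples per question, as in the paper
samples = rng.choice(answers, size=K, p=probs)
counts = Counter(samples)
print(counts.most_common())
```

The resulting frequency counts are the raw material for the error taxonomy that follows.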
<details>
<summary>x2.png Details</summary>

### Visual Description
Diagram of the question "Otis Barton was a pioneer in exploring where?" feeding into a "Model" node, which branches to three sampled answers with their frequencies: "the underwater world..." (93%, marked correct with a green checkmark), "his excavations in the Maya region..." (3%, marked incorrect with a red X), and "Exploring the underground rivers to Tennessee..." (3%, incorrect). The model almost always answers correctly, with rare hallucinations.
</details>
(a) The LLM mostly answers correctly, but sometimes hallucinates.
<details>
<summary>x3.png Details</summary>

### Visual Description
Diagram of the question "Which American state borders on only one other state?" feeding into a "Model" node with two sampled answers: "Missouri..." generated 87% of the time (incorrect, red X) and "Maine..." generated 13% of the time (correct, green checkmark). The model mostly produces the same wrong answer, yet occasionally generates the correct one.
</details>
(b) The LLM mostly answers incorrectly, but seems to have some knowledge on the correct answer.
<details>
<summary>x4.png Details</summary>

### Visual Description
Diagram of the question "Who became the first female to deliver football commentary on 'match of the day'?" feeding into a "Model" node with three of many sampled answers: "... In 2007, Gabby Logan ..." (20%, incorrect), "The first ... is Clare Balding" (6%, incorrect), and "Jackie Oatley is the first woman ..." (6%, correct). The model scatters probability over many distinct answers, and the correct one appears in only a small fraction of the resamples.
</details>
(c) The LLM generates many different answers; the correct one is among them but is generated in only a small fraction of the resamples.
Figure 4: Different error types in free-form generation, exposed when resampled many times.
Figure 4 illustrates three representative error types. In the first (Figure 4(a)), the model usually gives the correct answer but occasionally makes an error, implying the correct information is present but sampling may lead to mistakes. In the second (Figure 4(b)), the model often responds incorrectly even though it is capable of providing the right answer, indicating some retained knowledge despite consistently making the same error. In the third (Figure 4(c)), the model generates a wide array of mostly incorrect answers, reflecting low confidence in any generated answer.
More generally, we categorize the errors by logging three specific features for each example: (a) the number of different answers generated; (b) the frequency of the correct answer; and (c) the frequency of the most common incorrect answer. These features reveal the following error patterns:
- (A) Refuses to answer: The model responds that it cannot answer the question in at least half the cases.
- (B) Consistently correct: Answers correctly in at least half of the cases. This category is divided into: (B1) always correct; and (B2) mostly correct with occasional errors.
- (C) Consistently incorrect: Consistently generates the same incorrect response in at least half of the cases. Similarly to type B, we subdivide this type into (C1) correct answer is never produced; and (C2) correct answer appears at least once.
- (D) Two competing: Generates both correct and incorrect responses at similar rates (the difference in rates is 5 or less, and each response is generated at least 5 times).
- (E) Many answers: Generates over 10 distinct answers. Like types B and C, this type is subdivided into (E1) the correct answer is never generated; and (E2) the correct answer is generated at least once.
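The taxonomy above can be implemented as a rule-based classifier over the three logged features. The thresholds follow the definitions in the list, while the function itself, the ordering of the rules, and the refusal marker are our illustrative sketch:

```python
from collections import Counter

def classify_error_type(samples, correct, refusal="[refuse]"):
    """Map K resampled answers to a taxonomy label (A-E).

    samples: list of K generated answers (strings)
    correct: the gold answer
    refusal: placeholder marker for 'cannot answer' responses
    """
    K = len(samples)
    counts = Counter(samples)
    n_correct = counts.get(correct, 0)
    wrong = {a: c for a, c in counts.items() if a not in (correct, refusal)}
    top_wrong = max(wrong.values(), default=0)

    if counts.get(refusal, 0) >= K / 2:
        return "A"                                # refuses to answer
    if n_correct >= K / 2:
        return "B1" if n_correct == K else "B2"   # consistently correct
    if top_wrong >= K / 2:
        return "C1" if n_correct == 0 else "C2"   # consistently incorrect
    if abs(n_correct - top_wrong) <= 5 and min(n_correct, top_wrong) >= 5:
        return "D"                                # two competing answers
    if len(counts) > 10:
        return "E1" if n_correct == 0 else "E2"   # many answers
    return "other"

# A model that mostly repeats one wrong answer but sometimes gets it right:
samples = ["Missouri"] * 20 + ["Maine"] * 6 + ["Texas"] * 4
print(classify_error_type(samples, correct="Maine"))  # -> C2
```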
This taxonomy covers 96% of the errors in TriviaQA for Mistral-7b-instruct. For more qualitative examples of each error type, see Appendix D.3. Although some overlap exists between types, our goal is to identify general patterns and explore their connection to the model's internal representations. For a discussion of the design choices behind this taxonomy, refer to Appendix D.1. This taxonomy classifies LLM errors based on an extrinsic, behavior-based analysis. Similarly, previous work analyzed repeated samples to assess an LLM's knowledge of the correct answer (Simhi et al., 2024; Gekhman et al., 2024). Our approach is distinct because it also examines the nature of the errors that the LLM makes. Furthermore, as we discuss next, we analyze the connection between these behavioral patterns and the model's internal encoding.
### 5.2 Predicting Error Types
Our taxonomy offers an external, behavioral analysis of LLMs, which we complement with an intrinsic evaluation. We explore whether LLMs encode information about potential error types within their intermediate activations, offering deeper insight into the underlying mechanisms. To investigate this, we train probes in a one-vs-all setting, where each probe distinguishes a specific error type from all others. We use representations extracted from the answers produced via greedy decoding.
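A minimal sketch of the one-vs-all probing setup on synthetic activations follows; the type-specific directions and signal strength are planted for illustration and do not correspond to the paper's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Synthetic hidden states with a planted linear direction per error type
# (stand-ins for real activations at the exact answer token).
n, d = 1200, 64
types = np.array(list("ABCDE"))[rng.integers(0, 5, size=n)]
dirs = {t: rng.normal(size=d) for t in "ABCDE"}
acts = rng.normal(size=(n, d)) + np.stack([0.6 * dirs[t] for t in types])

# One-vs-all: a separate linear probe per error type.
aucs = {}
for t in "ABCDE":
    y = (types == t).astype(int)
    probe = LogisticRegression(max_iter=1000).fit(acts[:800], y[:800])
    aucs[t] = roc_auc_score(y[800:], probe.predict_proba(acts[800:])[:, 1])
    print(t, round(aucs[t], 2))
```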
Table 2 presents the results. Our findings show that error types can be predicted from the intermediate representations of the greedy decoding generations, suggesting that they may capture not only output correctness but also fine-grained information about potential errors. While detection performance varies between types, the predictability of each type is valuable on its own, as it opens the possibility of tailoring targeted interventions for specific error types. Additionally, although performance on error types C and D is lower, it remains well above random, providing meaningful insights. These results suggest that internal representations encode more than just binary correctness, revealing a nuanced taxonomy of error types and offering deeper insights into how these models process and encode knowledge.
Table 2: AUC scores for error type classification (TriviaQA). Error types are predictable from the inner model representations, indicating the encoding of fine-grained information on errors.
| Error type | Mistral-7b | Mistral-Instr-7b | Llama3-8b | Llama3-Instr-8b |
| --- | --- | --- | --- | --- |
| (A) Refuses to answer | $0.86\scriptscriptstyle{\pm 0.002}$ | $0.85\scriptscriptstyle{\pm 0.011}$ | $0.87\scriptscriptstyle{\pm 0.002}$ | $0.88\scriptscriptstyle{\pm 0.014}$ |
| (B) Consistently correct | $0.88\scriptscriptstyle{\pm 0.001}$ | $0.82\scriptscriptstyle{\pm 0.008}$ | $0.86\scriptscriptstyle{\pm 0.001}$ | $0.81\scriptscriptstyle{\pm 0.002}$ |
| (C) Consistently incorrect | $0.59\scriptscriptstyle{\pm 0.002}$ | $0.67\scriptscriptstyle{\pm 0.002}$ | $0.59\scriptscriptstyle{\pm 0.002}$ | $0.64\scriptscriptstyle{\pm 0.003}$ |
| (D) Two competing | $0.63\scriptscriptstyle{\pm 0.002}$ | $0.68\scriptscriptstyle{\pm 0.006}$ | $0.61\scriptscriptstyle{\pm 0.001}$ | $0.65\scriptscriptstyle{\pm 0.004}$ |
| (E) Many answers | $0.90\scriptscriptstyle{\pm 0.001}$ | $0.84\scriptscriptstyle{\pm 0.003}$ | $0.89\scriptscriptstyle{\pm 0.001}$ | $0.89\scriptscriptstyle{\pm 0.001}$ |
## 6 Detecting the Correct Answer
After identifying that models encode diverse truthfulness-related information, we examine how this internal truthfulness aligns with their external behavior during response generation. To this end, we use our error detection probe (the best-performing probe for each task, trained on the last exact answer token) to select an answer from a pool of 30 generated responses to the same question. We then measure the model's accuracy based on the selected answers. If this accuracy does not significantly differ from traditional decoding methods (such as greedy decoding), the LLM's internal representation of truthfulness is consistent with its external behavior; in simpler terms, the model generates answers that it also internally considers correct. Conversely, if using the probe alters performance either way, this would suggest a misalignment between the LLM's internal representations and its actual behavior.
Experimental Setup.
The experiments were conducted on TriviaQA, Winobias, and Math. We resample each model answer using the same strategy described in Section 5.1. The final chosen answer is the one with the highest correctness probability, as assessed by the probe. We compare to three baselines: (1) greedy decoding; (2) random selection from the $K=30$ answer candidates; and (3) majority vote, wherein the most frequently generated answer is chosen.
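The probe-based selection and the majority-vote baseline can be sketched as follows; the candidate pool and probe scores below are invented for illustration:

```python
from collections import Counter

def choose_by_probe(candidates, probe_scores):
    """Pick the candidate with the highest probe-assigned
    correctness probability."""
    return max(zip(candidates, probe_scores), key=lambda x: x[1])[0]

def choose_by_majority(candidates):
    """Pick the most frequently generated candidate."""
    return Counter(candidates).most_common(1)[0][0]

# Toy pool of K resampled answers with invented probe scores:
# the model mostly says "Missouri", but the probe prefers "Maine".
candidates = ["Missouri"] * 26 + ["Maine"] * 4
scores = [0.40] * 26 + [0.90] * 4

print(choose_by_majority(candidates))         # Missouri
print(choose_by_probe(candidates, scores))    # Maine
```

This is exactly the scenario where internal encoding and external behavior diverge: the majority answer is wrong, yet the probe's scores recover the correct one.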
Results.
The results for Mistral-7b-instruct are summarized in Figure 5, with additional results for other LLMs and datasets, as well as qualitative examples, provided in Appendix E. We only present results for error types that appear 30 times or more in our test dataset. Overall, using the probe to select answers enhances the LLM's accuracy across all examined tasks. However, the extent of improvement varies by error type. For instance, on the TriviaQA dataset, there is minimal gain in the "mostly correct" category (B2). In contrast, substantial gains, ranging from 30 to 40 points in some cases, are observed in the "mostly incorrect" (C2), "two competing answers" (D), and "many answers" (E2) categories. Interestingly, and perhaps surprisingly, the probe is most effective precisely in cases where the LLM shows no (external) preference for the correct answer during generation. The fact that the probe can effectively identify the correct answer in these scenarios points to a significant disconnect between the LLM's internal encoding and its external behavior. These results suggest that even when the model encodes information about which answer is correct, it can still generate an incorrect answer in practice.
While using the probe to select the answer proves effective, it is not proposed here as an error mitigation strategy but rather as a diagnostic tool. However, these findings indicate that further research in this area could leverage the existing knowledge within LLMs to significantly reduce errors. We recommend exploring this direction in future investigations.
<details>
<summary>extracted/6450693/figures/choose_answer/probe_choose_answer_triviaqa_mistral_instruct.png Details</summary>

### Visual Description
Bar chart comparing four answer-selection strategies (greedy, random, majority, probing) by accuracy (%), broken down by response category. Approximate values:

| Category | Greedy | Random | Majority | Probing |
| --- | --- | --- | --- | --- |
| All | 63 | 64 | 67 | 71 |
| Refuses to answer | 6 | 6 | 0 | 28 |
| Consistently correct (All) | 100 | 100 | 100 | 100 |
| Consistently correct (Most) | 88 | 83 | 99 | 89 |
| Consistently incorrect (All) | 0 | 0 | 0 | 0 |
| Consistently incorrect (Most) | 11 | 15 | 0 | 53 |
| Two competing | 32 | 45 | 50 | 78 |
| Many answers (Non correct) | 1 | 0 | 0 | 0 |
| Many answers (Correct appears) | 23 | 19 | 38 | 56 |
### Key Observations
- **Probing Dominance**: Outperforms all strategies in 7/9 categories, especially in error detection ("Consistently incorrect") and multi-answer scenarios.
- **Majority Strength**: Excels in "Consistently correct (Most)" but fails in error-prone categories.
- **Greedy/Random Limitations**: Underperform in specialized categories despite mid-tier performance in "All" responses.
- **Consistency Paradox**: All methods achieve 100% accuracy for fully correct responses but fail entirely for fully incorrect ones.
### Interpretation
The data suggests **probing** is the most robust strategy, excelling in error detection, multi-answer contexts, and binary choices. Its success likely stems from iterative validation or uncertainty-aware mechanisms. **Majority** performs well in consensus-driven scenarios but collapses when errors dominate. **Greedy** and **random** strategies show mediocrity, lacking specialization. The 100% accuracy for "Consistently correct" responses highlights a systemic bias toward rewarding correctness but failing to penalize or detect errors. This pattern aligns with Peircean principles of abduction (hypothesis testing) in probing, which adapts better to ambiguous or erroneous data than static strategies.
</details>
(a) TriviaQA
[Figure: bar chart of answer-selection strategies across response categories on the Math dataset. All strategies score 100% on “consistently correct (all)” responses and near 0% on “consistently incorrect (all)”.]
(b) Math
Figure 5: Different answer choice strategies, Mistral-7B-Instruct. A notable improvement in accuracy by using the error-detection probe is observed for error types where the LLM shows no preference for the correct answer across repeated generations.
## 7 Discussion and Conclusions
In this study, we analyzed LLM errors through their internal representations. Our approach depends on access to internal representations, restricting its use to open-source models. We focus on QA tasks with clear gold labels, which are central to benchmarking truthfulness detection and widely valued by the community. To ensure robustness, we tested 10 datasets across 4 model architectures. Open-ended tasks are left for future research, with our work laying the groundwork for broader applications. For instance, we found that truthfulness-related information is localized in specific tokens within long responses, enabling practical improvements in error detection for production models. This insight could extend to tasks like summarization, by probing the most meaningful entities in an answer.
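The "specific tokens" idea can be made concrete with a small sketch: within a long response, locate the exact-answer span and probe the hidden state of its final token rather than, say, the last token of the whole response. Tokens and hidden states below are toy values, and the function name is illustrative rather than the paper's code.

```python
# Sketch: probe the hidden state at the last token of the exact answer span,
# not the end of the full response. Inputs are toy values for illustration.
import numpy as np

def exact_answer_state(tokens, hidden_states, exact_answer_tokens):
    """Return the hidden state at the last token of the exact answer span."""
    n = len(exact_answer_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == exact_answer_tokens:
            return hidden_states[i + n - 1]  # final token of the span
    raise ValueError("exact answer not found in response")

tokens = ["The", "capital", "of", "France", "is", "Par", "is", "."]
# Toy (seq_len, hidden_dim) states: row i is filled with the value i.
states = np.arange(len(tokens))[:, None] * np.ones((1, 4))
vec = exact_answer_state(tokens, states, ["Par", "is"])  # state at index 6
```

A probe trained on these span-final states is what the token-localization experiments compare against probes on other token positions.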
Truthfulness features showed poor generalization across tasks and datasets, highlighting the need for caution when applying trained error detectors in varied settings. Some unexplained patterns suggest hidden links between unrelated tasks that warrant further research. Improving generalization could involve exploring the effects of layer-token combinations and training on diverse datasets, as demonstrated by Bürger et al. (2024). Deciphering task-specific truthfulness features and their overlaps across tasks might also enhance classifier design. Still, task-specific probes could be highly valuable in critical fields like medicine and law, where reliability matters. These probes can detect errors, predict error types, and guide response selection from resampled outputs, offering significant practical benefits. Guidelines for applying these probes are provided in Appendix F.
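The generalization test behind this observation can be sketched as: train a probe on one task's hidden states, then measure how well it transfers to another. The synthetic features below stand in for real hidden states, with each task's truthfulness signal deliberately placed along a different direction (an assumption made for illustration, matching the paper's "multifaceted" interpretation).

```python
# Sketch: a probe fit on task A's (synthetic) hidden states transfers poorly
# to task B when the two tasks encode truthfulness along different directions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
dim = 32

def make_task(direction, n=300):
    """Hidden states whose truthfulness signal lies along `direction`."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, dim)) + np.outer(2 * y - 1, direction)
    return X, y

d_a = np.zeros(dim); d_a[:8] = 1.0    # task A's truthfulness direction
d_b = np.zeros(dim); d_b[8:16] = 1.0  # disjoint direction for task B

X_a, y_a = make_task(d_a)
X_b, y_b = make_task(d_b)
probe = LogisticRegression().fit(X_a, y_a)

in_task = roc_auc_score(y_a, probe.decision_function(X_a))  # high AUC
cross = roc_auc_score(y_b, probe.decision_function(X_b))    # near chance
print(round(in_task, 2), round(cross, 2))
```

If truthfulness were encoded universally, the cross-task AUC would stay high; near-chance transfer is the signature of task-specific encoding.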
Finally, we identified a significant discrepancy between the model’s external behavior and internal states, where it repeatedly outputs incorrect responses despite internally encoding the correct answer. It is possible that mechanisms favoring likelihood override those promoting truthfulness, as LLMs are trained to predict likely tokens, which does not necessarily align with factual accuracy. Our findings imply that these models already encode valuable information that could possibly be harnessed to reduce errors. Work by Chuang et al. (2024) shows promising results in this area, while a subsequent work by Gekhman et al. (2025) focused exclusively on this “hidden knowledge” phenomenon, formally defining it and studying its extent. In conclusion, our findings suggest that LLMs’ internal representations provide useful insights into their errors, highlight the complex link between the internal processes of models and their external outputs, and hopefully pave the way for further improvements in error detection and mitigation.
## 8 Reproducibility Statement
To ensure reproducibility of our work, we provide detailed instructions and the necessary code. The source code, including scripts for generating model answers, probing, resampling, and error type analysis, is available in the supplementary material, where we also provide command examples and the specific seeds used for experiment reproducibility. This repository includes documentation on how to set up the environment, download and preprocess datasets, and execute the experiments outlined in Sections 3–6 of the paper. Additionally, all datasets, models, and results generation steps are described in Appendix A.
#### Acknowledgments
This research was supported by the Israel Science Foundation (grant No. 448/20), an Azrieli Foundation Early Career Faculty Fellowship, an AI Alignment grant from Open Philanthropy, and a Google gift. HO is supported by the Apple AIML PhD fellowship. This research was funded by the European Union (ERC, Control-LM, 101165402). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
## References
- Allauzen (2007) Alexandre Allauzen. Error detection in confusion network. In 8th Annual Conference of the International Speech Communication Association, INTERSPEECH 2007, Antwerp, Belgium, August 27-31, 2007, pp. 1749–1752. ISCA, 2007. doi: 10.21437/INTERSPEECH.2007-490. URL https://doi.org/10.21437/Interspeech.2007-490.
- Azaria & Mitchell (2023) Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 967–976, 2023.
- Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
- Belinkov (2021) Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances, 2021. URL https://arxiv.org/abs/2102.12452.
- Bell et al. (2019) Samuel J. Bell, Helen Yannakoudakis, and Marek Rei. Context is key: Grammatical error detection with contextual word representations. In Helen Yannakoudakis, Ekaterina Kochmar, Claudia Leacock, Nitin Madnani, Ildikó Pilán, and Torsten Zesch (eds.), Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2019, Florence, Italy, August 2, 2019, pp. 103–115. Association for Computational Linguistics, 2019. doi: 10.18653/V1/W19-4410. URL https://doi.org/10.18653/v1/w19-4410.
- Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Brunner et al. (2020) Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. On identifiability in transformers. In 8th International Conference on Learning Representations (ICLR 2020)(virtual). International Conference on Learning Representations, 2020.
- Bürger et al. (2024) Lennart Bürger, Fred A Hamprecht, and Boaz Nadler. Truth is universal: Robust detection of lies in llms. arXiv preprint arXiv:2407.12831, 2024.
- Burns et al. (2022) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
- Caines et al. (2020) Andrew Caines, Christian Bentz, Kate M. Knill, Marek Rei, and Paula Buttery. Grammatical error detection in transcriptions of spoken english. In Donia Scott, Núria Bel, and Chengqing Zong (eds.), Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pp. 2144–2162. International Committee on Computational Linguistics, 2020. doi: 10.18653/V1/2020.COLING-MAIN.195. URL https://doi.org/10.18653/v1/2020.coling-main.195.
- CH-Wang et al. (2023) Sky CH-Wang, Benjamin Van Durme, Jason Eisner, and Chris Kedzie. Do androids know they’re only dreaming of electric sheep?, 2023.
- Chen et al. (2024) Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Zj12nzlQbz.
- Chen et al. (2013) Wei Chen, Sankaranarayanan Ananthakrishnan, Rohit Kumar, Rohit Prasad, and Prem Natarajan. ASR error detection in a conversational spoken language translation system. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013, pp. 7418–7422. IEEE, 2013. doi: 10.1109/ICASSP.2013.6639104. URL https://doi.org/10.1109/ICASSP.2013.6639104.
- Cheng & Duan (2020) Yong Cheng and Mofan Duan. Chinese grammatical error detection based on BERT model. In Erhong YANG, Endong XUN, Baolin ZHANG, and Gaoqi RAO (eds.), Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 108–113, Suzhou, China, December 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.nlptea-1.15.
- Chuang et al. (2024) Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Th6NyL07na.
- Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.
- Errattahi et al. (2015) Rahhal Errattahi, Asmaa El Hannani, and Hassan Ouahmane. Automatic speech recognition errors detection and correction: A review. In Mourad Abbas and Ahmed Abdelali (eds.), 1st International Conference on Natural Language and Speech Processing, ICNLSP 2015, Algiers, Algeria, October 18-19, 2015, volume 128 of Procedia Computer Science, pp. 32–37. Elsevier, 2015. doi: 10.1016/J.PROCS.2018.03.005. URL https://doi.org/10.1016/j.procs.2018.03.005.
- Flickinger et al. (2016) Dan Flickinger, Michael Wayne Goodman, and Woodley Packard. Uw-stanford system description for AESW 2016 shared task on grammatical error detection. In Joel R. Tetreault, Jill Burstein, Claudia Leacock, and Helen Yannakoudakis (eds.), Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@NAACL-HLT 2016, June 16, 2016, San Diego, California, USA, pp. 105–111. The Association for Computer Linguistics, 2016. doi: 10.18653/V1/W16-0511. URL https://doi.org/10.18653/v1/w16-0511.
- Gao et al. (2023) Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16477–16508, 2023.
- Gekhman et al. (2020) Zorik Gekhman, Roee Aharoni, Genady Beryozkin, Markus Freitag, and Wolfgang Macherey. KoBE: Knowledge-based machine translation evaluation. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3200–3207, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.287. URL https://aclanthology.org/2020.findings-emnlp.287.
- Gekhman et al. (2022) Zorik Gekhman, Dina Zverinski, Jonathan Mallinson, and Genady Beryozkin. RED-ACE: Robust error detection for ASR using confidence embeddings. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2800–2808, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.180. URL https://aclanthology.org/2022.emnlp-main.180.
- Gekhman et al. (2023) Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. TrueTeacher: Learning factual consistency evaluation with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2053–2070, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.127. URL https://aclanthology.org/2023.emnlp-main.127.
- Gekhman et al. (2024) Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations?, 2024.
- Gekhman et al. (2025) Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. Inside-out: Hidden factual knowledge in llms. arXiv preprint arXiv:2503.15299, 2025.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.
- Gottesman & Geva (2024) Daniela Gottesman and Mor Geva. Estimating knowledge in large language models without generating a single token. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, Florida, 2024. Association for Computational Linguistics.
- Guerreiro et al. (2023) Nuno M Guerreiro, Elena Voita, and André FT Martins. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 1059–1075, 2023.
- Harnad (2024) Stevan Harnad. Language writ large: Llms, chatgpt, grounding, meaning and understanding. arXiv preprint arXiv:2402.02243, 2024.
- Honovich et al. (2021) Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. $q^{2}$ : Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7856–7870, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.619. URL https://aclanthology.org/2021.emnlp-main.619.
- Honovich et al. (2022) Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, and Yossi Matias. TRUE: Re-evaluating factual consistency evaluation. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3905–3920, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.287. URL https://aclanthology.org/2022.naacl-main.287.
- Huang et al. (2023a) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023a.
- Huang et al. (2023b) Yuheng Huang, Jiayang Song, Zhijie Wang, Huaming Chen, and Lei Ma. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236, 2023b.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxiv.org/abs/2310.06825.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, 2017.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- Kasewa et al. (2018) Sudhanshu Kasewa, Pontus Stenetorp, and Sebastian Riedel. Wronging a right: Generating better errors to improve grammatical error detection. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 4977–4983. Association for Computational Linguistics, 2018. URL https://aclanthology.org/D18-1541/.
- Kotek et al. (2023) Hadas Kotek, Rikker Dockum, and David Sun. Gender bias and stereotypes in large language models. In Proceedings of the ACM collective intelligence conference, pp. 12–24, 2023.
- Kryscinski et al. (2020) Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the factual consistency of abstractive text summarization. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9332–9346, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.750. URL https://aclanthology.org/2020.emnlp-main.750.
- Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.
- Laban et al. (2022) Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177, 2022. doi: 10.1162/tacl_a_00453. URL https://aclanthology.org/2022.tacl-1.10.
- Levinstein & Herrmann (2024) Benjamin A Levinstein and Daniel A Herrmann. Still no lie detector for language models: Probing empirical and conceptual roadblocks. Philosophical Studies, pp. 1–27, 2024.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- Li et al. (2024) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024.
- Li & Wang (2024) Wei Li and Houfeng Wang. Detection-correction structure via general language model for grammatical error correction. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pp. 1748–1763. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.acl-long.96.
- Liang et al. (2024) Xun Liang, Shichao Song, Zifan Zheng, Hanyu Wang, Qingchen Yu, Xunkai Li, Rong-Hua Li, Yi Wang, Zhonghao Wang, Feiyu Xiong, et al. Internal consistency and self-feedback in large language models: A survey. arXiv preprint arXiv:2407.14507, 2024.
- Lin et al. (2021) Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Liu et al. (2023) Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4791–4797, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.291. URL https://aclanthology.org/2023.emnlp-main.291.
- Liu et al. (2022) Tianyu Liu, Yizhe Zhang, Chris Brockett, Yi Mao, Zhifang Sui, Weizhu Chen, and Bill Dolan. A token-level reference-free hallucination detection benchmark for free-form text generation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6723–6737, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.464. URL https://aclanthology.org/2022.acl-long.464.
- Lo (2019) Chi-kiu Lo. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, and Karin Verspoor (eds.), Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 507–513, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5358. URL https://aclanthology.org/W19-5358.
- Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
- Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 9004–9017, 2023.
- Marks & Tegmark (2023) Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
- McGowan et al. (2023) Alessia McGowan, Yunlai Gui, Matthew Dobbs, Sophia Shuster, Matthew Cotter, Alexandria Selloni, Marianne Goodman, Agrima Srivastava, Guillermo A Cecchi, and Cheryl M Corcoran. Chatgpt and bard exhibit spontaneous citation fabrication during psychiatry literature search. Psychiatry Research, 326:115334, 2023.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022. arXiv:2202.05262.
- Millidge (2023) Beren Millidge. LLMs confabulate not hallucinate. Beren’s Blog, March 2023. URL https://www.beren.io/2023-03-19-LLMs-confabulate-not-hallucinate/.
- Mishra & Kaur (2013) Ritika Mishra and Navjot Kaur. A survey of spelling error detection and correction techniques. International Journal of Computer Trends and Technology, 4(3):372–374, 2013.
- nostalgebraist (2020) nostalgebraist. Interpreting gpt: The logit lens. LessWrong blog post, 2020. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2024-11-18.
- Olah et al. (2023) Chris Olah, Nelson Elhage, Neel Nanda, Catherine Schubert, Daniel Filan, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Pellegrini & Trancoso (2009) Thomas Pellegrini and Isabel Trancoso. Error detection in broadcast news ASR using markov chains. In Zygmunt Vetulani (ed.), Human Language Technology. Challenges for Computer Science and Linguistics - 4th Language and Technology Conference, LTC 2009, Poznan, Poland, November 6-8, 2009, Revised Selected Papers, volume 6562 of Lecture Notes in Computer Science, pp. 59–69. Springer, 2009. doi: 10.1007/978-3-642-20095-3_6. URL https://doi.org/10.1007/978-3-642-20095-3_6.
- Pu et al. (2021) Amy Pu, Hyung Won Chung, Ankur Parikh, Sebastian Gehrmann, and Thibault Sellam. Learning compact metrics for MT. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 751–762, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.58. URL https://aclanthology.org/2021.emnlp-main.58.
- Rao et al. (2020) Gaoqi Rao, Erhong Yang, and Baolin Zhang. Overview of NLPTEA-2020 shared task for Chinese grammatical error diagnosis. In Erhong YANG, Endong XUN, Baolin ZHANG, and Gaoqi RAO (eds.), Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 25–35, Suzhou, China, December 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.nlptea-1.4.
- Rateike et al. (2023) Miriam Rateike, Celia Cintas, John Wamburu, Tanya Akumu, and Skyler Speakman. Weakly supervised detection of hallucinations in llm activations. arXiv preprint arXiv:2312.02798, 2023.
- Rawte et al. (2023) Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, SM Tonmoy, Aman Chadha, Amit P Sheth, and Amitava Das. The troubling emergence of hallucination in large language models – an extensive definition, quantification, and prescriptive remediations. arXiv preprint arXiv:2310.04988, 2023.
- Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A neural framework for MT evaluation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.213. URL https://aclanthology.org/2020.emnlp-main.213.
- Rei et al. (2022a) Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, and Marcos Zampieri (eds.), Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 578–585, Abu Dhabi, United Arab Emirates (Hybrid), December 2022a. Association for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.52.
- Rei et al. (2022b) Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task. In Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, and Marcos Zampieri (eds.), Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 634–645, Abu Dhabi, United Arab Emirates (Hybrid), December 2022b. Association for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.60.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Salles et al. (2020) Arleen Salles, Kathinka Evers, and Michele Farisco. Anthropomorphism in AI. AJOB Neuroscience, 11(2):88–95, 2020.
- Scialom et al. (2021) Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. QuestEval: Summarization asks for fact-based evaluation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6594–6604, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.529. URL https://aclanthology.org/2021.emnlp-main.529.
- Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7881–7892, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.704. URL https://aclanthology.org/2020.acl-main.704.
- Serapio-García et al. (2023) Greg Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. Personality traits in large language models. arXiv preprint arXiv:2307.00184, 2023.
- Simhi et al. (2024) Adi Simhi, Jonathan Herzig, Idan Szpektor, and Yonatan Belinkov. Constructing benchmarks and interventions for combating hallucinations in llms, 2024.
- Slobodkin et al. (2023) Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, and Shauli Ravfogel. The curious case of hallucinatory (un)answerability: Finding truths in the hidden states of over-confident large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3607–3625, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.220. URL https://aclanthology.org/2023.emnlp-main.220.
- Snyder et al. (2023) Ben Snyder, Marius Moisescu, and Muhammad Bilal Zafar. On early detection of hallucinations in factual question answering, 2023. URL https://arxiv.org/abs/2312.14183.
- Sun et al. (2024) Yuhong Sun, Zhangyue Yin, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Hui Zhao. Benchmarking hallucination in large language models based on unanswerable math word problem. CoRR, 2024.
- Taubenfeld et al. (2025) Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in llms. arXiv preprint arXiv:2502.06233, 2025.
- Tian et al. (2023a) Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. arXiv preprint arXiv:2311.08401, 2023a.
- Tian et al. (2023b) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975, 2023b.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Varshney et al. (2023) Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation, 2023.
- Venkit et al. (2024) Pranav Narayanan Venkit, Tatiana Chakravorti, Vipul Gupta, Heidi Biggs, Mukund Srinath, Koustava Goswami, Sarah Rajtmajer, and Shomir Wilson. "Confidently nonsensical?": A critical survey on the perspectives and challenges of "hallucinations" in NLP. arXiv preprint arXiv:2404.07461, 2024.
- Wang & Sennrich (2020) Chaojun Wang and Rico Sennrich. On exposure bias, hallucination and domain shift in neural machine translation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3544–3552, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.326. URL https://aclanthology.org/2020.acl-main.326.
- Wang & Tan (2020) Quanbin Wang and Ying Tan. Grammatical error detection with self attention by pairwise training. In 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020, pp. 1–7. IEEE, 2020. doi: 10.1109/IJCNN48605.2020.9206715. URL https://doi.org/10.1109/IJCNN48605.2020.9206715.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, 2018.
- Yin et al. (2024) Fan Yin, Jayanth Srinivasa, and Kai-Wei Chang. Characterizing truthfulness in large language model generations with local intrinsic dimension. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024.
- Yona et al. (2024) Gal Yona, Roee Aharoni, and Mor Geva. Can large language models faithfully express their intrinsic uncertainty in words?, 2024. URL https://arxiv.org/abs/2405.16908.
- Yuksekgonul et al. (2023) Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, and Besmira Nushi. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. In The Twelfth International Conference on Learning Representations, 2023.
- Zhang et al. (2019) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1441–1451, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1139. URL https://aclanthology.org/P19-1139.
- Zhao et al. (2018) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018.
- Zhou et al. (2005) Lina Zhou, Yongmei Shi, Jinjuan Feng, and Andrew Sears. Data mining for detecting errors in dictation speech recognition. IEEE Trans. Speech Audio Process., 13(5-1):681–688, 2005. doi: 10.1109/TSA.2005.851874. URL https://doi.org/10.1109/TSA.2005.851874.
- Zou et al. (2023) Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to ai transparency, 2023. URL https://arxiv.org/abs/2310.01405.
## Appendix A Implementation Details
### A.1 Task Specific Error Detection
In this work, we specifically address errors produced by modern large language models (LLMs). Given the diverse range of tasks these models are applied to, our focus is on general error detection across all categories, rather than isolating specific types. Prior to the emergence of LLMs, much research targeted error detection for specific tasks, with common examples including grammatical errors (Kasewa et al., 2018; Bell et al., 2019; Cheng & Duan, 2020; Wang & Tan, 2020; Flickinger et al., 2016), spelling mistakes (Mishra & Kaur, 2013), machine translation inaccuracies (Lo, 2019; Pu et al., 2021; Sellam et al., 2020; Gekhman et al., 2020; Rei et al., 2020; 2022a; 2022b), speech recognition faults (Caines et al., 2020; Rao et al., 2020; Li & Wang, 2024; Zhou et al., 2005; Allauzen, 2007; Gekhman et al., 2022; Errattahi et al., 2015; Pellegrini & Trancoso, 2009; Chen et al., 2013), and factual consistency failures (Honovich et al., 2022; Laban et al., 2022; Honovich et al., 2021; Gekhman et al., 2023; Scialom et al., 2021; Kryscinski et al., 2020).
### A.2 Probing: Implementation Details
We examine the intermediate representations of the exact answer tokens generated by a large language model (LLM) during the answer generation process. The intermediate representation selected for this analysis is the output of the final multi-layer perceptron (MLP). This choice is based on preliminary experiments comparing the MLP output, the residual stream, and the attention heads, which showed no significant differences among them. We leave an in-depth analysis to future work.
For the probing classifier, we employ a logistic regression model from the scikit-learn library (Pedregosa et al., 2011) with its default hyperparameters, namely an L2 norm penalty and the LBFGS solver. We initially experimented with other hyperparameters and did not find a significant difference. For each random seed, the dataset was split into training and validation sets in an 80–20 ratio, and the test dataset was bootstrap sampled.
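As a concrete illustration, here is a minimal sketch of this probing setup, assuming hidden states have already been extracted from the LLM; all function and variable names are illustrative, not taken from the released code:

```python
# Sketch of the probing setup: a logistic regression probe
# (scikit-learn defaults: L2 penalty, LBFGS solver) trained on
# hidden states to predict answer correctness, with an 80-20
# train/validation split per random seed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_probe(hidden_states, correctness_labels, seed=0):
    """hidden_states: (n_samples, d_model) activations at a chosen
    layer/token; correctness_labels: 0/1 per generated answer."""
    X_train, X_val, y_train, y_val = train_test_split(
        hidden_states, correctness_labels,
        test_size=0.2, random_state=seed)  # 80-20 split
    probe = LogisticRegression()  # default hyperparameters
    probe.fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, probe.predict_proba(X_val)[:, 1])
    return probe, val_auc

# Toy demonstration with synthetic "activations" in place of real ones:
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))
w = rng.normal(size=16)
y = (X @ w + 0.1 * rng.normal(size=400) > 0).astype(int)
probe, auc = train_probe(X, y)
```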
Obtaining correctness labels for the probing dataset.
An answer is generally considered correct if it includes the correct answer label and appears before any alternative incorrect labels. We manually analyzed the results of this heuristic to confirm that it is accurate in almost all cases. However, one exception is the Natural Questions with Context (NQ_WC) dataset, where we identified false negatives and thus deployed a more precise validation using an instruct LLM, as demonstrated below:
Evaluate the following answers to questions. For each question you would be given an LLM answer and the correct answer. You would have to determine if the LLM answer is correct or not. If the LLM answer is correct, write "1" and if it is not correct, write "0". For example:
Question: [Question 1]
Ground Truth: [Gold label 1]
LLM Answer: [LLM long answer 1]
Correctness: 0
Question: [Question 2]
Ground Truth: [Gold label 2]
LLM Answer: [LLM long answer 2]
Correctness: 1
Question: [Question]
Ground Truth: [Label]
LLM Answer: [LLM long answer]
Correctness:
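The string-matching heuristic described above (an answer counts as correct if the gold label appears in it before any competing incorrect label) might be sketched as follows; the function name and the lowercase normalization are our assumptions, and real matching may need further normalization:

```python
def is_correct(generated, gold_label, all_labels):
    """Return True if gold_label occurs in the generated text before
    any competing (incorrect) label. A rough sketch of the heuristic."""
    text = generated.lower()
    gold_pos = text.find(gold_label.lower())
    if gold_pos == -1:
        return False  # gold label never appears
    for label in all_labels:
        if label.lower() == gold_label.lower():
            continue
        pos = text.find(label.lower())
        if pos != -1 and pos < gold_pos:
            return False  # an incorrect label appears first
    return True

# Illustrative IMDB-style labels:
#   is_correct("The review is positive.", "positive",
#              ["positive", "negative"]) is True
```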
Detecting and using exact answer tokens.
Exact answers are identified from a lengthy generated answer using an external algorithm, which processes the question and the LLMâs response, $A(q_{i},\hat{y_{i}})$ , to extract the exact answer. After extraction, we identify the exact answer tokens via a simple search process, focusing on four key tokens: the one before the first exact answer token, the first and last exact answer tokens, and the one after the last.
For the implementation of $A$ that detects the exact locations of answer tokens, we use a combination of heuristic methods and an instruction-tuned LLM. Specifically, when the set of possible answers is finite, we rely on heuristics. For more open-ended scenarios, such as factual questions, we automatically locate the answer if it matches the gold label. Otherwise, we prompt an instruction-tuned LLM, specifically Mistral-7b-Instruct (Jiang et al., 2023), to identify and extract the exact answer substring using the following prompt:
Extract from the following long answer the short answer, only the relevant tokens. If the long answer does not answer the question, output NO ANSWER.
Q: [Question 1]
A: [LLM long answer 1]
Exact answer: [Short exact answer 1]
Q: [Question 2]
A: [LLM long answer that does not answer the question]
Exact answer: NO ANSWER
Q: [Question]
A: [LLM long answer]
Exact answer:
To extract a valid exact answer from a long response, we prompt the instruct LLM up to five times. This process involves verifying that the exact answer is a substring of the long answer unless the instruct LLM indicates that there is no answer. To avoid bias in our probing task, we only retain questions for which a valid exact answer was successfully extracted. This ensures there is no unfair correlation between invalid answers and incorrect answers in the experiments.
We note the following: (a) While it is possible to use an instruct LLM to extract every answer regardless of its correctness, we chose the aforementioned strategy to improve the efficiency of our experiments; (b) This is just one possible implementation. For each LLM, one could use the same LLM to extract its own exact answer token, as demonstrated in a proof-of-concept over 1000 samples of TriviaQA in Table 3. Alternatively, it may be more effective to train a smaller system specifically designed for detecting exact answer tokens, which would be more suitable for real-world scenarios. We choose to keep the extraction process as abstract as possible, as our primary focus is not on the specific implementation, but on analyzing the potential gains from probing these locations.
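The validated extraction loop described above (up to five attempts, with a substring check unless the model outputs NO ANSWER) can be sketched as follows; `query_llm` is a placeholder for the actual instruct-LLM call, not part of any real API:

```python
def extract_exact_answer(question, long_answer, query_llm, max_tries=5):
    """Prompt an instruct LLM (via the caller-supplied `query_llm`)
    up to `max_tries` times, accepting a candidate only if it is
    "NO ANSWER" or a substring of the long answer. Returns None if
    no valid extraction is produced, in which case the sample is
    dropped from the probing dataset."""
    for _ in range(max_tries):
        candidate = query_llm(question, long_answer).strip()
        if candidate == "NO ANSWER" or candidate in long_answer:
            return candidate
    return None
```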
Additionally, if the exact answer token is not among the first generated tokens, we examine the token immediately preceding it (âbefore exact answer tokenâ). If the exact answer token is not the last one, we also examine the following token. When the exact answer spans multiple tokens, the first and last exact answer tokens are probed separately.
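The selection of the four probed token positions described above, including the boundary caveats, can be sketched as a small helper (a simplified, illustrative implementation):

```python
def probed_positions(answer_token_ids, exact_start, exact_end):
    """Given the index span [exact_start, exact_end] (inclusive) of
    the exact answer inside the generated answer's token ids, return
    the probed positions: the token before the first exact-answer
    token, the first and last exact-answer tokens, and the token
    after the last. Positions outside the sequence are skipped."""
    positions = {}
    if exact_start > 0:  # a token precedes the exact answer
        positions["before_exact"] = exact_start - 1
    positions["exact_first"] = exact_start
    positions["exact_last"] = exact_end  # equals exact_first if one token
    if exact_end < len(answer_token_ids) - 1:  # a token follows it
        positions["after_exact"] = exact_end + 1
    return positions
```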
Table 3: Success rate of extracting exact answer from a long model answer. Each model is used to extract answers from its own output.
| Mistral-7b | Mistral-Instruct-7b | Llama3-8b | Llama3-Instruct-8b |
| --- | --- | --- | --- |
| 0.99 | 0.96 | 0.99 | 0.95 |
### A.3 Datasets
We outline here all ten datasets that we investigate in our work. In our analysis, we aimed to cover a wide range of tasks, the skills required to solve them, and a diversity of datasets, and as a result also different LLM limitations, such as factual inaccuracies (often referred to as "hallucinations"), biases, arithmetic mistakes, and more. For each dataset, we explain how it covers something different from all the previous datasets. For all datasets, we present the LLM with no instruction or a short one, a context (if one exists for the task), and let it generate free text. We follow this paradigm as it better mimics real-world usage of LLMs by humans, as opposed to using few-shot prompting to force a short answer generated at the first token (Yuksekgonul et al., 2023; Chen et al., 2024; Simhi et al., 2024). One exception is sentiment analysis (IMDB), for which we apply 1-shot prompting so that the LLM uses the allowed labels; without it, the model did not follow the instruction, and we could not determine whether an answer was correct even with manual analysis. Additionally, we used different prompting strategies for the instruct and non-instruct LLMs. For the exact formats used to prompt each dataset and LLM, refer to our code implementation at https://github.com/technion-cs-nlp/LLMsKnow.
For each dataset we used a split of 10K training samples and 10K test samples, unless the dataset is too small, in which case we mention the size.
- TriviaQA (Joshi et al., 2017): a collection of trivia question-answer pairs. The questions are presented to the LLM without any context, allowing it to generate responses based solely on its internal, parametric knowledge. The dataset includes various acceptable variations of the correct answer, which are used to automatically evaluate the accuracy of the generated responses.
- HotpotQA (Yang et al., 2018): a dataset designed for diverse multi-hop question answering. Each entry includes Wikipedia documents that help answer the questions. We use two different settings: (1) without context, where questions are asked directly, which covers slightly different skills from TriviaQA, as it requires reasoning in addition to factual knowledge; and (2) with context (HotpotQA_WC), where the additional context is provided, emphasizing the ability to adhere to and utilize contextual information to solve the task.
- Movies: to further investigate generalization, we focused on a case of classic "hallucinations", involving factual knowledge, within a non-diverse dataset. This approach allowed us to test whether generalization to other types of errors is influenced by the type of error (factual versus others) or by the dataset's diversity. For this purpose, we created the movies dataset consisting of prompts in the form: "Who acted as [figure name] in the movie [movie name]?" The figures, movies, and correct answers were sourced from "The Movies Dataset" in Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset, which is based on the MovieLens website.
- Winogrande (Sakaguchi et al., 2021): we use this dataset to explore errors in common-sense reasoning. It consists of Winograd-style coreference challenges, where each example presents a sentence containing two entities and a pronoun. The objective is to determine which entity the pronoun refers to, relying on common-sense reasoning. For example, in the sentence "The trophy doesn't fit into the suitcase because it's too large," the pronoun "it" refers to the trophy, not the suitcase.
- Winobias (Zhao et al., 2018): this benchmark focuses on coreference resolution in the context of gender bias, revealing a different type of limitation in LLMs. Each example consists of two professions: one stereotypically male and one stereotypically female, along with a gendered pronoun. The task requires the LLM to determine which profession the pronoun refers to. The sentences are unambiguous, with one correct answer. In some cases, the correct answer aligns with the stereotype, while in others, it is anti-stereotypical. For example, in the sentence "The developer argued with the designer because she did not like the design," "she" refers to the developer, which is an anti-stereotypical case since "developer" is considered a stereotypically male profession. Research has shown that LLMs often perform poorly on anti-stereotypical sentences (Zhao et al., 2018) and tend to base their decisions on stereotypes rather than on common-sense reasoning or linguistic rules (Kotek et al., 2023). Each split contains around 1500 samples.
- NLI (Natural Language Inference): NLI involves determining whether a given "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) based on a provided "premise." For this purpose, we use the MNLI dataset (Williams et al., 2018). NLI tasks address a distinct aspect of common-sense reasoning and are generally considered complex. This complexity allows us to investigate whether a model's generalization ability is related to the difficulty of the task it was trained on, or to other factors, such as the limited diversity of labels (NLI has only three valid labels) or the type of task.
- Math (Sun et al., 2024): this dataset includes both unanswerable and answerable math problems. In our study, we focus exclusively on the answerable problems, as our aim is to assess the correctness of the LLMâs outputs, which requires a known correct answer (gold standard). This task introduces an additional, previously unexplored skill of arithmetic reasoning. The train-test split consists of approximately 2,000 and 650 samples, respectively.
- IMDB (Maas et al., 2011): contains movie reviews used for the task of sentiment classification.
- Natural Questions With Context (Kwiatkowski et al., 2019): the Natural Questions (NQ) dataset is designed to evaluate and train automatic question-answering systems. It consists of real, anonymized queries submitted by users to Google, with answers extracted from Wikipedia, as well as the relevant Wikipedia pages which can be given in context. We included this dataset to introduce an additional challenge that requires adherence to context, complementing the HotpotQA with context dataset.
### A.4 Baselines: Implementation Details
Aggregated probabilities / logits.
Inspired by prior work (Kadavath et al., 2022; Guerreiro et al., 2023), we compute an aggregated score using the log-probabilities or raw probabilities of the generated text tokens $y_{1},y_{2},\ldots,y_{N}$ produced by the generative large language model (LLM). For instance, the following formulation is used to compute the Logits-mean baseline on the entire generated answer:
$$
\frac{1}{N}\sum_{i=1}^{N}\mathbb{P}(y_{i}\mid Q,y_{1},\ldots,y_{i-1}) \tag{1}
$$
We also explore aggregation strategies that focus solely on the exact answer tokens (PE-Exact). Following Varshney et al. (2023), we also experiment with aggregating the minimum and maximum values (PE-[Min/Max]-[Exact]), alongside the mean aggregation described in Equation 1.
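These aggregation baselines can be sketched as a single helper that takes the per-token probabilities $\mathbb{P}(y_i \mid Q, y_1,\ldots,y_{i-1})$ and applies the chosen reduction, optionally restricted to the exact-answer span; the function name and keyword arguments are illustrative:

```python
def aggregate_confidence(token_probs, how="mean", exact_span=None):
    """Aggregate per-token probabilities into a single confidence
    score, as in Equation 1. `how` selects mean/min/max aggregation;
    `exact_span=(start, end)` (inclusive) restricts aggregation to
    the exact-answer tokens, as in the PE-Exact variants."""
    if exact_span is not None:
        start, end = exact_span
        token_probs = token_probs[start:end + 1]
    if how == "mean":
        return sum(token_probs) / len(token_probs)
    if how == "min":
        return min(token_probs)
    if how == "max":
        return max(token_probs)
    raise ValueError(f"unknown aggregation: {how}")
```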
P(True):
We follow Kadavath et al. (2022) and prompt the LLM to judge whether its own answer is correct, using the template below from that work:
Question: [Question]
Proposed Answer: [LLM long answer]
Is the proposed answer:
(A) True
(B) False
The proposed answer is:
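Converting the model's preference between the two continuations into a P(True) score can be sketched as a simple normalization over the two options; obtaining the two log-probabilities from the LLM is model-specific and not shown here:

```python
import math

def p_true(logprob_true, logprob_false):
    """Normalize the model's log-probabilities for the "(A) True"
    and "(B) False" continuations of the P(True) prompt into a
    single confidence score in [0, 1]. A sketch of the scoring step."""
    a = math.exp(logprob_true)
    b = math.exp(logprob_false)
    return a / (a + b)
```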
## Appendix B Full Error Detection Results
Figure 6 presents the AUC values of a trained probe across layers and tokens for Mistral-7b-instruct, showing a similar pattern across all datasets. We also observe similar patterns across other models; see our repository https://github.com/technion-cs-nlp/LLMsKnow for the figures.
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/hotpotqa_auc.png Details</summary>

![](extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/hotpotqa_auc.png)

AUC of the trained probe as a heatmap over layers (0–30) and probed token positions (last_q, first_answer, second_answer, exact_answer_before_first, exact_answer_first, exact_answer_last, exact_answer_after_last); the color scale spans 0.5 to 1.0.
</details>
(a) HotpotQA
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/hotpotqa_with_context_auc.png Details</summary>

[Heatmap: AUC of the probe error detector per layer and token position; see the Figure 6 caption.]
</details>
(b) HotpotQA with context
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/movies_auc.png Details</summary>

[Heatmap: AUC of the probe error detector per layer and token position; see the Figure 6 caption.]
</details>
(c) Movies
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/winogrande_auc.png Details</summary>

[Heatmap: AUC of the probe error detector per layer and token position; see the Figure 6 caption.]
</details>
(d) Winogrande
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/mnli_auc.png Details</summary>

[Heatmap: AUC of the probe error detector per layer and token position; see the Figure 6 caption.]
</details>
(e) NLI
<details>
<summary>extracted/6450693/figures/probing_heatmaps/mistral-7b-instruct/imdb_auc.png Details</summary>

[Heatmap: AUC of the probe error detector per layer and token position; see the Figure 6 caption.]
</details>
(f) IMDB
Figure 6: AUC of a probe-based error detector across layers and tokens for Mistral-7B-Instruct. Detection performance spikes at the exact-answer tokens.
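Each cell of such a heatmap reports the held-out AUC of a linear probe trained on hidden states from one (layer, token) position. The following is a minimal sketch of that computation, not the authors' exact implementation; `probe_auc` and its arguments are hypothetical names, and it assumes the hidden states have already been extracted from the model.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_auc(hidden_states, labels, seed=0):
    """Train a linear probe on hidden states taken from a single
    (layer, token) position and report held-out AUC for predicting
    whether the generated answer was correct.

    hidden_states: (n_examples, d_model) array of residual-stream vectors
    labels: (n_examples,) binary array, 1 = answer judged correct
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, labels, test_size=0.2,
        random_state=seed, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # AUC of the probe's correctness score on held-out examples
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

Sweeping this over all layers and token positions yields one AUC value per heatmap cell.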
Tables 4, 5, 6, and 7 present the full error-detection results across all baselines and datasets; they are consistent with the results reported in the main paper.
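The Logits-* and Probas-* rows in these tables score an answer by aggregating per-token confidence values (log-probabilities or probabilities) with a mean, min, or max; the "-exact" variants restrict the aggregation to the exact-answer tokens. A minimal sketch of this aggregation, assuming the per-token scores are already available (`aggregate_confidence` and `exact_slice` are hypothetical names):

```python
import numpy as np

def aggregate_confidence(token_scores, how="min", exact_slice=None):
    """Aggregate per-token confidence scores into one scalar.

    token_scores: per-token log-probabilities (Logits-*) or
                  probabilities (Probas-*) of the generated answer
    how: "mean", "min", or "max"
    exact_slice: optional (start, end) index pair restricting the
                 aggregation to the exact-answer tokens ("-exact" rows)
    """
    x = np.asarray(token_scores, dtype=float)
    if exact_slice is not None:
        x = x[exact_slice[0]:exact_slice[1]]
    return {"mean": x.mean(), "min": x.min(), "max": x.max()}[how]
```

The AUC columns are then computed by ranking answers by this scalar against their correctness labels.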
Table 4: Comparison of error detection performance (AUC) on Mistral-7B.
| | Mistral-7B | | | | |
| --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | Movies | IMDB |
| Logits-mean | $0.67$ $\pm 0.004$ | $0.49$ $\pm 0.010$ | $0.41$ $\pm 0.015$ | $0.67$ $\pm 0.007$ | $0.88$ $\pm 0.064$ |
| Logits-mean-exact | $0.67$ $\pm 0.004$ | $0.50$ $\pm 0.010$ | $0.56$ $\pm 0.026$ | $0.68$ $\pm 0.008$ | $0.57$ $\pm 0.080$ |
| Logits-min | $0.80$ $\pm 0.003$ | $0.45$ $\pm 0.014$ | $0.48$ $\pm 0.021$ | $0.73$ $\pm 0.006$ | $0.78$ $\pm 0.056$ |
| Logits-min-exact | $0.80$ $\pm 0.005$ | $0.53$ $\pm 0.014$ | $0.78$ $\pm 0.032$ | $0.72$ $\pm 0.005$ | $0.57$ $\pm 0.080$ |
| Logits-max | $0.53$ $\pm 0.008$ | $0.49$ $\pm 0.010$ | $0.42$ $\pm 0.023$ | $0.54$ $\pm 0.005$ | $0.83$ $\pm 0.076$ |
| Logits-max-exact | $0.54$ $\pm 0.009$ | $0.50$ $\pm 0.010$ | $0.40$ $\pm 0.024$ | $0.58$ $\pm 0.007$ | $0.57$ $\pm 0.080$ |
| Probas-mean | $0.76$ $\pm 0.003$ | $0.53$ $\pm 0.018$ | $0.66$ $\pm 0.016$ | $0.72$ $\pm 0.007$ | $0.87$ $\pm 0.041$ |
| Probas-mean-exact | $0.78$ $\pm 0.002$ | $0.55$ $\pm 0.014$ | $0.62$ $\pm 0.016$ | $0.74$ $\pm 0.007$ | $0.83$ $\pm 0.057$ |
| Probas-min | $0.82$ $\pm 0.003$ | $0.52$ $\pm 0.013$ | $0.82$ $\pm 0.020$ | $0.73$ $\pm 0.006$ | $0.86$ $\pm 0.032$ |
| Probas-min-exact | **0.85** $\pm 0.003$ | $0.58$ $\pm 0.011$ | $0.84$ $\pm 0.015$ | $0.74$ $\pm 0.006$ | $0.83$ $\pm 0.057$ |
| Probas-max | $0.53$ $\pm 0.008$ | $0.50$ $\pm 0.016$ | $0.43$ $\pm 0.025$ | $0.55$ $\pm 0.008$ | $0.80$ $\pm 0.074$ |
| Probas-max-exact | $0.55$ $\pm 0.009$ | $0.51$ $\pm 0.013$ | $0.39$ $\pm 0.019$ | $0.59$ $\pm 0.009$ | $0.83$ $\pm 0.057$ |
| p(True) | $0.57$ $\pm 0.007$ | $0.53$ $\pm 0.019$ | $0.56$ $\pm 0.027$ | $0.51$ $\pm 0.003$ | $0.65$ $\pm 0.004$ |
| p(True)-exact | $0.56$ $\pm 0.006$ | $0.55$ $\pm 0.026$ | $0.57$ $\pm 0.036$ | $0.52$ $\pm 0.003$ | $0.65$ $\pm 0.003$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.83$ $\pm 0.002$ | $0.65$ $\pm 0.008$ | $0.82$ $\pm 0.023$ | $0.79$ $\pm 0.002$ | $0.85$ $\pm 0.007$ |
| Before last generated [-2] | $0.82$ $\pm 0.003$ | $0.84$ $\pm 0.012$ | $0.83$ $\pm 0.019$ | $0.78$ $\pm 0.003$ | $0.95$ $\pm 0.004$ |
| End of question | $0.74$ $\pm 0.005$ | $0.78$ $\pm 0.012$ | $0.83$ $\pm 0.016$ | $0.77$ $\pm 0.002$ | $0.81$ $\pm 0.009$ |
| Exact answer last | $0.84$ $\pm 0.005$ | **0.89** $\pm 0.007$ | **0.96** $\pm 0.008$ | $0.78$ $\pm 0.003$ | **0.95** $\pm 0.004$ |
| Exact answer last+1 | $0.84$ $\pm 0.004$ | $0.84$ $\pm 0.012$ | $0.95$ $\pm 0.010$ | **0.80** $\pm 0.002$ | $0.85$ $\pm 0.007$ |
| | HotpotQA | HotpotQA-WC | Winogrande | NLI | NQ-WC |
| Logits-mean | $0.63$ $\pm 0.005$ | $0.52$ $\pm 0.009$ | $0.49$ $\pm 0.004$ | $0.51$ $\pm 0.004$ | $0.69$ $\pm 0.006$ |
| Logits-mean-exact | $0.57$ $\pm 0.008$ | $0.52$ $\pm 0.007$ | $0.50$ $\pm 0.003$ | **0.93** $\pm 0.004$ | $0.72$ $\pm 0.005$ |
| Logits-min | $0.72$ $\pm 0.008$ | $0.59$ $\pm 0.006$ | $0.50$ $\pm 0.007$ | $0.53$ $\pm 0.005$ | $0.65$ $\pm 0.009$ |
| Logits-min-exact | $0.72$ $\pm 0.007$ | $0.65$ $\pm 0.004$ | $0.51$ $\pm 0.007$ | $0.49$ $\pm 0.006$ | $0.70$ $\pm 0.005$ |
| Logits-max | $0.54$ $\pm 0.007$ | $0.49$ $\pm 0.010$ | $0.48$ $\pm 0.005$ | $0.48$ $\pm 0.005$ | $0.59$ $\pm 0.012$ |
| Logits-max-exact | $0.48$ $\pm 0.010$ | $0.44$ $\pm 0.007$ | $0.50$ $\pm 0.003$ | $0.48$ $\pm 0.005$ | $0.58$ $\pm 0.009$ |
| Probas-mean | $0.65$ $\pm 0.004$ | $0.55$ $\pm 0.006$ | $0.51$ $\pm 0.007$ | $0.49$ $\pm 0.003$ | $0.63$ $\pm 0.008$ |
| Probas-mean-exact | $0.62$ $\pm 0.006$ | $0.56$ $\pm 0.007$ | $0.51$ $\pm 0.005$ | $0.02$ $\pm 0.001$ | $0.66$ $\pm 0.007$ |
| Probas-min | $0.73$ $\pm 0.005$ | $0.58$ $\pm 0.007$ | $0.52$ $\pm 0.009$ | $0.53$ $\pm 0.004$ | $0.63$ $\pm 0.011$ |
| Probas-min-exact | $0.78$ $\pm 0.005$ | $0.66$ $\pm 0.004$ | $0.52$ $\pm 0.008$ | $0.49$ $\pm 0.005$ | $0.69$ $\pm 0.006$ |
| Probas-max | $0.54$ $\pm 0.008$ | $0.49$ $\pm 0.007$ | $0.50$ $\pm 0.005$ | $0.47$ $\pm 0.004$ | $0.52$ $\pm 0.004$ |
| Probas-max-exact | $0.48$ $\pm 0.010$ | $0.44$ $\pm 0.005$ | $0.50$ $\pm 0.004$ | $0.48$ $\pm 0.003$ | $0.53$ $\pm 0.012$ |
| p(True) | $0.55$ $\pm 0.007$ | $0.54$ $\pm 0.006$ | $0.51$ $\pm 0.005$ | $0.51$ $\pm 0.003$ | $0.52$ $\pm 0.008$ |
| p(True)-exact | $0.61$ $\pm 0.005$ | $0.54$ $\pm 0.006$ | $0.61$ $\pm 0.006$ | $0.51$ $\pm 0.006$ | $0.53$ $\pm 0.014$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.78$ $\pm 0.006$ | $0.67$ $\pm 0.004$ | $0.51$ $\pm 0.007$ | $0.77$ $\pm 0.004$ | $0.78$ $\pm 0.003$ |
| Before last generated [-2] | $0.79$ $\pm 0.007$ | $0.69$ $\pm 0.007$ | $0.66$ $\pm 0.004$ | $0.81$ $\pm 0.002$ | $0.75$ $\pm 0.006$ |
| End of question | $0.72$ $\pm 0.007$ | $0.56$ $\pm 0.003$ | $0.51$ $\pm 0.007$ | $0.88$ $\pm 0.004$ | $0.70$ $\pm 0.005$ |
| Exact answer last | $0.80$ $\pm 0.008$ | **0.74** $\pm 0.007$ | **0.69** $\pm 0.006$ | $0.84$ $\pm 0.004$ | $0.81$ $\pm 0.009$ |
| Exact answer last+1 | **0.81** $\pm 0.008$ | $0.72$ $\pm 0.005$ | $0.59$ $\pm 0.005$ | $0.75$ $\pm 0.006$ | **0.84** $\pm 0.007$ |
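The p(True) rows follow the self-evaluation recipe of Kadavath et al. (2022): the model is asked whether its own answer is correct, and the score is the normalized probability of the "True" continuation. A minimal sketch of the scoring step, assuming the two next-token log-probabilities are available (how they are obtained is model-specific; `p_true` is a hypothetical name):

```python
import math

def p_true(logprob_true, logprob_false):
    """Normalized probability of the "True" token when the model judges
    its own answer, given the next-token log-probabilities of the two
    candidate continuations "True" and "False"."""
    p_t, p_f = math.exp(logprob_true), math.exp(logprob_false)
    return p_t / (p_t + p_f)
```

As with the other baselines, the AUC is computed by ranking answers by this score against correctness labels.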
Table 5: Comparison of error detection performance (AUC) on Mistral-7B-Instruct.
| | Mistral-7B-Instruct | | | | |
| --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | Movies | IMDB |
| Logits-mean | $0.60$ $\pm 0.009$ | $0.56$ $\pm 0.017$ | $0.55$ $\pm 0.029$ | $0.63$ $\pm 0.005$ | $0.57$ $\pm 0.006$ |
| Logits-mean-exact | $0.68$ $\pm 0.007$ | $0.54$ $\pm 0.012$ | $0.51$ $\pm 0.005$ | $0.70$ $\pm 0.004$ | $0.87$ $\pm 0.007$ |
| Logits-min | $0.63$ $\pm 0.008$ | $0.59$ $\pm 0.012$ | $0.51$ $\pm 0.017$ | $0.66$ $\pm 0.008$ | $0.52$ $\pm 0.007$ |
| Logits-min-exact | $0.75$ $\pm 0.006$ | $0.53$ $\pm 0.013$ | $0.71$ $\pm 0.009$ | $0.74$ $\pm 0.005$ | $0.87$ $\pm 0.007$ |
| Logits-max | $0.54$ $\pm 0.005$ | $0.53$ $\pm 0.012$ | $0.54$ $\pm 0.039$ | $0.54$ $\pm 0.004$ | $0.47$ $\pm 0.004$ |
| Logits-max-exact | $0.55$ $\pm 0.004$ | $0.54$ $\pm 0.011$ | $0.32$ $\pm 0.015$ | $0.61$ $\pm 0.006$ | $0.87$ $\pm 0.007$ |
| Probas-mean | $0.60$ $\pm 0.007$ | $0.58$ $\pm 0.018$ | $0.56$ $\pm 0.028$ | $0.61$ $\pm 0.002$ | $0.54$ $\pm 0.008$ |
| Probas-mean-exact | $0.71$ $\pm 0.003$ | $0.57$ $\pm 0.015$ | $0.71$ $\pm 0.014$ | $0.74$ $\pm 0.006$ | $0.84$ $\pm 0.007$ |
| Probas-min | $0.59$ $\pm 0.008$ | $0.58$ $\pm 0.014$ | $0.50$ $\pm 0.025$ | $0.60$ $\pm 0.008$ | $0.51$ $\pm 0.010$ |
| Probas-min-exact | $0.74$ $\pm 0.004$ | $0.57$ $\pm 0.016$ | $0.75$ $\pm 0.011$ | $0.73$ $\pm 0.006$ | $0.84$ $\pm 0.007$ |
| Probas-max | $0.50$ $\pm 0.006$ | $0.41$ $\pm 0.010$ | $0.53$ $\pm 0.009$ | $0.51$ $\pm 0.005$ | $0.48$ $\pm 0.004$ |
| Probas-max-exact | $0.51$ $\pm 0.007$ | $0.54$ $\pm 0.010$ | $0.45$ $\pm 0.015$ | $0.60$ $\pm 0.003$ | $0.84$ $\pm 0.007$ |
| p(True) | $0.68$ $\pm 0.005$ | $0.45$ $\pm 0.021$ | $0.48$ $\pm 0.026$ | $0.62$ $\pm 0.005$ | $0.62$ $\pm 0.009$ |
| p(True)-exact | $0.74$ $\pm 0.003$ | $0.40$ $\pm 0.021$ | $0.60$ $\pm 0.025$ | $0.69$ $\pm 0.008$ | $0.60$ $\pm 0.009$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.71$ $\pm 0.006$ | $0.82$ $\pm 0.004$ | $0.74$ $\pm 0.008$ | $0.72$ $\pm 0.005$ | $0.92$ $\pm 0.010$ |
| Before last generated [-2] | $0.73$ $\pm 0.004$ | $0.85$ $\pm 0.004$ | $0.74$ $\pm 0.007$ | $0.72$ $\pm 0.006$ | $0.94$ $\pm 0.006$ |
| End of question | $0.76$ $\pm 0.008$ | $0.82$ $\pm 0.011$ | $0.72$ $\pm 0.007$ | $0.74$ $\pm 0.003$ | $0.96$ $\pm 0.006$ |
| Exact answer last | $0.85$ $\pm 0.004$ | **0.92** $\pm 0.005$ | **0.92** $\pm 0.008$ | $0.81$ $\pm 0.003$ | **0.97** $\pm 0.005$ |
| Exact answer last+1 | **0.86** $\pm 0.006$ | $0.88$ $\pm 0.006$ | $0.90$ $\pm 0.010$ | **0.82** $\pm 0.003$ | $0.96$ $\pm 0.006$ |
| | HotpotQA | HotpotQA-WC | Winogrande | NLI | NQ-WC |
| Logits-mean | $0.61$ $\pm 0.002$ | $0.55$ $\pm 0.009$ | $0.59$ $\pm 0.004$ | $0.64$ $\pm 0.006$ | $0.71$ $\pm 0.008$ |
| Logits-mean-exact | $0.66$ $\pm 0.009$ | $0.55$ $\pm 0.004$ | $0.49$ $\pm 0.004$ | $0.57$ $\pm 0.004$ | $0.69$ $\pm 0.009$ |
| Logits-min | $0.61$ $\pm 0.003$ | $0.53$ $\pm 0.013$ | $0.61$ $\pm 0.003$ | $0.62$ $\pm 0.002$ | $0.67$ $\pm 0.008$ |
| Logits-min-exact | $0.77$ $\pm 0.004$ | $0.67$ $\pm 0.013$ | $0.48$ $\pm 0.004$ | $0.54$ $\pm 0.005$ | $0.69$ $\pm 0.006$ |
| Logits-max | $0.53$ $\pm 0.008$ | $0.51$ $\pm 0.011$ | $0.52$ $\pm 0.006$ | $0.59$ $\pm 0.008$ | $0.63$ $\pm 0.011$ |
| Logits-max-exact | $0.51$ $\pm 0.011$ | $0.41$ $\pm 0.010$ | $0.49$ $\pm 0.007$ | $0.64$ $\pm 0.003$ | $0.63$ $\pm 0.013$ |
| Probas-mean | $0.63$ $\pm 0.003$ | $0.56$ $\pm 0.010$ | $0.58$ $\pm 0.005$ | $0.62$ $\pm 0.005$ | $0.68$ $\pm 0.010$ |
| Probas-mean-exact | $0.72$ $\pm 0.006$ | $0.66$ $\pm 0.010$ | $0.46$ $\pm 0.004$ | $0.57$ $\pm 0.003$ | $0.65$ $\pm 0.008$ |
| Probas-min | $0.58$ $\pm 0.003$ | $0.52$ $\pm 0.008$ | $0.59$ $\pm 0.002$ | $0.58$ $\pm 0.008$ | $0.65$ $\pm 0.014$ |
| Probas-min-exact | $0.76$ $\pm 0.004$ | $0.68$ $\pm 0.010$ | $0.46$ $\pm 0.005$ | $0.57$ $\pm 0.003$ | $0.66$ $\pm 0.008$ |
| Probas-max | $0.50$ $\pm 0.005$ | $0.53$ $\pm 0.003$ | $0.48$ $\pm 0.007$ | $0.52$ $\pm 0.007$ | $0.51$ $\pm 0.005$ |
| Probas-max-exact | $0.46$ $\pm 0.010$ | $0.46$ $\pm 0.010$ | $0.48$ $\pm 0.004$ | $0.53$ $\pm 0.004$ | $0.52$ $\pm 0.018$ |
| p(True) | $0.54$ $\pm 0.006$ | $0.54$ $\pm 0.004$ | $0.53$ $\pm 0.003$ | $0.58$ $\pm 0.003$ | $0.57$ $\pm 0.006$ |
| p(True)-exact | $0.60$ $\pm 0.008$ | $0.48$ $\pm 0.005$ | $0.57$ $\pm 0.011$ | $0.65$ $\pm 0.004$ | $0.57$ $\pm 0.009$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.72$ $\pm 0.005$ | $0.64$ $\pm 0.005$ | $0.74$ $\pm 0.005$ | $0.85$ $\pm 0.004$ | $0.82$ $\pm 0.006$ |
| Before last generated [-2] | $0.73$ $\pm 0.006$ | $0.64$ $\pm 0.004$ | $0.76$ $\pm 0.004$ | $0.87$ $\pm 0.002$ | $0.84$ $\pm 0.009$ |
| End of question | $0.80$ $\pm 0.003$ | $0.63$ $\pm 0.003$ | $0.71$ $\pm 0.007$ | $0.79$ $\pm 0.004$ | $0.85$ $\pm 0.010$ |
| Exact answer last | $0.85$ $\pm 0.003$ | $0.75$ $\pm 0.006$ | **0.84** $\pm 0.005$ | **0.93** $\pm 0.003$ | $0.86$ $\pm 0.003$ |
| Exact answer last+1 | **0.85** $\pm 0.002$ | **0.76** $\pm 0.004$ | $0.80$ $\pm 0.004$ | $0.92$ $\pm 0.004$ | **0.87** $\pm 0.006$ |
Table 6: Comparison of error detection performance (AUC) on Llama-8b.
| | Llama-8b | | | | |
| --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | Movies | IMDB |
| Logits-mean | $0.58$ $\pm 0.006$ | $0.44$ $\pm 0.015$ | $0.43$ $\pm 0.026$ | $0.64$ $\pm 0.008$ | $0.77$ $\pm 0.007$ |
| Logits-mean-exact | $0.63$ $\pm 0.007$ | $0.50$ $\pm 0.015$ | $0.50$ $\pm 0.028$ | $0.64$ $\pm 0.008$ | $0.77$ $\pm 0.007$ |
| Logits-min | $0.75$ $\pm 0.007$ | $0.50$ $\pm 0.022$ | $0.45$ $\pm 0.042$ | $0.73$ $\pm 0.005$ | $0.73$ $\pm 0.007$ |
| Logits-min-exact | $0.76$ $\pm 0.003$ | $0.53$ $\pm 0.009$ | $0.75$ $\pm 0.022$ | $0.73$ $\pm 0.005$ | $0.77$ $\pm 0.007$ |
| Logits-max | $0.48$ $\pm 0.006$ | $0.48$ $\pm 0.009$ | $0.42$ $\pm 0.027$ | $0.53$ $\pm 0.005$ | $0.72$ $\pm 0.007$ |
| Logits-max-exact | $0.52$ $\pm 0.007$ | $0.49$ $\pm 0.014$ | $0.35$ $\pm 0.026$ | $0.53$ $\pm 0.005$ | $0.77$ $\pm 0.007$ |
| Probas-mean | $0.64$ $\pm 0.006$ | $0.41$ $\pm 0.008$ | $0.61$ $\pm 0.029$ | $0.71$ $\pm 0.007$ | $0.70$ $\pm 0.008$ |
| Probas-mean-exact | $0.72$ $\pm 0.005$ | $0.50$ $\pm 0.018$ | $0.54$ $\pm 0.026$ | $0.72$ $\pm 0.006$ | $0.88$ $\pm 0.003$ |
| Probas-min | $0.79$ $\pm 0.008$ | $0.43$ $\pm 0.004$ | $0.75$ $\pm 0.044$ | $0.74$ $\pm 0.005$ | $0.68$ $\pm 0.005$ |
| Probas-min-exact | $0.82$ $\pm 0.003$ | $0.53$ $\pm 0.014$ | $0.78$ $\pm 0.022$ | $0.74$ $\pm 0.005$ | $0.88$ $\pm 0.003$ |
| Probas-max | $0.49$ $\pm 0.006$ | $0.50$ $\pm 0.009$ | $0.46$ $\pm 0.032$ | $0.53$ $\pm 0.007$ | $0.60$ $\pm 0.009$ |
| Probas-max-exact | $0.53$ $\pm 0.008$ | $0.50$ $\pm 0.018$ | $0.36$ $\pm 0.032$ | $0.54$ $\pm 0.007$ | $0.88$ $\pm 0.003$ |
| p(True) | $0.62$ $\pm 0.005$ | $0.48$ $\pm 0.011$ | $0.53$ $\pm 0.027$ | $0.61$ $\pm 0.005$ | $0.51$ $\pm 0.010$ |
| p(True)-exact | $0.67$ $\pm 0.002$ | $0.53$ $\pm 0.017$ | $0.63$ $\pm 0.028$ | $0.58$ $\pm 0.005$ | $0.52$ $\pm 0.008$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.77$ $\pm 0.005$ | $0.59$ $\pm 0.024$ | $0.83$ $\pm 0.013$ | $0.82$ $\pm 0.005$ | $0.94$ $\pm 0.002$ |
| Before last generated [-2] | $0.76$ $\pm 0.012$ | $0.58$ $\pm 0.021$ | $0.82$ $\pm 0.032$ | $0.79$ $\pm 0.004$ | $0.96$ $\pm 0.002$ |
| End of question | $0.73$ $\pm 0.005$ | $0.77$ $\pm 0.012$ | $0.80$ $\pm 0.027$ | $0.78$ $\pm 0.005$ | $0.68$ $\pm 0.009$ |
| Exact answer last | **0.82** $\pm 0.006$ | **0.91** $\pm 0.007$ | **0.96** $\pm 0.010$ | $0.80$ $\pm 0.005$ | **0.97** $\pm 0.001$ |
| Exact answer last+1 | $0.82$ $\pm 0.006$ | $0.86$ $\pm 0.008$ | $0.95$ $\pm 0.007$ | **0.82** $\pm 0.006$ | $0.95$ $\pm 0.003$ |
| | HotpotQA | HotpotQA-WC | Winogrande | NLI | NQ-WC |
| Logits-mean | $0.65$ $\pm 0.004$ | $0.62$ $\pm 0.006$ | $0.48$ $\pm 0.003$ | $0.47$ $\pm 0.002$ | $0.53$ $\pm 0.010$ |
| Logits-mean-exact | $0.55$ $\pm 0.003$ | $0.54$ $\pm 0.006$ | $0.49$ $\pm 0.004$ | $0.48$ $\pm 0.002$ | $0.58$ $\pm 0.009$ |
| Logits-min | $0.57$ $\pm 0.004$ | $0.49$ $\pm 0.003$ | $0.48$ $\pm 0.003$ | $0.48$ $\pm 0.007$ | $0.58$ $\pm 0.009$ |
| Logits-min-exact | $0.69$ $\pm 0.002$ | $0.68$ $\pm 0.006$ | $0.49$ $\pm 0.003$ | $0.48$ $\pm 0.007$ | $0.61$ $\pm 0.010$ |
| Logits-max | $0.61$ $\pm 0.005$ | $0.60$ $\pm 0.004$ | $0.48$ $\pm 0.003$ | $0.52$ $\pm 0.003$ | $0.51$ $\pm 0.008$ |
| Logits-max-exact | $0.47$ $\pm 0.003$ | $0.46$ $\pm 0.005$ | $0.49$ $\pm 0.004$ | $0.51$ $\pm 0.002$ | $0.54$ $\pm 0.005$ |
| Probas-mean | $0.67$ $\pm 0.002$ | $0.62$ $\pm 0.006$ | $0.49$ $\pm 0.002$ | $0.48$ $\pm 0.004$ | $0.57$ $\pm 0.003$ |
| Probas-mean-exact | $0.62$ $\pm 0.005$ | $0.56$ $\pm 0.005$ | $0.51$ $\pm 0.002$ | $0.46$ $\pm 0.006$ | $0.64$ $\pm 0.007$ |
| Probas-min | $0.62$ $\pm 0.006$ | $0.51$ $\pm 0.002$ | $0.49$ $\pm 0.003$ | $0.50$ $\pm 0.010$ | $0.62$ $\pm 0.005$ |
| Probas-min-exact | $0.76$ $\pm 0.005$ | $0.67$ $\pm 0.004$ | $0.51$ $\pm 0.002$ | $0.50$ $\pm 0.010$ | $0.69$ $\pm 0.008$ |
| Probas-max | $0.61$ $\pm 0.004$ | $0.58$ $\pm 0.004$ | $0.48$ $\pm 0.002$ | $0.48$ $\pm 0.003$ | $0.51$ $\pm 0.012$ |
| Probas-max-exact | $0.49$ $\pm 0.003$ | $0.44$ $\pm 0.004$ | $0.51$ $\pm 0.003$ | $0.47$ $\pm 0.002$ | $0.56$ $\pm 0.005$ |
| p(True) | $0.52$ $\pm 0.007$ | $0.45$ $\pm 0.005$ | $0.54$ $\pm 0.004$ | $0.54$ $\pm 0.007$ | $0.56$ $\pm 0.006$ |
| p(True)-exact | $0.58$ $\pm 0.005$ | $0.50$ $\pm 0.007$ | $0.64$ $\pm 0.004$ | $0.62$ $\pm 0.005$ | $0.61$ $\pm 0.002$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.76$ $\pm 0.007$ | $0.57$ $\pm 0.006$ | $0.59$ $\pm 0.006$ | $0.89$ $\pm 0.002$ | $0.66$ $\pm 0.010$ |
| Before last generated [-2] | $0.74$ $\pm 0.007$ | $0.58$ $\pm 0.005$ | $0.59$ $\pm 0.005$ | $0.94$ $\pm 0.002$ | $0.63$ $\pm 0.008$ |
| End of question | $0.71$ $\pm 0.006$ | $0.53$ $\pm 0.004$ | $0.48$ $\pm 0.003$ | $0.91$ $\pm 0.001$ | $0.66$ $\pm 0.004$ |
| Exact answer last | $0.81$ $\pm 0.006$ | $0.77$ $\pm 0.004$ | **0.65** $\pm 0.004$ | **0.94** $\pm 0.002$ | **0.75** $\pm 0.008$ |
| Exact answer last+1 | **0.82** $\pm 0.004$ | **0.79** $\pm 0.001$ | $0.57$ $\pm 0.004$ | $0.90$ $\pm 0.002$ | $0.75$ $\pm 0.007$ |
Table 7: Comparison of error detection performance (AUC) on Llama-8b-Instruct.
| | Llama-8b-Instruct | | | | |
| --- | --- | --- | --- | --- | --- |
| | TriviaQA | Winobias | Math | Movies | IMDB |
| Logits-mean | $0.66$ $\pm 0.005$ | $0.60$ $\pm 0.026$ | $0.75$ $\pm 0.018$ | $0.75$ $\pm 0.005$ | $0.59$ $\pm 0.017$ |
| Logits-mean-exact | $0.71$ $\pm 0.006$ | $0.55$ $\pm 0.019$ | $0.80$ $\pm 0.021$ | $0.72$ $\pm 0.004$ | $0.88$ $\pm 0.012$ |
| Logits-min | $0.74$ $\pm 0.007$ | $0.61$ $\pm 0.024$ | $0.75$ $\pm 0.016$ | $0.71$ $\pm 0.005$ | $0.55$ $\pm 0.016$ |
| Logits-min-exact | $0.79$ $\pm 0.006$ | $0.61$ $\pm 0.019$ | $0.89$ $\pm 0.018$ | $0.77$ $\pm 0.006$ | $0.88$ $\pm 0.012$ |
| Logits-max | $0.54$ $\pm 0.007$ | $0.55$ $\pm 0.013$ | $0.73$ $\pm 0.027$ | $0.67$ $\pm 0.003$ | $0.51$ $\pm 0.009$ |
| Logits-max-exact | $0.58$ $\pm 0.005$ | $0.54$ $\pm 0.019$ | $0.64$ $\pm 0.014$ | $0.61$ $\pm 0.003$ | $0.88$ $\pm 0.012$ |
| Probas-mean | $0.67$ $\pm 0.006$ | $0.63$ $\pm 0.024$ | $0.66$ $\pm 0.033$ | $0.73$ $\pm 0.006$ | $0.73$ $\pm 0.015$ |
| Probas-mean-exact | $0.75$ $\pm 0.009$ | $0.61$ $\pm 0.014$ | $0.83$ $\pm 0.022$ | $0.74$ $\pm 0.005$ | $0.74$ $\pm 0.021$ |
| Probas-min | $0.67$ $\pm 0.009$ | $0.65$ $\pm 0.019$ | $0.64$ $\pm 0.036$ | $0.65$ $\pm 0.004$ | $0.57$ $\pm 0.016$ |
| Probas-min-exact | $0.79$ $\pm 0.008$ | $0.62$ $\pm 0.014$ | $0.86$ $\pm 0.024$ | $0.74$ $\pm 0.005$ | $0.74$ $\pm 0.021$ |
| Probas-max | $0.54$ $\pm 0.003$ | $0.49$ $\pm 0.020$ | $0.57$ $\pm 0.022$ | $0.64$ $\pm 0.006$ | $0.49$ $\pm 0.008$ |
| Probas-max-exact | $0.56$ $\pm 0.007$ | $0.55$ $\pm 0.016$ | $0.57$ $\pm 0.018$ | $0.61$ $\pm 0.003$ | $0.74$ $\pm 0.021$ |
| p(True) | $0.73$ $\pm 0.008$ | $0.59$ $\pm 0.020$ | $0.62$ $\pm 0.017$ | $0.66$ $\pm 0.004$ | $0.60$ $\pm 0.006$ |
| p(True)-exact | $0.73$ $\pm 0.005$ | $0.63$ $\pm 0.014$ | $0.59$ $\pm 0.018$ | $0.63$ $\pm 0.006$ | $0.76$ $\pm 0.004$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.81$ $\pm 0.005$ | $0.86$ $\pm 0.007$ | $0.82$ $\pm 0.016$ | $0.78$ $\pm 0.004$ | $0.81$ $\pm 0.014$ |
| Before last generated [-2] | $0.75$ $\pm 0.005$ | $0.88$ $\pm 0.005$ | $0.79$ $\pm 0.020$ | $0.82$ $\pm 0.005$ | $0.83$ $\pm 0.006$ |
| End of question | $0.77$ $\pm 0.007$ | $0.80$ $\pm 0.018$ | $0.72$ $\pm 0.023$ | $0.76$ $\pm 0.005$ | $0.87$ $\pm 0.006$ |
| Exact answer last | $\mathbf{0.83}$ $\pm 0.002$ | $\mathbf{0.93}$ $\pm 0.004$ | $\mathbf{0.95}$ $\pm 0.027$ | $0.85$ $\pm 0.005$ | $\mathbf{0.96}$ $\pm 0.003$ |
| Exact answer last+1 | $0.83$ $\pm 0.006$ | $0.90$ $\pm 0.005$ | $0.94$ $\pm 0.023$ | $\mathbf{0.86}$ $\pm 0.004$ | $0.95$ $\pm 0.004$ |
| | HotpotQA | HotpotQA-WC | Winogrande | NLI | NQ-WC |
| Logits-mean | $0.65$ $\pm 0.002$ | $0.56$ $\pm 0.004$ | $0.58$ $\pm 0.007$ | $0.59$ $\pm 0.009$ | $0.65$ $\pm 0.006$ |
| Logits-mean-exact | $0.66$ $\pm 0.008$ | $0.57$ $\pm 0.005$ | $0.48$ $\pm 0.003$ | $0.49$ $\pm 0.010$ | $0.67$ $\pm 0.005$ |
| Logits-min | $0.67$ $\pm 0.008$ | $0.55$ $\pm 0.007$ | $0.60$ $\pm 0.008$ | $0.53$ $\pm 0.009$ | $0.68$ $\pm 0.004$ |
| Logits-min-exact | $0.76$ $\pm 0.010$ | $0.65$ $\pm 0.010$ | $0.48$ $\pm 0.004$ | $0.50$ $\pm 0.009$ | $0.68$ $\pm 0.004$ |
| Logits-max | $0.59$ $\pm 0.005$ | $0.56$ $\pm 0.005$ | $0.46$ $\pm 0.004$ | $0.55$ $\pm 0.013$ | $0.56$ $\pm 0.006$ |
| Logits-max-exact | $0.52$ $\pm 0.006$ | $0.48$ $\pm 0.002$ | $0.48$ $\pm 0.003$ | $0.49$ $\pm 0.009$ | $0.63$ $\pm 0.008$ |
| Probas-mean | $0.61$ $\pm 0.002$ | $0.56$ $\pm 0.010$ | $0.57$ $\pm 0.007$ | $0.58$ $\pm 0.007$ | $0.65$ $\pm 0.007$ |
| Probas-mean-exact | $0.68$ $\pm 0.008$ | $0.65$ $\pm 0.006$ | $0.51$ $\pm 0.006$ | $0.57$ $\pm 0.009$ | $0.67$ $\pm 0.003$ |
| Probas-min | $0.60$ $\pm 0.004$ | $0.51$ $\pm 0.007$ | $0.59$ $\pm 0.007$ | $0.55$ $\pm 0.005$ | $0.64$ $\pm 0.008$ |
| Probas-min-exact | $0.74$ $\pm 0.007$ | $0.67$ $\pm 0.007$ | $0.51$ $\pm 0.006$ | $0.59$ $\pm 0.008$ | $0.66$ $\pm 0.004$ |
| Probas-max | $0.56$ $\pm 0.005$ | $0.53$ $\pm 0.005$ | $0.46$ $\pm 0.003$ | $0.51$ $\pm 0.004$ | $0.55$ $\pm 0.004$ |
| Probas-max-exact | $0.49$ $\pm 0.007$ | $0.47$ $\pm 0.002$ | $0.51$ $\pm 0.005$ | $0.50$ $\pm 0.009$ | $0.62$ $\pm 0.006$ |
| p(True) | $0.55$ $\pm 0.005$ | $0.55$ $\pm 0.008$ | $0.47$ $\pm 0.002$ | $0.54$ $\pm 0.006$ | $0.71$ $\pm 0.003$ |
| p(True)-exact | $0.55$ $\pm 0.004$ | $0.50$ $\pm 0.005$ | $0.50$ $\pm 0.008$ | $0.50$ $\pm 0.003$ | $0.67$ $\pm 0.007$ |
| Probe @ token | | | | | |
| Last generated [-1] | $0.77$ $\pm 0.005$ | $0.68$ $\pm 0.006$ | $0.69$ $\pm 0.006$ | $0.78$ $\pm 0.005$ | $0.77$ $\pm 0.009$ |
| Before last generated [-2] | $0.76$ $\pm 0.002$ | $0.69$ $\pm 0.005$ | $0.67$ $\pm 0.008$ | $0.79$ $\pm 0.004$ | $0.75$ $\pm 0.007$ |
| End of question | $0.78$ $\pm 0.004$ | $0.60$ $\pm 0.003$ | $0.65$ $\pm 0.004$ | $0.74$ $\pm 0.002$ | $0.75$ $\pm 0.011$ |
| Exact answer last | $\mathbf{0.83}$ $\pm 0.005$ | $\mathbf{0.76}$ $\pm 0.003$ | $\mathbf{0.78}$ $\pm 0.007$ | $\mathbf{0.91}$ $\pm 0.005$ | $\mathbf{0.78}$ $\pm 0.006$ |
| Exact answer last+1 | $0.83$ $\pm 0.002$ | $0.76$ $\pm 0.006$ | $0.70$ $\pm 0.006$ | $0.90$ $\pm 0.004$ | $0.78$ $\pm 0.007$ |
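The "Probe @ token" rows are linear probes trained on the model's hidden state at a single token position (e.g., the last exact-answer token). A minimal sketch of such a probe, using synthetic vectors as stand-ins for the actual activations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: "hidden states" at a chosen token position and binary
# correctness labels. In the paper's setting, X would hold residual-stream
# activations at a chosen layer and y the correctness of the generated answers.
d_model, n = 64, 1000
w_true = rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))
y = (X @ w_true + rng.normal(scale=2.0, size=n) > 0).astype(int)

X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# Linear probe: logistic regression over the hidden state of one token position.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
```

AUC is used rather than accuracy so the numbers are comparable across datasets with different error rates.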
## Appendix C Full Generalization Results
Figures 7, 8 and 9 present the generalization results for the remaining models. While these results exhibit similar high-level patterns to those found in the main paper on Mistral-7b-instruct, notable differences suggest that these models may possess different mechanisms for encoding truthfulness.
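Each heatmap cell is obtained by training a probe on one dataset and measuring its AUC on another. A minimal sketch of that train-on-one, test-on-all loop, with synthetic features standing in for hidden states:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 32

def make_dataset(direction, n=400):
    # Simulate a dataset whose truthfulness signal lies along its own direction;
    # return (train split, test split).
    X = rng.normal(size=(n, d))
    y = (X @ direction > 0).astype(int)
    return (X[: n // 2], y[: n // 2]), (X[n // 2 :], y[n // 2 :])

datasets = {name: make_dataset(rng.normal(size=d)) for name in ["A", "B", "C"]}
names = list(datasets)

auc = np.zeros((len(names), len(names)))
for i, train_name in enumerate(names):
    (Xtr, ytr), _ = datasets[train_name]
    probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    for j, test_name in enumerate(names):
        _, (Xte, yte) = datasets[test_name]
        auc[i, j] = roc_auc_score(yte, probe.predict_proba(Xte)[:, 1])
```

When each dataset's signal lies along its own direction, as simulated here, diagonal AUC is high while off-diagonal AUC stays near chance, mirroring the limited cross-dataset generalization seen in the figures.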
[Heatmap: probe AUC for each (train dataset, test dataset) pair on Mistral-7b; rows and columns list TriviaQA, HotpotQA, Movies, Winobias, Winogrande, NLI, IMDB, Math, HotpotQA-WC, NQ-WC.]
(a) Raw AUC values. Values above $0.5$ indicate some generalization.
[Heatmap: AUC difference between the probe and the logit-based method for each (train dataset, test dataset) pair on Mistral-7b.]
(b) Performance (AUC) difference of the probe and the logit-based method. Values above $0$ indicate generalization beyond the logit-based method.
Figure 7: Generalization between datasets, Mistral-7b.
[Heatmap: probe AUC for each (train dataset, test dataset) pair on Llama-3-8b.]
(a) Raw AUC values. Values above $0.5$ indicate some generalization.
[Heatmap: AUC difference between the probe and the logit-based method for each (train dataset, test dataset) pair on Llama-3-8b.]
(b) Performance (AUC) difference of the probe and the logit-based method. Values above $0$ indicate generalization beyond the logit-based method.
Figure 8: Generalization between datasets, Llama-3-8b.
[Heatmap: probe AUC for each (train dataset, test dataset) pair on Llama-3-8b-instruct.]
(a) Raw AUC values. Values above $0.5$ indicate some generalization.
[Heatmap: AUC difference between the probe and the logit-based method for each (train dataset, test dataset) pair on Llama-3-8b-instruct.]
(b) Performance (AUC) difference of the probe and the logit-based method. Values above $0$ indicate generalization beyond the logit-based method.
Figure 9: Generalization between datasets, Llama-3-8b-instruct.
## Appendix D Taxonomy of Errors
[Line plot: correctness vs. number of retries, rising steeply from about 0.70 at one retry and plateauing near 0.86 by roughly 30 retries.]
Figure 10: The percentage of answers for which at least one generated answer was correct. The first step is greedy decoding.
Figure 10 presents, for each number of resamples, the percentage of questions for which at least one generated answer was correct. The experiment was conducted on Mistral-7b-instruct with the TriviaQA dataset. For many questions where greedy decoding fails to produce the correct answer, the LLM is still able to generate it in at least one resample. The plot plateaus around 30 resamples.
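The quantity plotted in Figure 10 is a coverage statistic: the fraction of questions with at least one correct answer among the first $k$ samples. A minimal sketch of how such a curve is computed from per-sample correctness flags (random stand-ins here, not actual model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# correct[i, j] = True if resample j of question i was judged correct.
# The first column plays the role of greedy decoding; the rest are samples.
n_questions, max_resamples = 500, 31
p_correct = rng.uniform(0.1, 0.9, size=n_questions)  # per-question difficulty
correct = rng.random((n_questions, max_resamples)) < p_correct[:, None]

# For each k, the fraction of questions with >= 1 correct answer among the
# first k samples; this is non-decreasing in k by construction.
coverage = [float(correct[:, :k].any(axis=1).mean())
            for k in range(1, max_resamples + 1)]
```

Diminishing returns arise naturally: once a question has one correct sample, further resamples cannot increase its contribution.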
### D.1 Error Taxonomy Design Choices
The error taxonomy proposed in this paper is intentionally non-orthogonal, as some errors may simultaneously belong to multiple categories. For instance, an error might fall under both "consistently incorrect" (e.g., the same incorrect answer appears at least 15 times) and "many different answers" (e.g., the remaining answers show over 10 distinct variants).
Our taxonomy is designed to capture such nuanced cases, as restricting classification to a single category would hinder the generalizability of insights. Instead, we aim to learn general properties across different error types, providing LLM providers with actionable insights into questions exhibiting overlapping error patterns.
To support this non-orthogonal framework, our probes function as one-to-many classifiers, enabling precise error analysis and tailored solutions.
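Concretely, "one-to-many" means one binary probe per error type, so a single question can receive several labels at once. A minimal multi-label sketch with synthetic features and hypothetical type names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
n, d = 600, 32
X = rng.normal(size=(n, d))

# Overlapping binary labels: one column per error type; a row may carry
# several 1s, reflecting the non-orthogonal taxonomy.
types = ["consistently_incorrect", "two_competing", "many_answers"]
W = rng.normal(size=(d, len(types)))
Y = (X @ W + rng.normal(size=(n, len(types))) > 0).astype(int)

# One independent binary probe per error type (one-vs-rest).
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X[:500], Y[:500])
pred = clf.predict(X[500:])  # indicator matrix, one column per type
```

Each column of `pred` is scored separately (e.g., per-type AUC as in Tables 8 and 9), so overlapping categories do not compete with one another.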
### D.2 Results on Additional Datasets
Table 8 presents the results of error type classification on the Winobias dataset and Table 9 on the Math dataset.
Table 8: AUC scores for error type classification (Winobias).
| Error type | Mistral-7b | Mistral-Instr-7b | Llama3-8b | Llama3-Instr-8b |
| --- | --- | --- | --- | --- |
| (A) Refuses to answer | - | - | - | - |
| (B) Consistently correct | $0.83\scriptscriptstyle{\pm 0.004}$ | $0.88\scriptscriptstyle{\pm 0.002}$ | $0.84\scriptscriptstyle{\pm 0.003}$ | $0.89\scriptscriptstyle{\pm 0.003}$ |
| (C) Consistently incorrect | $0.83\scriptscriptstyle{\pm 0.004}$ | $0.88\scriptscriptstyle{\pm 0.002}$ | $0.79\scriptscriptstyle{\pm 0.004}$ | $0.90\scriptscriptstyle{\pm 0.003}$ |
| (D) Two competing | $0.68\scriptscriptstyle{\pm 0.004}$ | $0.58\scriptscriptstyle{\pm 0.015}$ | $0.74\scriptscriptstyle{\pm 0.005}$ | $0.88\scriptscriptstyle{\pm 0.004}$ |
| (E) Many answers | - | - | - | - |
Table 9: AUC scores for error type classification (Math). Error types are predictable from the inner model representations, indicating the encoding of fine-grained information on errors.
| Error type | Mistral-7b | Mistral-Instr-7b | Llama3-8b | Llama3-Instr-8b |
| --- | --- | --- | --- | --- |
| (A) Refuses to answer | - | - | - | - |
| (B) Consistently correct | $0.85\scriptscriptstyle{\pm 0.017}$ | $0.84\scriptscriptstyle{\pm 0.007}$ | $0.83\scriptscriptstyle{\pm 0.020}$ | $0.87\scriptscriptstyle{\pm 0.006}$ |
| (C) Consistently incorrect | $0.85\scriptscriptstyle{\pm 0.026}$ | $0.85\scriptscriptstyle{\pm 0.003}$ | $0.69\scriptscriptstyle{\pm 0.032}$ | $0.91\scriptscriptstyle{\pm 0.007}$ |
| (D) Two competing | - | $0.76\scriptscriptstyle{\pm 0.020}$ | $0.57\scriptscriptstyle{\pm 0.001}$ | $0.79\scriptscriptstyle{\pm 0.006}$ |
| (E) Many answers | $0.74\scriptscriptstyle{\pm 0.010}$ | $0.79\scriptscriptstyle{\pm 0.015}$ | $0.69\scriptscriptstyle{\pm 0.041}$ | $0.90\scriptscriptstyle{\pm 0.008}$ |
### D.3 Qualitative Examples
Tables 10 and 11 present qualitative examples of the error types in the TriviaQA and Math datasets.
Table 10: Examples of error types in TriviaQA, Mistral-7B-Instruct. Correct answer is in bold.
| Type of error | Question | Answers |
| --- | --- | --- |
| Consistently correct | What clothing-part metaphorically classifies workers/jobs according to white or blue? | **collar**: $30$ |
| Consistently incorrect | Which town in southeast Wales became a UNESCO World Heritage Site in 2000? | **Blaenavon**: $1$, "Caerleon": $29$ |
| Many different answers | Published in 2013 who wrote the novel "The Kill List"? | **Frederick Forsyth**: $1$, "Jerry Patterson": $1$, "Edward Lee": $1$, "Barry Lancet": $4$, "Jeremy Holiday": $1$, "Barry Lincoff": $1$, "Jim Marrs": $1$, "John Marrs": $1$, "Anthony Lacy": $1$, "Daniel Kraus": $1$, "Ron Bass": $1$, "David Martiniello": $2$, "Eric Lustbader": $1$, "Barbie Latza Nadeau": $1$, "James Swallow": $1$, "Mark Sullivan": $1$, "Alex Binotto": $1$, "David Baldacci": $1$, "Bill Cosores": $1$, "Frederic J. Brown": $1$, "Ron Capps and Tate Foley": $1$, "Barbie Wilde": $1$, "NO ANSWER": $3$ |
| Two competing answers | What is the only letter of the alphabet which does not appear in any of the names of the 50 American states? | **The letter q**: $15$, "The letter X": $15$ |
Table 11: Examples of error types in Math, Mistral-7B-Instruct. Correct answer is in bold.
| Type of error | Question | Answers |
| --- | --- | --- |
| Consistently correct | If John travels 15 miles on a bike ride, and Jill travels 5 miles less, how many miles does Jim travel if he travels only 20% as far as Jill? | **2**: $30$ |
| Consistently incorrect | Joy has 30 pencils, and Colleen has 50 pencils. If they bought the pencils at $4 each at the store, how much more money did Colleen pay than Joy for her pencils? | **80$**: $1$, "16$": $29$ |
| Many different answers | If the first skyscraper was built 100 years ago, how many years in the future will it be 5 years before its 200th anniversary of being built? | **95**: $14$, "91": $1$, "87": $1$, "15": $2$, "96": $1$, "Six": $1$, "202": $1$, "2035": $1$, "195": $1$, "49": $1$, "101": $1$, "199": $1$, "3 years before the 200th anniversary": $1$, "203 years after it was built": $1$, "196": $1$, "2043": $1$ |
| Two competing answers | David did 27 more push-ups but 7 less crunches than Zachary in gym class today. If Zachary did 5 push-ups and 17 crunches, how many more crunches than push-ups did Zachary do? | **12**: $5$, "1": $5$ |
## Appendix E Detecting the Correct Answer Full Results
In Table 12 we present qualitative samples from Mistral-7B-Instruct for the phenomenon observed in error type (C2): the model is consistently incorrect but generates the correct answer at least once. The samples in the table represent cases where the probe chose the correct answer. Table 13 compares different decoding mechanisms, including the choice via probe, on the non-instruct models, and Table 14 compares them on the instruct models. For all datasets and models, we observe conclusions similar to those in the main paper: a significant improvement is observed for error types where the LLM shows no preference for the correct answer.
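The four answer-choice strategies compared in Tables 13 and 14 can be sketched as follows. Here `probe_score` stands in for a trained correctness probe over the exact answer tokens; the interface is a hypothetical simplification of the actual pipeline:

```python
import random
from collections import Counter

def choose_answer(samples, probe_score, strategy, greedy_answer=None):
    """Pick a final answer from repeated samples of the model.

    `samples` are sampled answer strings; `probe_score` maps an answer
    to the probe's correctness confidence (a stand-in for scoring the
    hidden states of the exact answer tokens).
    """
    if strategy == "greedy":
        # Greedy decoding produces one fixed answer per question.
        return greedy_answer if greedy_answer is not None else samples[0]
    if strategy == "random":
        return random.choice(samples)
    if strategy == "majority":
        return Counter(samples).most_common(1)[0][0]
    if strategy == "probing":
        # Rank the distinct sampled answers by probe confidence.
        return max(set(samples), key=probe_score)
    raise ValueError(f"unknown strategy: {strategy}")
```

On a (C2)-style question, majority voting returns the dominant wrong answer, whereas the probe can surface the rare correct one if it scores higher.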
Table 12: Examples of questions where Mistral-7b-Instruct consistently provided incorrect answers but occasionally generated the correct one. In these instances, the probe successfully identified the right answer. For each question, the model was sampled 30 times.
Table 13: Various answer choice strategies, non-instruct models.
| | Mistral-7b | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | TriviaQA | | | | Math | | | | Winobias | | | |
| Error type | Greedy | Random | Majority | Probing | Greedy | Random | Majority | Probing | Greedy | Random | Majority | Probing |
| All | $0.63$ $\pm 0.003$ | $0.54$ $\pm 0.004$ | $0.65$ $\pm 0.002$ | $0.62$ $\pm 0.003$ | $0.25$ $\pm 0.018$ | $0.36$ $\pm 0.022$ | $0.49$ $\pm 0.019$ | $0.60$ $\pm 0.017$ | $0.69$ $\pm 0.016$ | $0.58$ $\pm 0.009$ | $0.62$ $\pm 0.009$ | $0.83$ $\pm 0.006$ |
| (A) Refuses to answer | $0.08$ $\pm 0.015$ | $0.04$ $\pm 0.009$ | $0.00$ $\pm 0.000$ | $0.13$ $\pm 0.007$ | $0.01$ $\pm 0.009$ | $0.04$ $\pm 0.019$ | $0.00$ $\pm 0.000$ | $0.22$ $\pm 0.033$ | - | - | - | - |
| (B) Consistently correct | | | | | | | | | | | | |
| (B1) All | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | - | - | - | - | - | - | - | - |
| (B2) Most | $0.98$ $\pm 0.001$ | $0.84$ $\pm 0.009$ | $1.00$ $\pm 0.000$ | $0.91$ $\pm 0.002$ | $0.96$ $\pm 0.024$ | $0.84$ $\pm 0.031$ | $1.00$ $\pm 0.000$ | $0.86$ $\pm 0.041$ | $0.96$ $\pm 0.004$ | $0.73$ $\pm 0.009$ | $0.95$ $\pm 0.003$ | $0.91$ $\pm 0.009$ |
| (C) Consistently incorrect | | | | | | | | | | | | |
| (C1) All | $0.00$ $\pm 0.003$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | - | - | - | - | - | - | - | - |
| (C2) Most | $0.03$ $\pm 0.014$ | $0.20$ $\pm 0.008$ | $0.00$ $\pm 0.000$ | $0.27$ $\pm 0.036$ | - | - | - | - | $0.19$ $\pm 0.010$ | $0.30$ $\pm 0.026$ | $0.00$ $\pm 0.000$ | $0.70$ $\pm 0.007$ |
| (D) Two competing | $0.48$ $\pm 0.006$ | $0.36$ $\pm 0.008$ | $0.52$ $\pm 0.015$ | $0.54$ $\pm 0.016$ | - | - | - | - | $0.73$ $\pm 0.018$ | $0.54$ $\pm 0.022$ | $0.47$ $\pm 0.030$ | $0.85$ $\pm 0.019$ |
| (E) Many answers | | | | | | | | | | | | |
| (E1) Non correct | $0.01$ $\pm 0.004$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.01$ $\pm 0.010$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | - | - | - | - |
| (E2) Correct appears | $0.38$ $\pm 0.009$ | $0.21$ $\pm 0.006$ | $0.42$ $\pm 0.015$ | $0.38$ $\pm 0.009$ | $0.09$ $\pm 0.010$ | $0.17$ $\pm 0.034$ | $0.36$ $\pm 0.020$ | $0.62$ $\pm 0.035$ | - | - | - | - |
| Llama-8b | | | | | | | | | | | | |
| | TriviaQA | | | | Math | | | | Winobias | | | |
| Error type | Greedy | Sampling | Majority | Probing | Greedy | Sampling | Majority | Probing | Greedy | Sampling | Majority | Probing |
| All | $0.66$ $\pm 0.002$ | $0.58$ $\pm 0.003$ | 0.68 $\pm 0.003$ | 0.68 $\pm 0.002$ | $0.30$ $\pm 0.023$ | $0.47$ $\pm 0.022$ | $0.62$ $\pm 0.014$ | $0.70$ $\pm 0.021$ | $0.73$ $\pm 0.011$ | $0.61$ $\pm 0.005$ | $0.66$ $\pm 0.016$ | 0.84 $\pm 0.006$ |
| (A) Refuses to answer | $0.08$ $\pm 0.005$ | $0.07$ $\pm 0.011$ | $0.00$ $\pm 0.000$ | 0.16 $\pm 0.011$ | $0.00$ $\pm 0.007$ | $0.04$ $\pm 0.015$ | $0.00$ $\pm 0.000$ | $0.25$ $\pm 0.025$ | - | - | - | - |
| (B) Consistently correct | | | | | | | | | | | | |
| (B1) All | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | - | - | - | - | - | - | - | - |
| (B2) Most | $0.98$ $\pm 0.001$ | $0.87$ $\pm 0.002$ | 1.00 $\pm 0.000$ | $0.95$ $\pm 0.002$ | $0.77$ $\pm 0.024$ | $0.88$ $\pm 0.025$ | $1.00$ $\pm 0.000$ | $0.97$ $\pm 0.014$ | $0.98$ $\pm 0.005$ | $0.75$ $\pm 0.004$ | 1.00 $\pm 0.000$ | $0.94$ $\pm 0.003$ |
| (C) Consistently incorrect | | | | | | | | | | | | |
| (C1) All | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | - | - | - | - | - | - | - | - |
| (C2) Most | $0.06$ $\pm 0.013$ | $0.18$ $\pm 0.009$ | $0.00$ $\pm 0.000$ | 0.35 $\pm 0.043$ | - | - | - | - | $0.25$ $\pm 0.026$ | $0.29$ $\pm 0.023$ | $0.00$ $\pm 0.000$ | 0.65 $\pm 0.022$ |
| (D) Two competing | $0.44$ $\pm 0.029$ | $0.42$ $\pm 0.035$ | $0.53$ $\pm 0.020$ | 0.66 $\pm 0.030$ | - | - | - | - | $0.73$ $\pm 0.025$ | $0.47$ $\pm 0.019$ | $0.41$ $\pm 0.037$ | 0.86 $\pm 0.014$ |
| (E) Many answers | | | | | | | | | | | | |
| (E1) Non correct | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | - | - | - | - |
| (E2) Correct appears | $0.46$ $\pm 0.009$ | $0.34$ $\pm 0.009$ | $0.53$ $\pm 0.007$ | 0.54 $\pm 0.005$ | $0.14$ $\pm 0.015$ | $0.17$ $\pm 0.025$ | $0.44$ $\pm 0.047$ | $0.65$ $\pm 0.031$ | - | - | - | - |
Table 14: Various answer choice strategies, instruct models.
| | Mistral-7b-Instruct | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | TriviaQA | | | | Math | | | | Winobias | | | |
| Error type | Greedy | Random | Majority | Probing | Greedy | Random | Majority | Probing | Greedy | Random | Majority | Probing |
| All | $0.63$ $\pm 0.003$ | $0.64$ $\pm 0.002$ | $0.67$ $\pm 0.004$ | $0.71$ $\pm 0.003$ | $0.55$ $\pm 0.021$ | $0.52$ $\pm 0.019$ | $0.57$ $\pm 0.025$ | $0.70$ $\pm 0.014$ | $0.77$ $\pm 0.012$ | $0.77$ $\pm 0.008$ | $0.77$ $\pm 0.010$ | $0.79$ $\pm 0.008$ |
| (A) Refuses to answer | $0.06$ $\pm 0.005$ | $0.06$ $\pm 0.011$ | $0.00$ $\pm 0.000$ | $0.28$ $\pm 0.009$ | - | - | - | - | - | - | - | - |
| (B) Consistently correct | | | | | | | | | | | | |
| (B1) All | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ |
| (B2) Most | $0.88$ $\pm 0.007$ | $0.83$ $\pm 0.009$ | $0.99$ $\pm 0.002$ | $0.89$ $\pm 0.010$ | $0.87$ $\pm 0.013$ | $0.84$ $\pm 0.024$ | $1.00$ $\pm 0.000$ | $0.96$ $\pm 0.007$ | $0.91$ $\pm 0.031$ | $0.87$ $\pm 0.029$ | $0.96$ $\pm 0.017$ | $0.89$ $\pm 0.032$ |
| (C) Consistently incorrect | | | | | | | | | | | | |
| (C1) All | $0.00$ $\pm 0.003$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.05$ $\pm 0.020$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ |
| (C2) Most | $0.11$ $\pm 0.009$ | $0.15$ $\pm 0.012$ | $0.00$ $\pm 0.000$ | $0.53$ $\pm 0.005$ | $0.10$ $\pm 0.040$ | $0.20$ $\pm 0.050$ | $0.00$ $\pm 0.000$ | $0.82$ $\pm 0.037$ | $0.18$ $\pm 0.057$ | $0.20$ $\pm 0.039$ | $0.00$ $\pm 0.000$ | $0.54$ $\pm 0.067$ |
| (D) Two competing | $0.32$ $\pm 0.010$ | $0.45$ $\pm 0.023$ | $0.50$ $\pm 0.024$ | $0.78$ $\pm 0.017$ | - | - | - | - | - | - | - | - |
| (E) Many answers | | | | | | | | | | | | |
| (E1) Non correct | $0.01$ $\pm 0.003$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | - | - | - | - | - | - | - | - |
| (E2) Correct appears | $0.23$ $\pm 0.020$ | $0.19$ $\pm 0.022$ | $0.38$ $\pm 0.009$ | $0.56$ $\pm 0.025$ | - | - | - | - | - | - | - | - |
| Llama-8b-Instruct | | | | | | | | | | | | |
| | TriviaQA | | | | Math | | | | Winobias | | | |
| Error type | Greedy | Sampling | Majority | Probing | Greedy | Sampling | Majority | Probing | Greedy | Sampling | Majority | Probing |
| All | $0.69$ $\pm 0.003$ | $0.67$ $\pm 0.001$ | $0.71$ $\pm 0.002$ | 0.73 $\pm 0.004$ | $0.89$ $\pm 0.010$ | $0.87$ $\pm 0.012$ | 0.91 $\pm 0.013$ | 0.91 $\pm 0.010$ | $0.75$ $\pm 0.009$ | $0.74$ $\pm 0.009$ | $0.76$ $\pm 0.012$ | 0.83 $\pm 0.009$ |
| (A) Refuses to answer | $0.06$ $\pm 0.011$ | $0.05$ $\pm 0.011$ | $0.00$ $\pm 0.000$ | 0.27 $\pm 0.025$ | - | - | - | - | - | - | - | - |
| (B) Consistently correct | | | | | | | | | | | | |
| (B1) All | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ | $1.00$ $\pm 0.000$ |
| (B2) Most | $0.93$ $\pm 0.002$ | $0.86$ $\pm 0.009$ | 1.00 $\pm 0.001$ | $0.92$ $\pm 0.004$ | $0.94$ $\pm 0.014$ | $0.92$ $\pm 0.014$ | 1.00 $\pm 0.000$ | $0.95$ $\pm 0.013$ | $0.94$ $\pm 0.006$ | $0.88$ $\pm 0.010$ | $1.00$ $\pm 0.000$ | $0.93$ $\pm 0.011$ |
| (C) Consistently incorrect | | | | | | | | | | | | |
| (C1) All | $0.00$ $\pm 0.001$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | - | - | - | - | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ |
| (C2) Most | $0.12$ $\pm 0.018$ | $0.22$ $\pm 0.010$ | $0.00$ $\pm 0.000$ | 0.43 $\pm 0.010$ | - | - | - | - | $0.11$ $\pm 0.018$ | $0.15$ $\pm 0.025$ | $0.00$ $\pm 0.000$ | 0.67 $\pm 0.016$ |
| (D) Two competing | $0.43$ $\pm 0.017$ | $0.42$ $\pm 0.014$ | $0.46$ $\pm 0.016$ | 0.60 $\pm 0.010$ | - | - | - | - | $0.39$ $\pm 0.068$ | $0.39$ $\pm 0.047$ | $0.38$ $\pm 0.042$ | 0.83 $\pm 0.050$ |
| (E) Many answers | | | | | | | | | | | | |
| (E1) Non correct | $0.00$ $\pm 0.002$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | $0.00$ $\pm 0.000$ | - | - | - | - | - | - | - | - |
| (E2) Correct appears | $0.28$ $\pm 0.006$ | $0.28$ $\pm 0.008$ | $0.40$ $\pm 0.009$ | 0.52 $\pm 0.009$ | - | - | - | - | - | - | - | - |
## Appendix F Practical Guidance on Integrating Insights from this Paper into Model Development Workflows
The findings of this study reveal critical insights into the internal mechanisms of Large Language Models (LLMs) and their implications for truthfulness and error handling. To effectively incorporate these insights into model development, consider the following strategies:
Error Detection.
Focus on the representations of exact answer tokens to train the error detection probe. These tokens encode significant truthfulness signals and improve the reliability of error detection. The trained probe can then be integrated into the pipeline of a specific task, e.g., mathematical calculations. The probe provides a confidence score that can be used to warn the user about unreliable outputs, or to trigger an intervention that fixes the answer.
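Such a probe is typically a small linear classifier over hidden states. A minimal sketch follows, with logistic regression implemented directly in NumPy; the layer choice, token selection, and training details of the actual setup may differ:

```python
import numpy as np

def train_probe(hidden_states, labels, lr=0.1, epochs=500):
    """Train a logistic-regression probe on exact-answer-token states.

    `hidden_states`: (n, d) array of the hidden state at the exact
    answer token for n generations; `labels`: 1.0 if the generated
    answer was correct, else 0.0. A minimal illustrative sketch.
    """
    n, d = hidden_states.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(hidden_states @ w + b)))  # sigmoid
        grad = p - labels                                    # dL/dz
        w -= lr * (hidden_states.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_confidence(w, b, h):
    """Correctness confidence for hidden state(s) `h`."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))
```

In deployment, `probe_confidence` below a chosen threshold would raise a warning to the user or trigger a downstream intervention.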
Error-Specific Interventions.
The taxonomy of errors outlined in this study can be used to classify and analyze the types of errors an LLM may produce. Identifying these error types is useful for customizing error mitigation strategies. Probes for detecting error types can be deployed as part of the LLM pipeline, with interventions triggered by their predictions. For example, Retrieval Augmented Generation (RAG) (Lewis et al., 2020) can help with "consistently incorrect" errors, as can resampling and choosing the answer ranked highest by the error detection probe, or, where possible, a weight update as a more permanent solution. For "consistently correct" error types, an intervention on the LLM's internal representations can increase the confidence in generating a correct answer (Simhi et al., 2024).
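The routing described above can be expressed as a simple dispatch table; the mapping below is illustrative, pairing each predicted error type with one of the interventions discussed in the text:

```python
def mitigation_for(error_type):
    """Map a predicted error type to a mitigation strategy.

    Illustrative routing only; the intervention strings name the
    strategies discussed in the surrounding text.
    """
    routes = {
        "consistently incorrect": "retrieve external evidence (RAG) before answering",
        "two competing answers": "resample and pick the probe's top-ranked answer",
        "many different answers": "resample and pick the probe's top-ranked answer",
        "consistently correct": "steer internal representations to boost the correct answer",
    }
    return routes.get(error_type, "no intervention")
```

A pipeline would call this after the error-type probe's prediction and execute the returned strategy before surfacing an answer to the user.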
Cross-Task Generalization.
Universal generalization of probing classifiers across unrelated tasks should be approached with caution. The results in this work show that probes are mainly useful for task-specific error detection; a probe trained on one task may not transfer reliably to another.
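Before deploying a probe on a new task, the transfer gap can be measured directly by comparing in-task and cross-task AUC. A minimal sketch, with a rank-based AUC and hypothetical function names:

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability a positive outscores a negative."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

def transfer_gap(probe, in_task, out_task):
    """In-task minus cross-task probe AUC.

    `probe` maps hidden states to confidence scores; each task is a
    (hidden_states, labels) pair. A large positive gap signals the
    task-specific, non-universal truthfulness encoding discussed above.
    """
    (X_in, y_in), (X_out, y_out) = in_task, out_task
    return auc(probe(X_in), y_in) - auc(probe(X_out), y_out)
```

If the gap is large, the probe should be retrained (or at least recalibrated) on data from the target task rather than reused as-is.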