2404.15993
# Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach
**Authors**: Linyu Liu, Yu Pan, Xiaocheng Li, Guanting Chen
(University of North Carolina; Tsinghua University; HKUST(GZ); Imperial College London)
## Abstract
In this paper, we study the problem of uncertainty estimation and calibration for LLMs. We begin by formulating the uncertainty estimation problem, a relevant yet underexplored area in the existing literature. We then propose a supervised approach that leverages labeled datasets to estimate the uncertainty in LLMs' responses. Based on the formulation, we illustrate the difference between uncertainty estimation for LLMs and that for standard ML models, and explain why the hidden neurons of LLMs may contain uncertainty information. Our designed approach demonstrates the benefits of utilizing hidden activations to enhance uncertainty estimation across various tasks and shows robust transferability in out-of-distribution settings. We distinguish the uncertainty estimation task from the uncertainty calibration task and show that better uncertainty estimation leads to better calibration performance. Furthermore, our method is easy to implement and adaptable to different levels of model accessibility, including black box, grey box, and white box.
Equal contribution. Email addresses: linyuliu@unc.edu, yupan@hkust-gz.edu.cn, xiaocheng.li@imperial.ac.uk, guanting@unc.edu.
## 1 Introduction
Large language models (LLMs) have marked a significant milestone in the advancement of natural language processing (Radford et al., 2019; Brown et al., 2020; Ouyang et al., 2022; Bubeck et al., 2023), showcasing remarkable capabilities in understanding and generating human-like text. However, their tendency to produce hallucinations, i.e., misleading or fabricated information, raises concerns about their reliability and trustworthiness (Rawte et al., 2023). The problem of whether we should trust the response from machine learning models is critical in machine-assisted decision applications, such as self-driving cars (Ramos et al., 2017), medical diagnosis (Esteva et al., 2017), and loan approval processes (Burrell, 2016), where errors can lead to significant losses.
This issue becomes even more pressing in the era of generative AI, as the outputs of these models are random variables sampled from a distribution, meaning incorrect responses can still be produced with positive probability. Due to this inherent randomness, the need to address uncertainty estimation in generative AI is even greater than that in other machine learning models (Gal and Ghahramani, 2016; Lakshminarayanan et al., 2017; Guo et al., 2017; Minderer et al., 2021), and yet there has been limited research in this area (Kuhn et al., 2023; Manakul et al., 2023; Tian et al., 2023).
[Figure 1 depicts the following diagram: the user's question ("What's the capital of France?") is fed to the LLM, which randomly generates answers ("It's Paris" w.p. 0.5, "Paris" w.p. 0.4, "London" w.p. 0.1). An uncertainty estimation module analyzes the input, the LLM's activations, and the output, and assigns a confidence score to each answer (0.999 for "It's Paris", 0.999 for "Paris", and 0.1 for "London").]
Figure 1: An example to illustrate the uncertainty estimation task. The LLM randomly generates an answer to the question (It's Paris, Paris, or London). The goal of uncertainty estimation is to assign a confidence score to the question-answer pair, where a higher score indicates a higher confidence in the correctness of the answer.
In this work, we aim to formally define the problem of uncertainty estimation for LLMs and propose methods to address it. As shown in Figure 1, uncertainty estimation for LLMs can be broadly defined as the task of predicting the quality of the generated response based on the input. In this context, "quality" typically refers to aspects such as confidence, truthfulness, and uncertainty. Assuming access to a universal metric for evaluating the confidence of the output, the goal of uncertainty estimation is to produce a confidence score that closely aligns with this metric. Given the inherent randomness in LLMs, where incorrect responses can still be generated with positive probability, uncertainty estimation serves as a crucial safeguard. It helps assess the reliability of responses, enhance the trustworthiness of the model, and guide users on when to trust or question the output.
It is also worth noting that calibration is closely related and can be viewed as a subclass of uncertainty estimation, where the metric corresponds to the conditional probability at the individual level. Most studies on uncertainty estimation or calibration in language models focus on fixed-dimensional prediction tasks (i.e., the LLM's output is a single token from a finite set), such as sentiment analysis, natural language inference, and commonsense reasoning (Zhou et al., 2023; Si et al., 2022; Xiao et al., 2022; Desai and Durrett, 2020). However, given the structural differences in how modern LLMs are used, alongside their proven capability to handle complex, free-form tasks with variable-length outputs, there is a growing need to address uncertainty estimation and calibration specifically for general language tasks in the domain of LLMs.
This work explores a simple supervised method motivated by two ideas in the existing literature on LLMs. First, prior work on uncertainty estimation for LLMs primarily focused on designing uncertainty metrics in an unsupervised way by examining aspects like the generated outputs' consistency, similarity, entropy, and other relevant characteristics (Lin et al., 2023; Manakul et al., 2023; Kuhn et al., 2023; Hou et al., 2023; Lin et al., 2022; Chen et al., 2024). Because these metrics require no knowledge of the model's weights, they can be applied to black-box or grey-box models. Second, a growing stream of literature argues that the hidden layers' activation values within LLMs offer insights into the LLMs' knowledge and confidence (Slobodkin et al., 2023; Ahdritz et al., 2024; Duan et al., 2024). This idea has shown success in related areas, such as hallucination detection (CH-Wang et al., 2023; Azaria and Mitchell, 2023; Ahdritz et al., 2024). Based on this argument, white-box LLMs, which allow access to more of the LLM's internal states, such as logits and hidden layers, are believed to have the capacity to offer a more nuanced understanding and improved uncertainty estimation results (Verma et al., 2023; Chen et al., 2024; Plaut et al., 2024).
Both of the above approaches, however, have key limitations. For the unsupervised metrics, given the complexity of LLMs' underlying architectures, semantic information may be diluted when processed through self-attention mechanisms and during token encoding/decoding. For the second idea, the requirement of hidden-layer features prevents its application to closed-source/black-box LLMs. In this paper, we combine the strengths of these two ideas by proposing a general supervised learning method and pipeline design that address these limitations. Specifically, to incorporate more features (e.g., hidden layers) in estimating the uncertainty, we train an external uncertainty estimation model in a supervised way to estimate the uncertainty/confidence of the response generated from an LLM (the target LLM). As the quality of the response reveals to what extent we should believe the response is correct, we formulate this supervised uncertainty estimation problem as a regression task and prepare the labels in the training dataset by measuring the response's quality. To extend our method to black-box LLMs, we allow the semantic features of the question-response pair to come from another language model (the tool LLM). The overall pipeline of this method is shown in Figure 2.
[Figure 2 depicts the following diagram: the query x ("What's the capital of France?") is fed to the target LLM, which produces the generated response y ("It's Paris."). The response y is compared against the reference response ("Paris") by a quality metric (ROUGE-L/BLEU) to obtain the score s(y, y_true). In parallel, the tool LLM takes the query x and the generated response y as input; its hidden-layer activations and probability/entropy features are fed to the uncertainty estimator, which is trained to predict s(y, y_true).]
Figure 2: Illustration of our proposed supervised method. The tool LLM is an open-source LLM and can be different from the target LLM. In the training phase, where the reference response is available, we train the uncertainty estimator using the quality of the response as the label. In the test phase, the uncertainty estimator predicts the quality of the generated response to obtain an uncertainty score.
Our contributions are four-fold:
- First, we formally define the task of uncertainty estimation, whereas some of the existing literature either does not distinguish between uncertainty estimation and uncertainty calibration or conflates the terminology of uncertainty and hallucination.
- Second, we adopt a supervised method for uncertainty estimation that is intuitive, easy to implement, and executable even on black-box LLMs. Leveraging supervised labels from the uncertainty metric, our approach serves as an upper bound on the performance achievable by unsupervised methods.
- Third, we systematically discuss the relationship and the differences between uncertainty estimation for traditional deep learning models and for LLMs. We formally explain why methods designed for traditional deep learning models may fail for LLMs, and why the hidden layers are useful for estimating uncertainty in our context.
- Finally, numerical experiments on various natural language processing tasks demonstrate the superiority of our methods over existing benchmarks. The results also reveal several insightful observations, including the role of neural nodes in representing uncertainty, and the transferability of our trained uncertainty estimation model.
### 1.1 Related literature
The uncertainty estimation and calibration of traditional machine learning models are relatively well studied (Abdar et al., 2021; Gawlikowski et al., 2023). However, with the rapid development of LLMs, there is a pressing need to better understand the uncertainty in LLMs' responses, and measuring uncertainty for sentences instead of a fixed-dimension output is more challenging. One stream of work has focused on unsupervised methods that leverage entropy (Malinin and Gales, 2021), similarity (Fomicheva et al., 2020; Lin et al., 2022), semantics (Kuhn et al., 2023; Duan et al., 2023), or logit/hidden-state information (Kadavath et al., 2022; Chen et al., 2024; Su et al., 2024; Plaut et al., 2024) to craft an uncertainty metric that helps quantify uncertainty. For black-box models, some of these metrics can be computed based on multiple sampled outputs of the LLM (Malinin and Gales, 2021; Lin et al., 2023; Manakul et al., 2023; Chen and Mueller, 2023), while for white-box models, additional information, such as the output distribution and the values of the logits and hidden layers, makes computing the uncertainty metric easier. We also refer to Desai and Durrett (2020); Zhang et al. (2021); Ye and Durrett (2021); Si et al. (2022); Quach et al. (2023); Kumar et al. (2023); Mohri and Hashimoto (2024) for other related uncertainty estimation methods such as conformal prediction. We defer more discussion of related literature, in particular on the topics of hallucination detection and the information in hidden layers of LLMs, to Appendix A.
## 2 Problem Setup
Consider the following environment where one interacts with LLMs through prompts and responses: An LLM is given an input prompt $\bm{x}=(x_{1},x_{2},...,x_{k})\in\mathcal{X}$ with $x_{i}\in\mathcal{V}$ representing the $i$ -th token of the prompt. Here $\mathcal{V}$ denotes the vocabulary of all the tokens. Then the LLM randomly generates its response $\bm{y}=(y_{1},y_{2},...,y_{m})\in\mathcal{Y}$ following the probability distribution
$$
y_{j}\sim p_{\theta}(\cdot|\bm{x},y_{1},y_{2},...,y_{j-1}).
$$
Here the probability distribution $p_{\theta}$ denotes the distribution (over the vocabulary $\mathcal{V}$ ) of the LLM's output, and $\theta$ encapsulates all the parameters of the LLM. The conditional part includes the prompt $\bm{x}$ and all the tokens $y_{1},y_{2},...,y_{j-1}$ generated preceding the current position.
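For concreteness, the token-by-token sampling above can be sketched as follows. Here `p_theta` is a hypothetical stand-in for the LLM's next-token distribution (any callable returning a probability table over the vocabulary), not an actual model:

```python
import random

def sample_response(p_theta, prompt, eos_token, max_len=32, rng=None):
    """Autoregressively sample y_j ~ p_theta(. | x, y_1, ..., y_{j-1}).

    `p_theta` maps a token sequence to a dict {token: probability}
    over the vocabulary; sampling stops at the end-of-sequence token.
    """
    rng = rng or random.Random(0)
    response = []
    for _ in range(max_len):
        # condition on the prompt and all previously generated tokens
        dist = p_theta(tuple(prompt) + tuple(response))
        tokens, probs = zip(*sorted(dist.items()))
        y_j = rng.choices(tokens, weights=probs, k=1)[0]
        if y_j == eos_token:
            break
        response.append(y_j)
    return response
```

Because each $y_{j}$ is sampled rather than chosen deterministically, repeated calls can yield different responses to the same prompt, which is exactly the randomness the uncertainty estimation task must account for.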
We consider using the LLM for some downstream NLP tasks such as question answering, multiple choice, and machine translation. Such a task usually comes with an evaluation/scoring function $s(\cdot,\cdot):\mathcal{Y}\times\mathcal{Y}\rightarrow[0,1]$ that evaluates the quality of the generated response. For each pair $(\bm{x},\bm{y}),$ the evaluation function rates the response $\bm{y}$ with the score $z\coloneqq s(\bm{y},\bm{y}_{\text{true}})$ where $\bm{y}_{\text{true}}$ is the true response for the prompt $\bm{x}$ . The true response $\bm{y}_{\text{true}}$ is usually decided by factual truth, humans, or domain experts, and we can assume it follows a distribution conditional on the prompt $\bm{x}$ . Without loss of generality, a larger score represents a better answer: $z=1$ indicates a perfect answer, while $z=0$ indicates the response $\bm{y}$ is off target.
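As a concrete instance of the scoring function $s$, free-form tasks often use a ROUGE-L-style F-measure based on the longest common subsequence (one of the metrics shown in Figure 2); a minimal sketch over token lists:

```python
def rouge_l(candidate, reference):
    """ROUGE-L-style F1 in [0, 1] over two token lists, via LCS length."""
    m, n = len(candidate), len(reference)
    # dp[i][j] = LCS length of candidate[:i] and reference[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if candidate[i - 1] == reference[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)
```

For the example in Figure 1, `rouge_l(["it's", "paris"], ["paris"])` gives a partial-credit score of about 0.67, while `rouge_l(["london"], ["paris"])` gives 0.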
We define the task of uncertainty estimation for LLMs as the learning of a function $g$ that predicts the score
$$
g(\bm{x},\bm{y})\approx\mathbb{E}\left[s(\bm{y},\bm{y}_{\text{true}})|\bm{x},
\bm{y}\right] \tag{1}
$$
where the expectation on the right-hand side is taken with respect to the (possible) randomness of the true response $\bm{y}_{\text{true}}$ ; for notational clarity, we omit the dependence of $\bm{y}_{\text{true}}$ on $\bm{x}$ . We emphasize two points on this task definition: the uncertainty function $g$ takes both the prompt $\bm{x}$ and the response $\bm{y}$ as its inputs. This implies (i) the true and predicted uncertainty scores can and should depend on the specific realization of the response $\bm{y}$ , not just $\bm{x}$ (Zhang et al., 2021; Kuhn et al., 2023), and (ii) the uncertainty function $g$ does not require the true response $\bm{y}_{\text{true}}$ as an input.
We note that a significant body of literature explores uncertainty estimation and calibration in language models (Zhou et al., 2023; Si et al., 2022; Xiao et al., 2022; Desai and Durrett, 2020). They primarily focus on classification tasks where outputs are limited to a finite set of tokens (i.e., $\bm{y}$ contains only one element). In contrast, our work extends this to allow free-form responses, and the ability to handle variable-length outputs aligns more closely with current advancements in LLMs.
## 3 Uncertainty Estimation via Supervised Learning
### 3.1 Overview of supervised uncertainty estimation
We consider a supervised approach of learning the uncertainty function $g:\mathcal{X}\times\mathcal{Y}\rightarrow[0,1]$ , which is similar to the standard setting of uncertainty quantification for ML/deep learning models. First, we start with a raw dataset of $n$ samples
$$
\mathcal{D}_{\text{raw}}=\left\{(\bm{x}_{i},\bm{y}_{i},\bm{y}_{i,\text{true}},
s(\bm{y}_{i},\bm{y}_{i,\text{true}}))\right\}_{i=1}^{n}.
$$
$\mathcal{D}_{\text{raw}}$ can be generated based on a labeled dataset for the tasks we consider. Here $\bm{x}_{i}=(x_{i,1},...,x_{i,k_{i}})$ and $\bm{y}_{i}=(y_{i,1},...,y_{i,m_{i}})$ denote the prompt and the corresponding LLM's response, respectively. $\bm{y}_{i,\text{true}}$ denotes the true response (that comes from the labeled dataset) of $\bm{x}_{i}$ , and $s(\bm{y}_{i},\bm{y}_{i,\text{true}})$ assigns a score for the response $\bm{y}_{i}$ based on the true answer $\bm{y}_{i,\text{true}}$ .
The next is to formulate a supervised learning task based on $\mathcal{D}_{\text{raw}}$ . Specifically, we construct
$$
\mathcal{D}_{\text{sl}}=\left\{(\bm{v}_{i},z_{i})\right\}_{i=1}^{n}
$$
where $z_{i}\coloneqq s(\bm{y}_{i},\bm{y}_{i,\text{true}})\in[0,1]$ denotes the target score to be predicted. The vector $\bm{v}_{i}$ summarizes useful features for the $i$ -th sample based on $(\bm{x}_{i},\bm{y}_{i})$ . With this design, a supervised learning task on the dataset $\mathcal{D}_{\text{sl}}$ coincides exactly with learning the uncertainty estimation task defined in (1).
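The construction of $\mathcal{D}_{\text{sl}}$ from $\mathcal{D}_{\text{raw}}$ is a direct mapping; a sketch, where `featurize` is a hypothetical feature extractor standing in for the tool LLM:

```python
def build_sl_dataset(raw, featurize):
    """Map D_raw = [(x, y, y_true, z), ...] to D_sl = [(v, z), ...].

    `featurize(x, y)` returns the feature vector v_i (in our setting,
    hidden activations and entropy features from the tool LLM);
    z = s(y, y_true) is the precomputed quality score, so the true
    response is no longer needed after this step.
    """
    return [(featurize(x, y), z) for x, y, _y_true, z in raw]
```

Note that only $(\bm{v}_{i},z_{i})$ survives the mapping: the supervised learner never sees $\bm{y}_{i,\text{true}}$, consistent with the definition of $g$ in (1).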
Getting Features. When constructing $\bm{v}_{i}$ , a natural implementation is to use the features of $(\bm{x},\bm{y})$ extracted from the LLM (denoted as the target LLM) that generates the response $\bm{y}$ , as done in Duan et al. (2024) for hallucination detection and Burns et al. (2022) for discovering latent knowledge. This method functions effectively with white-box LLMs where hidden activations are accessible. We note that obtaining the hidden layers' activations merely requires an LLM and the prompt-response pair $(\bm{x},\bm{y})$ , and the extra knowledge of uncertainty can come from the hidden layers of any white-box LLM that takes the $(\bm{x},\bm{y})$ pair as input, not necessarily from the target LLM.
Another note is that our goal is to measure the uncertainty of the input-output pair $(\bm{x},\bm{y})$ using the given metric, which is independent of the target LLM that generates the output from input $\bm{x}$ . Therefore, due to the unique structure of LLMs, any white-box LLM can take $(\bm{x},\bm{y})$ together as input, allowing us to extract features from this white-box LLM (referred to as the tool LLM).
This observation has two implications: First, if the target LLM is a black-box one, we can rely on a white-box tool LLM to extract features; second, even if the target LLM is a white-box one, we can adopt a more powerful white-box tool LLM that could potentially generate more useful features. In Algorithm 1, we present the algorithm of our pipeline, which is applicable to target LLMs of any type, and we provide an illustration of the pipeline in Figure 2.
Algorithm 1 Supervised uncertainty estimation
1: Input: target LLM $p_{\theta}$ (the uncertainty of which is to be estimated), tool LLM $q_{\theta}$ (used for uncertainty estimation), a labeled training dataset $\mathcal{D}$ , a test sample with prompt $\bm{x}$
2: %% Training phase:
3: Use $p_{\theta}$ to generate responses for the samples in $\mathcal{D}$ and construct the dataset $\mathcal{D}_{\text{raw}}$
4: For each sample $(\bm{x}_{i},\bm{y}_{i})\in\mathcal{D}_{\text{raw}}$ , extract features (hidden-layer activations, entropy- and probability-related features) using the LLM $q_{\theta}$ , and then construct the dataset $\mathcal{D}_{\text{sl}}$
5: Train a supervised learning model $\hat{g}$ that predicts $z_{i}$ with $\bm{v}_{i}$ based on the dataset $\mathcal{D}_{\text{sl}}$
6: %% Test phase:
7: Generate the response $\bm{y}$ for the test prompt $\bm{x}$
8: Extract features $\bm{v}$ using $q_{\theta}$
9: Associate the response $\bm{y}$ with the uncertainty score $\hat{g}(\bm{v})$
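The training step (line 5) can use any regression model. As a self-contained illustration, a pure-Python ridge least-squares fit stands in for the estimator $\hat{g}$; the actual choice of model (e.g., a random forest or a small MLP) is an implementation detail we do not fix here:

```python
def fit_linear(X, z, ridge=1e-3):
    """Ridge least-squares fit of g_hat(v) = w . v + b on D_sl via the
    normal equations (a stand-in for any supervised estimator)."""
    A = [list(row) + [1.0] for row in X]   # append a bias column
    d, n = len(A[0]), len(A)
    # G = A^T A + ridge * I, rhs = A^T z
    G = [[sum(A[i][p] * A[i][q] for i in range(n)) + (ridge if p == q else 0.0)
          for q in range(d)] for p in range(d)]
    rhs = [sum(A[i][p] * z[i] for i in range(n)) for p in range(d)]
    # Gaussian elimination followed by back substitution
    for p in range(d):
        for q in range(p + 1, d):
            f = G[q][p] / G[p][p]
            for r in range(p, d):
                G[q][r] -= f * G[p][r]
            rhs[q] -= f * rhs[p]
    w = [0.0] * d
    for p in reversed(range(d)):
        w[p] = (rhs[p] - sum(G[p][q] * w[q] for q in range(p + 1, d))) / G[p][p]
    return w

def predict(w, v):
    """Uncertainty score g_hat(v) for a feature vector v."""
    return sum(wi * vi for wi, vi in zip(w, list(v) + [1.0]))
```

At test time (lines 7-9), the same `predict` is applied to the features $\bm{v}$ extracted by $q_{\theta}$ to produce the uncertainty score $\hat{g}(\bm{v})$.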
### 3.2 Features for uncertainty estimation
The literature suggests that a number of features extractable from an LLM relate to the measurement of uncertainty. Here we categorize these features into two types based on their sources:
White-box features: the LLM's hidden-layer activations. We feed $(\bm{x}_{i},\bm{y}_{i})$ as input into the tool LLM and extract the corresponding hidden layers' activations.
Grey-box features: Entropy- or probability-related outputs. The entropy of a discrete distribution $p$ over the vocabulary $\mathcal{V}$ is defined by $H(p)\coloneqq-\sum_{v\in\mathcal{V}}p(v)\log\left(p(v)\right).$ For a prompt-response pair $(\bm{x},\bm{y})=(x_{1},...,x_{k},y_{1},...,y_{m})$ , we consider as the features the entropy at each token such as $H(q_{\theta}(\cdot|x_{1},...,x_{j-1}))$ and $H(q_{\theta}(\cdot|\bm{x},y_{1},...,y_{j-1}))$ where $q_{\theta}$ denotes the tool LLM. We defer the detailed discussions on feature construction to Appendix D.
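The grey-box entropy features can be sketched as follows; `token_dists` stands for the tool LLM's per-token output distributions $q_{\theta}(\cdot|\cdot)$, and the particular summary statistics are illustrative choices rather than the exact feature set of Appendix D:

```python
import math

def entropy(p):
    """H(p) = -sum_v p(v) * log p(v) for a distribution over the vocabulary."""
    return -sum(q * math.log(q) for q in p if q > 0)

def entropy_features(token_dists):
    """Per-token entropies H(q_theta(. | x, y_1, ..., y_{j-1})) plus
    simple summary statistics usable as grey-box features."""
    hs = [entropy(p) for p in token_dists]
    return {"per_token": hs, "mean": sum(hs) / len(hs),
            "max": max(hs), "last": hs[-1]}
```

A near-deterministic next-token distribution contributes entropy close to 0, while a uniform distribution over $k$ tokens contributes $\log k$, so these features directly measure how "unsure" the tool LLM is at each generation step.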
There can be other useful features, such as asking the LLM "how certain it is about the response" (Tian et al., 2023). We do not try to exhaust all the possibilities; the aim of our paper is more about formulating uncertainty estimation for LLMs as a supervised task and understanding how the internal states of the LLM encode uncertainty. To the best of our knowledge, our paper is the first to do so. Specifically, the above formulation aims for the following two outcomes: (i) an uncertainty model $\hat{g}(\bm{v}_{i})$ that predicts $z_{i}$ and (ii) knowing whether the hidden layers carry the uncertainty information.
### 3.3 Three regimes of supervised uncertainty estimation
In Section 3.1, we showed that our supervised uncertainty estimation method can be extended to a black-box LLM by separating the target LLM and the tool LLM. Next, we formally present our method for white-box, grey-box, and black-box target LLMs.
White-box supervised uncertainty estimation (Wb-S): This Wb-S approach is applicable to a white-box LLM where the tool LLM coincides with the target LLM (i.e., $p_{\theta}=q_{\theta}$ ).
Grey-box supervised uncertainty estimation (Gb-S): This Gb-S regime also uses the same target and tool LLMs ( $p_{\theta}=q_{\theta}$ ) and constructs the features only from the grey-box source, that is, those features relying on the probability and the entropy (such as those in Table 5 in Appendix D), but it ignores the hidden-layer activations.
Black-box supervised uncertainty estimation (Bb-S): The Bb-S regime does not assume the knowledge of the parameters of $p_{\theta}$ but still aims to estimate its uncertainty. To achieve this, it considers another open-source LLM denoted by $q_{\theta}$ . The original data $\mathcal{D}_{\text{raw}}$ is generated by $p_{\theta}$ but then the uncertainty estimation data $\mathcal{D}_{\text{sl}}$ is constructed based on $q_{\theta}$ from $\mathcal{D}_{\text{raw}}$ as illustrated in the following diagram
$$
\mathcal{D}_{\text{raw}}\overset{q_{\theta}}{\longrightarrow}\mathcal{D}_{
\text{sl}}.
$$
For example, for a prompt $\bm{x}$ , a black-box LLM $p_{\theta}$ generates the response $\bm{y}.$ We utilize the open-source LLM $q_{\theta}$ to treat $(\bm{x},\bm{y})$ jointly as a sequence of (prompt) tokens and extract the features of hidden activations and entropy as in Section 3.2. In this way, we use $q_{\theta}$ together with the learned uncertainty model from $\mathcal{D}_{\text{sl}}$ to estimate the uncertainty of responses generated from $p_{\theta}$ which we do not have any knowledge about.
## 4 Insights for the algorithm design
### 4.1 Uncertainty estimation vs. uncertainty calibration
So far, we have focused on the uncertainty estimation task, which aims to predict the quality of the response to reveal whether the LLM makes mistakes in its response. There is a different but related task known as the uncertainty calibration problem. In comparison, uncertainty calibration aims to ensure that the output of the uncertainty estimation model in (1) conveys a probabilistic meaning; that is, $g(\bm{x},\bm{y})$ is defined as the probability that $\bm{y}$ is true. This is compatible with our method by replacing the quality $s(\bm{y},\bm{y}_{\text{true}})$ with $1\left\{\bm{y}\in\mathcal{Y}_{\text{true}}\right\}$ , where $\mathcal{Y}_{\text{true}}$ is a set containing all the possible true responses. Another aspect of the relation between our uncertainty estimation method and uncertainty calibration is that our method can be followed by any recalibration method for ML models to form a pipeline for calibration. Intuitively, a better uncertainty estimation/prediction will lead to a better-calibrated uncertainty model, which is also verified in our numerical experiments in Appendix C.
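As one example of such a recalibration step, Platt scaling fits a logistic map from the estimator's scores to correctness probabilities on held-out (score, 0/1 correctness) pairs; a minimal gradient-descent sketch, with illustrative hyperparameters:

```python
import math

def platt_scale(scores, labels, lr=0.5, steps=2000):
    """Platt scaling: fit p(correct) = sigmoid(a*s + b) by minimizing
    the cross-entropy on held-out (uncertainty score, correctness) pairs."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n   # d(loss)/da
            gb += (p - y) / n       # d(loss)/db
        a -= lr * ga
        b -= lr * gb
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))
```

The map is monotone in the score, so recalibration changes the probabilistic interpretation of the scores without changing their ranking, which is why a better-ranked uncertainty estimate yields a better-calibrated model.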
### 4.2 Why hidden layers as features?
In this subsection, we provide a simple theoretical explanation for why the hidden activations of the LLM can be useful in uncertainty estimation. Consider a binary classification task where the features $\bm{X}\in\mathbb{R}^{d}$ and the label $Y\in\{0,1\}$ are drawn from a distribution $\mathcal{P}.$ We aim to learn a model $f:\mathbb{R}^{d}\rightarrow[0,1]$ that predicts the label $Y$ from the feature vector $\bm{X}$ , and the learning of the model employs a loss function $l(\cdot,\cdot):[0,1]\times[0,1]\rightarrow\mathbb{R}$ .
**Proposition 4.1**
*Let $\mathcal{F}$ be the class of measurable functions that map from $\mathbb{R}^{d}$ to $[0,1]$ . Under the cross-entropy loss $l(y,\hat{y})=-\left(y\log(\hat{y})+(1-y)\log(1-\hat{y})\right)$ , the function $f^{*}$ that minimizes the loss
$$
f^{*}=\operatorname*{arg\,min}_{f\in\mathcal{F}}\mathbb{E}\left[l(Y,f(\bm{X}))\right]
$$
is the Bayes optimal classifier $f^{*}(\bm{x})=\mathbb{P}(Y=1|\bm{X}=\bm{x})$ where the expectation and the probability are taken with respect to $(\bm{X},Y)\sim\mathcal{P}.$ Moreover, the following conditional independence holds
$$
Y\perp\bm{X}\ |\ f^{*}(\bm{X}).
$$*
The proposition is not technical and can be proved directly using the structure of $f^{*}(\bm{X})$ ; we refer to Berger (2013) for the proof. It states a nice property of the cross-entropy loss: the function learned under the cross-entropy loss coincides with the Bayes optimal classifier. Note that this is contingent on two requirements. First, the function class $\mathcal{F}$ is the class of all measurable functions. Second, the function $f^{*}$ must be learned through the population loss rather than the empirical loss/risk. The proposition goes one step further with the conditional independence $Y\perp\bm{X}\ |\ f^{*}(\bm{X})$ : all the information about the label $Y$ contained in $\bm{X}$ is summarized in the prediction $f^{*}(\bm{X})$ . This intuition suggests that for classic uncertainty estimation problems, when a prediction model $\hat{f}:\mathbb{R}^{d}\rightarrow[0,1]$ is well trained, the predicted score $\hat{f}(\bm{X})$ should capture all the information about the true label $Y$ contained in the features $\bm{X}$ , without needing the raw features $\bm{X}$ themselves. This indeed explains why the classic uncertainty estimation and calibration methods work only with the predicted score $\hat{f}(\bm{X})$ for re-calibration, including Platt scaling (Platt et al., 1999), isotonic regression (Zadrozny and Elkan, 2002), temperature scaling (Guo et al., 2017), etc.
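A quick numerical check of the proposition's pointwise claim: under the cross-entropy loss, the minimizer of the conditional risk at a point with $\mathbb{P}(Y=1|\bm{X}=\bm{x})=p$ is $p$ itself. The toy grid search below (our illustration; the value of $p$ and the grid are arbitrary choices) verifies this:

```python
import numpy as np

def expected_ce(p, yhat):
    # Conditional risk E[l(Y, yhat) | X = x] under the cross-entropy loss,
    # when P(Y = 1 | X = x) = p.
    return -(p * np.log(yhat) + (1 - p) * np.log(1 - yhat))

p = 0.3                                  # assumed P(Y = 1 | X = x)
grid = np.linspace(0.001, 0.999, 999)    # candidate predictions yhat
best = grid[np.argmin(expected_ce(p, grid))]
# The pointwise minimizer recovers the Bayes optimal prediction p.
```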
When it comes to uncertainty estimation for LLMs, where the task differs from calibration and the model structure is much more complex, the conditional independence no longer holds, and additional procedures are required to retrieve more information about $Y$ . The following supporting corollary states that when the underlying loss function $\tilde{l}$ does not possess this nice property of the cross-entropy loss (that the Bayes classifier minimizes the loss point-wise), the conditional independence can collapse.
**Corollary 4.2**
*Suppose the loss function $\tilde{l}$ satisfies
$$
\mathbb{P}\left(f^{*}(\bm{x})\neq\operatorname*{arg\,min}_{\tilde{y}\in[0,1]}
\mathbb{E}\left[\tilde{l}(Y,\tilde{y})|\bm{X}=\bm{x}\right]\right)>0,
$$
where $f^{*}$ is defined as in Proposition 4.1. Then there exists a distribution $\mathcal{P}$ such that, for the function $\tilde{f}=\operatorname*{arg\,min}_{f\in\mathcal{F}}\mathbb{E}\left[\tilde{l}(Y,f(\bm{X}))\right],$ where the expectation is with respect to $(\bm{X},Y)\sim\mathcal{P},$ the conditional independence no longer holds
$$
Y\not\perp\bm{X}\ |\ \tilde{f}(\bm{X}).
$$*
Proposition 4.1 and Corollary 4.2 together illustrate the difference between uncertainty estimation for a traditional ML model and that for LLMs. In this task, the output $\tilde{f}(\bm{X})$ of the model (traditional ML model or LLM) is restricted to $[0,1]$ to indicate the confidence that $Y=1$ . For traditional ML models, the cross-entropy loss, which is commonly used for training, is aligned with the uncertainty calibration objective. For LLMs, the objective can differ from calibration, and the models are often pretrained with other loss functions (for example, the negative log-likelihood loss for next-token prediction) on diverse language tasks beyond binary classification. These factors cause a misalignment between the model pre-training and the uncertainty estimation task. Consequently, the original features (e.g., the output logits) may and should (in theory) contain information about the uncertainty score $Y$ that cannot be fully captured by $\tilde{f}(\bm{X})$ . This justifies why we formulate the uncertainty estimation task as in the previous subsection and take the hidden-layer activations as features to predict the uncertainty score; it also explains why we do not see similar treatment in the mainstream uncertainty estimation literature (Kuhn et al., 2023; Manakul et al., 2023; Tian et al., 2023).
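To make Corollary 4.2 concrete, consider the absolute loss $\tilde{l}(y,\tilde{y})=|y-\tilde{y}|$ : its pointwise minimizer is the conditional median of $Y$ (0 or 1), not $\mathbb{P}(Y=1|\bm{X}=\bm{x})$ , so the learned $\tilde{f}(\bm{X})$ discards information that $\bm{X}$ still carries about $Y$ . A toy computation (our illustration, not from the paper):

```python
import numpy as np

p = 0.3                          # assumed P(Y = 1 | X = x)
grid = np.linspace(0.0, 1.0, 1001)

# Conditional risk of the absolute loss:
#   E[|Y - yhat| | X = x] = p * (1 - yhat) + (1 - p) * yhat.
abs_risk = p * (1 - grid) + (1 - p) * grid
best_abs = grid[np.argmin(abs_risk)]

# For p < 0.5 the risk is increasing in yhat, so the minimizer is the
# conditional median 0, not p: the Bayes probability is not recovered.
```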
## 5 Numerical Experiments and Findings
In this section, we provide a systematic evaluation of the proposed supervised approach for estimating the uncertainty of the LLMs. All code used in our experiments is available at https://github.com/LoveCatc/supervised-llm-uncertainty-estimation.
### 5.1 LLMs, tasks, benchmarks, and performance metrics
Here we outline the general setup of the numerical experiments. Certain tasks may deviate from the general setup, and we will detail the specific adjustments as needed.
LLMs. For our numerical experiments, we mainly consider three open-source LLMs, LLaMA2-7B (Touvron et al., 2023), LLaMA3-8B (AI@Meta, 2024) and Gemma-7B (Gemma Team et al., 2024) as $p_{\theta}$ defined in Section 2. For certain experiments, we also employ the models of LLaMA2-13B and Gemma-2B. We also use their respective tokenizers as provided by Hugging Face. We do not change the parameters/weights $\theta$ of these LLMs.
Tasks and Datasets. We mainly consider three tasks for uncertainty estimation: question answering, multiple choice, and machine translation. All the labeled datasets for these tasks are in the form of $\{(\bm{x}_{i},\bm{y}_{i,\text{true}})\}_{i=1}^{n}$ where $\bm{x}_{i}$ can be viewed as the prompt for the $i$ -th sample and $\bm{y}_{i,\text{true}}$ the true response. We adopt few-shot prompting when generating the LLM's response $\bm{y}_{i}$ , using 5 examples in the prompt for the multiple-choice task and 3 examples for the remaining natural language generation tasks. This engages the LLM's in-context learning ability (Radford et al., 2019; Zhang et al., 2023) and ensures the LLM's responses are in a desirable format. We defer more details of the few-shot prompting to Appendix D.1. The three tasks are:
- Question answering. We follow Kuhn et al. (2023) and use the CoQA and TriviaQA (Joshi et al., 2017) datasets. The CoQA task requires the LLM to answer questions by understanding the provided text, while TriviaQA requires the LLM to answer questions from its pre-training knowledge. We adopt Rouge-1 (Lin and Och, 2004a) as the scoring function $s(\cdot,\cdot)$ and label a response $\bm{y}_{i}$ as correct if $s(\bm{y}_{i},\bm{y}_{i,\text{true}})\geq 0.3$ and incorrect otherwise.
- Multiple choice. We consider the Massive Multitask Language Understanding (MMLU) dataset (Hendrycks et al., 2020), a collection of 15,858 questions covering 57 subjects across STEM, the humanities, the social sciences, and more. Due to the special structure of the dataset, both the generated output $\bm{y}_{i}$ and the correct answer $\bm{y}_{i,\text{true}}$ belong to $\{\text{A, B, C, D}\}$ . Therefore, this task can also be regarded as a classification problem in which the LLM answers the question with one of the four candidate choices.
- Machine translation. We consider the WMT 2014 dataset (Bojar et al., 2014) for estimating the LLM's uncertainty on the machine translation task. The scoring function $s(\cdot,\cdot)$ is chosen to be the BLEU score (Papineni et al., 2002; Lin and Och, 2004b), and the generated answer $\bm{y}_{i}$ is labeled as correct if $s(\bm{y}_{i},\bm{y}_{i,\text{true}})>0.3$ and incorrect otherwise.
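As an illustration of the labeling step shared by these tasks, the sketch below scores a response with a simplified Rouge-1 recall and thresholds it at 0.3. A real pipeline would use a full Rouge/BLEU implementation, so `rouge1_recall` and `label_response` are hypothetical stand-ins:

```python
def rouge1_recall(candidate, reference):
    """Simplified Rouge-1: the fraction of distinct reference unigrams that
    appear in the candidate (a stand-in for a full Rouge implementation)."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(1 for w in ref if w in cand)
    return overlap / len(ref)

def label_response(response, true_response, threshold=0.3):
    # Label the response as correct (1) when the score clears the threshold.
    return int(rouge1_recall(response, true_response) >= threshold)
```

These binary labels are the targets on which the supervised uncertainty model is trained.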
Benchmarks. We compare our approach with a number of state-of-the-art benchmarks for the problem. Manakul et al. (2023) give a comprehensive survey of the existing methods and compare four distinct measures for predicting sentence-generation uncertainty. The measures are based on either the maximum or average values of entropy or probability across the sentence: Max Likelihood (MaxL), Avg Likelihood (AvgL), Max Ent (MaxE), and Avg Ent (AvgE), defined in Table 5. Each of these measures can be applied as a single uncertainty estimator, and they are all applied in an unsupervised manner that requires no additional supervised training. In particular, when applying these measures to the MMLU dataset, since the answer contains only one token from $\{\text{A, B, C, D}\}$ , we use the probability and the entropy (over these four tokens) as the benchmarks, representing the probability of the most likely choice and the entropy over all choices, respectively. Kuhn et al. (2023) generate multiple answers, compute their entropy in a semantic sense, and define this quantity as semantic entropy; the resulting semantic-entropy uncertainty (SU) can be used as an uncertainty estimator for the LLM's responses. Tian et al. (2023) propose asking the LLM for its confidence (denoted as A4C), which obtains the uncertainty score directly from the LLM itself.
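Table 5 is deferred to the appendix; as a hedged illustration, one common way to compute these four single-score measures from per-token next-token distributions is sketched below (the variable names, the synthetic distributions, and the exact parameterization are our assumptions, not the paper's definitions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-token next-token distributions for one generated
# response: T tokens, each a distribution over a vocabulary of size V.
T, V = 6, 10
probs = rng.dirichlet(np.ones(V), size=T)          # shape (T, V)
chosen = probs.argmax(axis=1)                      # ids of generated tokens

token_logp = np.log(probs[np.arange(T), chosen])   # per-token log-likelihood
token_ent = -(probs * np.log(probs)).sum(axis=1)   # per-token entropy

# Larger values indicate higher uncertainty under all four measures.
measures = {
    "MaxL": float(-token_logp.min()),   # largest negative log-likelihood
    "AvgL": float(-token_logp.mean()),  # average negative log-likelihood
    "MaxE": float(token_ent.max()),     # largest per-token entropy
    "AvgE": float(token_ent.mean()),    # average per-token entropy
}
```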
Our methods. Following the discussion in Section 3.3, we implement three versions of our proposed supervised approach: black-box supervised (Bb-S), grey-box supervised (Gb-S), and white-box supervised (Wb-S). These versions share the same pipeline for training the uncertainty estimation model and differ only in the assumed access to the LLM. For the Bb-S method, we use Gemma-7B as the model $q_{\theta}$ to evaluate the uncertainty of LLaMA2-7B/LLaMA3-8B as $p_{\theta}$ (treated as a black box) and, conversely, use LLaMA2-7B to evaluate Gemma-7B. The supervised uncertainty model $\hat{g}$ is a random forest (Breiman, 2001). Details on the feature construction and the training of the random forest model are deferred to Appendix D.2.
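A minimal sketch of this shared training pipeline on synthetic data follows; the feature dimensions, dataset sizes, and hyperparameters are illustrative assumptions, and the paper's actual feature construction is in Appendix D.2:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Hypothetical features: hidden activations (Wb-S) or probability/entropy
# summaries (Gb-S), paired with binary correctness labels.
X = rng.normal(size=(400, 16))
z = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Train the supervised uncertainty model g-hat as a random forest.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[:300], z[:300])

# The predicted probability of correctness serves as the uncertainty score
# for held-out prompt-response pairs.
scores = model.predict_proba(X[300:])[:, 1]
```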
Performance metrics. For model evaluation, we follow Filos et al. (2019); Kuhn et al. (2023) and compare our methods against the benchmarks by using the generated uncertainty score to predict whether the answer is correct. We employ the area under the receiver operating characteristic curve (AUROC) to measure the performance of uncertainty estimation. As discussed in Section 4.1, AUROC is a suitable metric for the uncertainty estimation task; for the uncertainty calibration task, we follow the more standard calibration metrics and present the results in Appendix C.
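As a sketch of the evaluation step, AUROC can be computed directly from correctness labels and uncertainty scores via its rank-statistic (Mann-Whitney) interpretation; the helper below is our illustration, not the paper's evaluation code:

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen correct response
    outscores a randomly chosen incorrect one (ties count as half)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Count pairwise wins of positives over negatives.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

A score of 0.5 corresponds to random guessing and 1.0 to perfect separation of correct from incorrect responses.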
### 5.2 Performance of uncertainty estimation
Now we present the performance on the uncertainty estimation task.
#### 5.2.1 Question answering and machine translation
The question answering and machine translation tasks can both be viewed as natural language generation tasks, so we present their results together. Table 1 compares the three versions of our proposed supervised method against the existing benchmarks in terms of AUROC.
| Dataset | LLM | MaxL | AvgL | MaxE | AvgE | SU | A4C | Bb-S | Gb-S | Wb-S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TriviaQA | G-7B | 0.857 | 0.862 | 0.849 | 0.854 | 0.847 | 0.534 | 0.879 | 0.866 | 0.882 |
| TriviaQA | L-7B | 0.565 | 0.761 | 0.761 | 0.773 | 0.678 | 0.526 | 0.925 | 0.811 | 0.897 |
| TriviaQA | L-8B | 0.838 | 0.851 | 0.849 | 0.853 | 0.826 | 0.571 | 0.843 | 0.861 | 0.874 |
| CoQA | G-7B | 0.710 | 0.708 | 0.725 | 0.708 | 0.674 | 0.515 | 0.737 | 0.737 | 0.762 |
| CoQA | L-7B | 0.535 | 0.600 | 0.603 | 0.580 | 0.541 | 0.502 | 0.848 | 0.667 | 0.807 |
| CoQA | L-8B | 0.692 | 0.697 | 0.716 | 0.699 | 0.684 | 0.506 | 0.745 | 0.737 | 0.769 |
| WMT-14 | G-7B | 0.668 | 0.589 | 0.637 | 0.811 | 0.572 | 0.596 | 0.863 | 0.829 | 0.855 |
| WMT-14 | L-7B | 0.606 | 0.712 | 0.583 | 0.711 | 0.513 | 0.506 | 0.792 | 0.724 | 0.779 |
| WMT-14 | L-8B | 0.554 | 0.685 | 0.616 | 0.729 | 0.510 | 0.502 | 0.700 | 0.724 | 0.745 |
Table 1: Out-of-sample AUROC performance for benchmarks and our methods on natural language generation tasks. G-7B, L-7B, and L-8B represent Gemma-7B, LLaMA2-7B, and LLaMA3-8B, respectively. The columns MaxL, AvgL, MaxE, and AvgE all come from Manakul et al. (2023). The column SU implements the semantic uncertainty estimation by Kuhn et al. (2023), and the column A4C implements the ask-for-confidence method by Tian et al. (2023). The columns Bb-S, Gb-S, and Wb-S represent respectively the three regimes (black-box supervised, grey-box supervised, and white-box supervised) of our supervised method, with details in Section 3.3.
We make several remarks on the numerical results. First, our methods generally outperform the existing benchmarks. Note that the existing benchmarks are mainly unsupervised and based on a single score, while our method follows the most standard pipeline for supervised training of an uncertainty estimation model. The advantage of our method should thus be attributed to its supervised nature and the labeled dataset. While the unsupervised benchmark methods can work in a larger scope than these NLP tasks (though they have not been extensively tested on open questions), our methods rely on labeled data. Beyond the better numbers, the experiment results show the potential of labeled datasets for understanding the uncertainty in LLM responses. In particular, our Gb-S method uses features that include the benchmark measures, and it shows that even modest supervised training improves substantially upon ad-hoc uncertainty estimation based on a single score such as MaxL or MaxE.
Second, our Wb-S method has a clear advantage over our Gb-S method. These two methods differ in that Wb-S uses the hidden activations while Gb-S uses only probability-related (and entropy-related) features. This implies that the hidden activations do contain uncertainty information, which we investigate further in Appendix B. We also note from the table that no single unsupervised grey-box method (under the Benchmarks columns) consistently surpasses the others across datasets/NLP tasks. For example, among the unsupervised benchmark methods for grey-box LLMs, AvgE emerges as a top performer for the Gemma-7B model on the machine translation task, but it shows the poorest performance for the same model on the question-answering CoQA dataset. This inconsistency highlights some caveats of the unsupervised approach to uncertainty estimation for LLMs.
Lastly, we note that the Bb-S method performs similarly to, or even better than, the Wb-S method. As discussed in Section 3.3, the performance of uncertainty estimation relies on the LLM used to evaluate the prompt-response pair. Therefore, it is not surprising that, in the question-answering task, for answers generated by LLaMA2-7B, Bb-S yields better uncertainty estimation than Wb-S, possibly because Gemma-7B, the "tool LLM" in Algorithm 1, encodes better knowledge about the uncertainty of the answers than LLaMA2-7B. We also note that Bb-S does not always match Wb-S, and we hypothesize that this is because the two LLMs' output distributions differ, which can result in evaluating the uncertainty of different answers. Despite these inconsistencies, the performance of Bb-S is still strong, and these results point to a potential future avenue for estimating the uncertainty of closed-source LLMs.
#### 5.2.2 Multiple choice (MMLU)
Table 2 presents the performance of our methods against the benchmark methods on the MMLU dataset. For this multiple-choice task, the output is from {A, B, C, D}, which bears no semantic meaning, and we therefore do not include semantic uncertainty (SU) as in Table 1. The results show the advantage of our proposed supervised approach, consistent with the previous findings in Table 1.
| LLM | Probability | Entropy | A4C | Bb-S | Gb-S | Wb-S |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma-7B | 0.712 | 0.742 | 0.582 | 0.765 | 0.776 | 0.833 |
| LLaMA2-7B | 0.698 | 0.693 | 0.514 | 0.732 | 0.698 | 0.719 |
| LLaMA3-8B | 0.781 | 0.791 | 0.516 | 0.766 | 0.793 | 0.830 |
Table 2: Out-of-sample AUROC performance for benchmarks and our methods on the MMLU dataset. The columns Probability and Entropy come from Manakul et al. (2023), and the column A4C implements the ask-for-confidence method by Tian et al. (2023). The columns Bb-S, Gb-S, and Wb-S represent respectively the three regimes (black-box supervised, grey-box supervised, and white-box supervised) of our supervised method with details in Section 3.3.
We defer more numerical experiments and visualization to Appendices B and C where we investigate more on (i) the effect of the choice of layers; (ii) the scale of the LLMs used; (iii) the uncertainty neurons of the LLMs; and (iv) the calibration performance.
### 5.3 Transferability
In this subsection, we evaluate the robustness of our methods in the out-of-distribution (OOD) setting.
Setup for the OOD multiple-choice task. We split the MMLU datasets into two groups based on the subjects: Group 1 contains questions from the first 40 subjects while Group 2 contains the remaining 17 subjects, such that the test dataset size of each group is similar (around 600 questions). Note that these 57 subjects span a diverse range of topics, and this means the training and test set can be very different. To test the OOD robustness, we train the proposed methods on one group and evaluate the performance on the other group.
Setup for the OOD question-answering task. For the QA task, since we have two datasets (CoQA and TriviaQA), we train the supervised model on either the TriviaQA or CoQA dataset and then evaluate its performance on the other dataset. While both datasets are for question-answering purposes, they diverge notably in two key aspects: (i) CoQA prioritizes assessing the LLM's comprehension through the discernment of correct responses within extensive contextual passages, while TriviaQA focuses on evaluating the model's recall of factual knowledge. (ii) TriviaQA typically contains answers comprising single words or short phrases, while CoQA includes responses of varying lengths, ranging from shorter to more extensive answers.
| LLMs | Test data | Bb-S | Gb-S | Wb-S | Best GB | Best BB |
| --- | --- | --- | --- | --- | --- | --- |
| *Transferability in MMLU* | | | | | | |
| G-7B | Group 1 | 0.756(0.768) | 0.793(0.799) | 0.846(0.854) | 0.765 | 0.538 |
| G-7B | Group 2 | 0.738(0.760) | 0.755(0.754) | 0.804(0.807) | 0.721 | 0.616 |
| L-7B | Group 1 | 0.733(0.749) | 0.715(0.713) | 0.726(0.751) | 0.719 | 0.504 |
| L-7B | Group 2 | 0.700(0.714) | 0.676(0.677) | 0.685(0.692) | 0.679 | 0.529 |
| L-8B | Group 1 | 0.763(0.773) | 0.796(0.795) | 0.836(0.839) | 0.799 | 0.524 |
| L-8B | Group 2 | 0.729(0.761) | 0.786(0.785) | 0.794(0.818) | 0.782 | 0.507 |
| *Transferability in Question-Answering Datasets* | | | | | | |
| G-7B | TriviaQA | 0.842(0.879) | 0.861(0.866) | 0.861(0.882) | 0.862 | 0.847 |
| G-7B | CoQA | 0.702(0.737) | 0.722(0.737) | 0.730(0.762) | 0.725 | 0.674 |
| L-7B | TriviaQA | 0.917(0.925) | 0.801(0.811) | 0.881(0.897) | 0.773 | 0.678 |
| L-7B | CoQA | 0.825(0.848) | 0.623(0.667) | 0.764(0.807) | 0.603 | 0.541 |
| L-8B | TriviaQA | 0.813(0.843) | 0.859(0.861) | 0.863(0.874) | 0.853 | 0.826 |
| L-8B | CoQA | 0.710(0.745) | 0.714(0.737) | 0.725(0.769) | 0.716 | 0.684 |
Table 3: Transferability of the trained uncertainty estimation model across different groups of subjects in MMLU and question-answering datasets. For our proposed Bb-S, Gb-S, and Wb-S methods, values within the parentheses $(\cdot)$ represent the AUROCs where the uncertainty estimation model is trained and tested on the same group of subjects or dataset, while values outside the parentheses represent models trained on another group of subjects or dataset. The Best GB and Best BB columns refer to the best AUROC achieved by the unsupervised grey-box baselines and black-box baselines (fully listed in Table 1 and Table 2), respectively.
Table 3 summarizes the performance of these OOD experiments. As expected, all methods show a slight performance drop compared to the in-distribution setting (reported in parentheses in the table). We make the following observations based on the experiment results. First, based on the gap between in-distribution and OOD evaluation, incorporating white-box features such as hidden activations makes the model somewhat more susceptible to performance drops on OOD tasks, but these features also enhance the uncertainty estimation model's overall capacity, and the benefits outweigh the drawbacks. It is also noteworthy that even in these OOD scenarios, our Wb-S and Bb-S methods almost consistently outperform the corresponding baseline approaches. Overall, the robustness of our methods suggests that the hidden-layer activations within the LLM encode uncertainty information in similar patterns, at least to some extent. The performance drop (from in-distribution to OOD) on the MMLU dataset is notably smaller than that on the question-answering datasets, which may stem from the larger disparity between CoQA and TriviaQA compared to that between two distinct groups of subjects within the same MMLU dataset. This suggests that under significant distributional shifts, re-training or re-calibrating the uncertainty estimation model on data from the test distribution may be helpful.
## 6 Conclusions
In this paper, we study the problem of uncertainty estimation and calibration for LLMs. We follow a simple and standard supervised idea and use labeled NLP datasets to train an uncertainty estimation model for LLMs. Our findings are that, first, the proposed supervised methods perform better than the existing unsupervised methods. Second, the hidden activations of the LLMs contain uncertainty information about the LLMs' responses. Third, the black-box regime of our approach (Bb-S) provides a new way to estimate the uncertainty of closed-source LLMs. Lastly, we distinguish the task of uncertainty estimation from uncertainty calibration and show that a better uncertainty estimation model leads to better calibration performance. One limitation of our proposed supervised method is that it critically relies on labeled data. For the scope of our paper, we restrict the discussion to NLP tasks and datasets. One future direction is to utilize human-annotated data on LLMs' responses to train a supervised uncertainty estimation model for open-question prompts. We believe the findings that the supervised method gives better performance and that the hidden activations contain uncertainty information will persist.
## References
- Abdar et al. (2021) Abdar, Moloud, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U Rajendra Acharya, et al. 2021. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion 76 243–297.
- Ahdritz et al. (2024) Ahdritz, Gustaf, Tian Qin, Nikhil Vyas, Boaz Barak, Benjamin L Edelman. 2024. Distinguishing the knowable from the unknowable with language models. arXiv preprint arXiv:2402.03563.
- AI@Meta (2024) AI@Meta. 2024. Llama 3 model card URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
- Azaria and Mitchell (2023) Azaria, Amos, Tom Mitchell. 2023. The internal state of an llm knows when it's lying. arXiv preprint arXiv:2304.13734.
- Berger (2013) Berger, J.O. 2013. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics, Springer New York. URL https://books.google.nl/books?id=1CDaBwAAQBAJ.
- Bojar et al. (2014) Bojar, Ondřej, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2014. Findings of the 2014 workshop on statistical machine translation. Proceedings of the ninth workshop on statistical machine translation. 12–58.
- Breiman (2001) Breiman, Leo. 2001. Random forests. Machine Learning 45 5–32.
- Brown et al. (2020) Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 1877–1901.
- Bubeck et al. (2023) Bubeck, Sébastien, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
- Burns et al. (2022) Burns, Collin, Haotian Ye, Dan Klein, Jacob Steinhardt. 2022. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.
- Burrell (2016) Burrell, J. 2016. How the machine 'thinks': Understanding opacity in machine learning algorithms. Big Data & Society.
- CH-Wang et al. (2023) CH-Wang, Sky, Benjamin Van Durme, Jason Eisner, Chris Kedzie. 2023. Do androids know they're only dreaming of electric sheep? arXiv preprint arXiv:2312.17249.
- Chen et al. (2024) Chen, Chao, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, Jieping Ye. 2024. Inside: Llmsâ internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744 .
- Chen and Mueller (2023) Chen, Jiuhai, Jonas Mueller. 2023. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness .
- Desai and Durrett (2020) Desai, Shrey, Greg Durrett. 2020. Calibration of pre-trained transformers. arXiv preprint arXiv:2003.07892 .
- Duan et al. (2024) Duan, Hanyu, Yi Yang, Kar Yan Tam. 2024. Do llms know about hallucination? an empirical investigation of llmâs hidden states. arXiv preprint arXiv:2402.09733 .
- Duan et al. (2023) Duan, Jinhao, Hao Cheng, Shiqi Wang, Chenan Wang, Alex Zavalny, Renjing Xu, Bhavya Kailkhura, Kaidi Xu. 2023. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv preprint arXiv:2307.01379 .
- Esteva et al. (2017) Esteva, Andre, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, Sebastian Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639) 115–118.
- Filos et al. (2019) Filos, Angelos, Sebastian Farquhar, Aidan N Gomez, Tim GJ Rudner, Zachary Kenton, Lewis Smith, Milad Alizadeh, Arnoud de Kroon, Yarin Gal. 2019. Benchmarking bayesian deep learning with diabetic retinopathy diagnosis. Preprint at https://arxiv.org/abs/1912.10481.
- Fomicheva et al. (2020) Fomicheva, Marina, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, Lucia Specia. 2020. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics 8 539–555.
- Gal and Ghahramani (2016) Gal, Yarin, Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. International conference on machine learning. PMLR, 1050–1059.
- Gawlikowski et al. (2023) Gawlikowski, Jakob, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxiang Feng, Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, et al. 2023. A survey of uncertainty in deep neural networks. Artificial Intelligence Review 56 (Suppl 1) 1513–1589.
- Gemma Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, et al. 2024. Gemma doi: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301.
- Guo et al. (2017) Guo, Chuan, Geoff Pleiss, Yu Sun, Kilian Q Weinberger. 2017. On calibration of modern neural networks. International conference on machine learning. PMLR, 1321–1330.
- Hendrycks et al. (2020) Hendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 .
- Hou et al. (2023) Hou, Bairu, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, Yang Zhang. 2023. Decomposing uncertainty for large language models through input clarification ensembling. arXiv preprint arXiv:2311.08718 .
- Joshi et al. (2017) Joshi, Mandar, Eunsol Choi, Daniel S Weld, Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 .
- Kadavath et al. (2022) Kadavath, Saurav, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 .
- Kuhn et al. (2023) Kuhn, Lorenz, Yarin Gal, Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664 .
- Kumar et al. (2023) Kumar, Bhawesh, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, Andrew Beam. 2023. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404 .
- Lakshminarayanan et al. (2017) Lakshminarayanan, Balaji, Alexander Pritzel, Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems 30.
- Li et al. (2024) Li, Kenneth, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg. 2024. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems 36.
- Lin and Och (2004a) Lin, Chin-Yew, Franz Josef Och. 2004a. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. Proceedings of the 42nd annual meeting of the association for computational linguistics (ACL-04). 605–612.
- Lin and Och (2004b) Lin, Chin-Yew, Franz Josef Och. 2004b. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics. COLING, Geneva, Switzerland, 501–507. URL https://www.aclweb.org/anthology/C04-1072.
- Lin et al. (2023) Lin, Zhen, Shubhendu Trivedi, Jimeng Sun. 2023. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187 .
- Lin et al. (2022) Lin, Zi, Jeremiah Zhe Liu, Jingbo Shang. 2022. Towards collaborative neural-symbolic graph semantic parsing via uncertainty. Findings of the Association for Computational Linguistics: ACL 2022 .
- Liu et al. (2023) Liu, Kevin, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas. 2023. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? arXiv preprint arXiv:2312.03729 .
- Malinin and Gales (2021) Malinin, Andrey, Mark Gales. 2021. Uncertainty estimation in autoregressive structured prediction. International Conference on Learning Representations. URL https://openreview.net/forum?id=jN5y-zb5Q7m.
- Manakul et al. (2023) Manakul, Potsawee, Adian Liusie, Mark JF Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896 .
- Mielke et al. (2022) Mielke, Sabrina J, Arthur Szlam, Emily Dinan, Y-Lan Boureau. 2022. Reducing conversational agents' overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics 10 857–872.
- Minderer et al. (2021) Minderer, Matthias, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic. 2021. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems 34 15682–15694.
- Mohri and Hashimoto (2024) Mohri, Christopher, Tatsunori Hashimoto. 2024. Language models with conformal factuality guarantees. arXiv preprint arXiv:2402.10978.
- Ouyang et al. (2022) Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 27730–27744.
- Papineni et al. (2002) Papineni, Kishore, Salim Roukos, Todd Ward, Wei jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. 311â318.
- Pedregosa et al. (2011) Pedregosa, Fabian, GaĂ«l Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in python. the Journal of machine Learning research 12 2825â2830.
- Platt et al. (1999) Platt, John, et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10 (3) 61â74.
- Plaut et al. (2024) Plaut, Benjamin, Khanh Nguyen, Tu Trinh. 2024. Softmax probabilities (mostly) predict large language model correctness on multiple-choice q&a. arXiv preprint arXiv:2402.13213 .
- Quach et al. (2023) Quach, Victor, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S Jaakkola, Regina Barzilay. 2023. Conformal language modeling. arXiv preprint arXiv:2306.10193 .
- Radford et al. (2019) Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1 (8) 9.
- Ramos et al. (2017) Ramos, Sebastian, Stefan Gehrig, Peter Pinggera, Uwe Franke, Carsten Rother. 2017. Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling. 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 1025–1032.
- Rawte et al. (2023) Rawte, Vipula, Amit Sheth, Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922 .
- Si et al. (2022) Si, Chenglei, Chen Zhao, Sewon Min, Jordan Boyd-Graber. 2022. Re-examining calibration: The case of question answering. arXiv preprint arXiv:2205.12507 .
- Slobodkin et al. (2023) Slobodkin, Aviv, Omer Goldman, Avi Caciularu, Ido Dagan, Shauli Ravfogel. 2023. The curious case of hallucinatory (un)answerability: Finding truths in the hidden states of over-confident large language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 3607–3625.
- Su et al. (2024) Su, Weihang, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, Yiqun Liu. 2024. Unsupervised real-time hallucination detection based on the internal states of large language models. arXiv preprint arXiv:2403.06448 .
- Tian et al. (2023) Tian, Katherine, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975 .
- Touvron et al. (2023) Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 .
- Verma et al. (2023) Verma, Shreyas, Kien Tran, Yusuf Ali, Guangyu Min. 2023. Reducing llm hallucinations using epistemic neural networks. arXiv preprint arXiv:2312.15576 .
- Xiao et al. (2022) Xiao, Yuxin, Paul Pu Liang, Umang Bhatt, Willie Neiswanger, Ruslan Salakhutdinov, Louis-Philippe Morency. 2022. Uncertainty quantification with pre-trained language models: A large-scale empirical analysis. arXiv preprint arXiv:2210.04714 .
- Xu et al. (2024) Xu, Ziwei, Sanjay Jain, Mohan Kankanhalli. 2024. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817 .
- Ye and Durrett (2021) Ye, Xi, Greg Durrett. 2021. Can explanations be useful for calibrating black box models? arXiv preprint arXiv:2110.07586 .
- Zadrozny and Elkan (2002) Zadrozny, Bianca, Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 694–699.
- Zhang et al. (2023) Zhang, Hanlin, Yi-Fan Zhang, Yaodong Yu, Dhruv Madeka, Dean Foster, Eric Xing, Hima Lakkaraju, Sham Kakade. 2023. A study on the calibration of in-context learning. arXiv preprint arXiv:2312.04021 .
- Zhang et al. (2021) Zhang, Shujian, Chengyue Gong, Eunsol Choi. 2021. Knowing more about questions can help: Improving calibration in question answering. arXiv preprint arXiv:2106.01494 .
- Zhou et al. (2023) Zhou, Han, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, Subhrajit Roy. 2023. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. arXiv preprint arXiv:2309.17249 .
## Appendix A More Related Literature
Hallucination detection.
Recently, there has been a trend of adopting uncertainty estimation approaches for hallucination detection. The rationale is that the logits and the hidden states encode some of the LLM's beliefs about the trustworthiness of its generated output. By taking the activations of hidden layers as input, Azaria and Mitchell (2023) train a classifier to predict hallucinations, and Verma et al. (2023) develop epistemic neural networks aimed at reducing hallucinations. Slobodkin et al. (2023) demonstrate that information from the hidden layers of LLMs can indicate the answerability of an input query, providing indirect insights into hallucination occurrences. Chen et al. (2024) develop an unsupervised metric that leverages the internal states of LLMs to perform hallucination detection. More related works on hallucination detection can be found in CH-Wang et al. (2023); Duan et al. (2024); Xu et al. (2024). While hallucination lacks a rigorous definition and its usage varies across the above-mentioned literature, the uncertainty estimation problem can be well defined, and our results on uncertainty estimation can also help the task of hallucination detection.
Leveraging LLMs' hidden activations.
The hidden states within LLMs have been explored to better understand LLMs' behavior. Mielke et al. (2022) improve the linguistic calibration performance of a controllable chit-chat model by fine-tuning it with a calibrator trained on the hidden states, and Burns et al. (2022) utilize hidden activations in an unsupervised way to represent knowledge about the truthfulness of the LLMs' outputs. Liu et al. (2023) show that LLMs' linguistic outputs and their internal states can offer conflicting information about truthfulness, and that whether the outputs or the internal states are the more reliable source often varies from one scenario to another. Taking the activations of hidden layers as input, Ahdritz et al. (2024) employ a linear probe to show that hidden-layer information from LLMs can be used to differentiate between epistemic and aleatoric uncertainty. Duan et al. (2024) experimentally reveal the variations in hidden-layer activations when LLMs generate true versus false responses in their hallucination detection task. Lastly, Li et al. (2024) enhance the truthfulness of LLMs during inference time by adjusting the hidden activations' values in specific directions.
We also remark on the following two aspects:
- Fine-tuning: For all the numerical experiments in this paper, we do not perform any fine-tuning of the underlying LLMs. While fine-tuning generally boosts an LLM's performance on a downstream task, our methods can still be applied to a fine-tuned LLM, which we leave as future work.
- Hallucination: The hallucination problem has been widely studied in the LLM literature. Yet, as mentioned earlier, there seems to be no consensus on a rigorous definition of what hallucination refers to in the context of LLMs. For example, when an image classifier wrongly classifies a cat image as a dog, we do not say the classifier hallucinates; why, then, should we say an LLM hallucinates when it makes a mistake? Comparatively, the uncertainty estimation problem is better defined, and we provide a mathematical formulation of the uncertainty estimation task for LLMs. We also believe our results on uncertainty estimation can contribute to a better understanding of the hallucination phenomenon and of tasks such as hallucination detection.
## Appendix B Interpreting the Uncertainty Estimation
We now use visualizations to provide insights into the working mechanism of the uncertainty estimation procedure for LLMs and to better understand the experiment results presented earlier.
### B.1 Layer comparison
For general LLMs, each token is associated with a relatively large number of hidden layers (32 layers for LLaMA2-7B, for example), each represented by a high-dimensional vector (4096 dimensions for LLaMA2-7B). Due to this high dimensionality, it is generally not good practice to incorporate all hidden layers as features for uncertainty estimation. Previous works find that the middle-layer and last-layer activations of the LLM's last token contain the most useful features for supervised learning (Burns et al., 2022; Chen et al., 2024; Ahdritz et al., 2024; Azaria and Mitchell, 2023). To investigate the layer-wise effect on uncertainty estimation, we implement our Wb-S method with features that differ in two aspects: (i) the layer within the LLM architecture, specifically the middle and last layers (LLaMA2-7B and LLaMA3-8B: 16th and 32nd layers out of 32, with 4096 dimensions; Gemma-7B: 14th and 28th layers out of 28, with 3072 dimensions); and (ii) the position of the token activations, either averaging the hidden activations over all the prompt/answer tokens or using the hidden activation of the last token. The second aspect is meaningful only when the output contains more than one token, so we conduct this experiment on the natural language generation tasks only. Figure 3 visualizes the comparison. While these feature extraction choices perform quite similarly across tasks and LLMs, activation features from the middle layer generally outperform those from the last layer. This may be because the last layer focuses more on generating the next token than on summarizing information of the whole sentence, as discussed by Azaria and Mitchell (2023).
Figure 3: Performance comparison of using hidden activations from different tokens and layers as features in the Wb-S method. The bars filled with "/" and "." represent the activations averaged over the answer tokens and the hidden activation of the last token, respectively. The green and orange bars denote the activations from the middle and the last layer, respectively.
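The four feature choices compared in Figure 3 reduce to two independent decisions: which layer to read and how to pool over tokens. A minimal sketch, assuming the per-layer, per-token activations are already available as a single array (the function name and toy shapes here are illustrative, not the paper's code):

```python
import numpy as np

def extract_features(hidden_states, layer="mid", pooling="last"):
    """Turn per-layer, per-token hidden activations into one feature vector.

    hidden_states: array of shape (n_layers, n_tokens, d_model), e.g. the
        activations a decoder LLM produces for one prompt/answer pair.
    layer: "mid" or "last" -- which layer's activations to use.
    pooling: "last" (last token) or "avg" (mean over all tokens).
    """
    n_layers = hidden_states.shape[0]
    idx = n_layers // 2 if layer == "mid" else n_layers - 1
    acts = hidden_states[idx]          # (n_tokens, d_model)
    if pooling == "last":
        return acts[-1]                # hidden state of the last token
    return acts.mean(axis=0)           # average over all tokens

# Toy example: 32 layers, 10 tokens, 8-dim activations
# (LLaMA2-7B would be 32 layers x 4096 dims).
h = np.random.default_rng(0).normal(size=(32, 10, 8))
feat = extract_features(h, layer="mid", pooling="last")
```

The resulting vector `feat` is what the supervised estimator consumes as input; the four bars per dataset in Figure 3 correspond to the four `(layer, pooling)` combinations.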
### B.2 Scaling effect
In Figure 4, we investigate whether larger LLMs' hidden activations enhance our uncertainty estimation method. For a fair comparison, we fix the target LLM that generates the output in Algorithm 1 and vary the tool LLM used for the analysis. For example, in the left plot of Figure 4, we use Gemma-7B to generate the outputs, and LLaMA2-7B, LLaMA2-13B, and Gemma-7B to perform the uncertainty estimation.
Figure 4: (Left) Using the hidden activations of LLaMA2-7B and LLaMA2-13B to estimate the uncertainty of the answer provided by Gemma-7B. (Middle) Using the hidden activations of Gemma-2B and Gemma-7B to estimate the uncertainty of the answer provided by LLaMA2-7B. (Right) Using the hidden activations of Gemma-2B and Gemma-7B to estimate the uncertainty of the answer provided by LLaMA3-8B.
We find that larger LLMs do encode better knowledge about uncertainty, which we attribute to their improved knowledge in answering the questions. We also note that, when using Gemma to predict LLaMA2-7B, even a small tool LLM (Gemma-2B) achieves better performance than Gb-S, which only uses the entropy- and probability-related features from the target LLM. This result underscores the benefit of using internal states to estimate uncertainty, even when they come from an LLM different from the one generating the answers.
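The cross-model setup above amounts to a supervised probe: features come from the tool LLM, while the correctness labels come from grading the target LLM's answers. The sketch below fits a plain logistic-regression probe on synthetic features; it illustrates only this supervised step, and the function names and toy data are hypothetical rather than the paper's exact training pipeline:

```python
import numpy as np

def train_uncertainty_estimator(feats, labels, lr=0.1, steps=500):
    """Fit a logistic probe: tool-LLM features -> P(target answer correct).

    feats: (n_samples, d) hidden-activation features from the tool LLM.
    labels: (n_samples,) 1.0 if the target LLM's answer was judged correct.
    Returns a function mapping a feature matrix to estimated probabilities.
    """
    X = np.hstack([feats, np.ones((feats.shape[0], 1))])   # append bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))                   # sigmoid
        w -= lr * X.T @ (p - labels) / len(labels)         # log-loss gradient
    def predict(f):
        Xf = np.hstack([f, np.ones((f.shape[0], 1))])
        return 1.0 / (1.0 + np.exp(-(Xf @ w)))
    return predict

# Toy data: correctness is determined by the first feature dimension.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(float)
predict = train_uncertainty_estimator(X, y)
scores = predict(X)   # estimated P(correct) for each response
```

In the actual experiments the probe's scores would then be evaluated with AUROC against held-out correctness labels, which is what the bars in Figure 4 report.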
### B.3 Histogram of correlations
Figure 5 plots the histograms of the pairwise correlations between the neuron activations and the labels (whether the LLM's response is correct). We make two observations. First, for all LLMs, some neurons have a significantly positive (or negative) correlation with the label. We can interpret these as uncertainty neurons for the corresponding task: when they are activated, the LLM is uncertain about its response. Second, Gemma-7B and LLaMA3-8B have more significant neurons than LLaMA2-7B, which is consistent with the better performance of Gemma-7B and LLaMA3-8B in Table 1 and Table 2. This also reinforces that the hidden activations of the LLMs contain uncertainty information about the LLM's output.
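The quantity histogrammed in Figure 5 is, for each neuron, the Pearson correlation between its activation across responses and the binary correctness label. A minimal sketch on synthetic activations (the planted "uncertainty neuron" and all shapes are illustrative):

```python
import numpy as np

def neuron_label_correlations(acts, labels):
    """Pearson correlation of each neuron's activation with the 0/1 labels.

    acts: (n_samples, n_neurons) hidden activations, one row per response.
    labels: (n_samples,) 1.0 if the response was correct, 0.0 otherwise.
    Returns an (n_neurons,) vector whose histogram mirrors Figure 5.
    """
    a = acts - acts.mean(axis=0)          # center each neuron
    yc = labels - labels.mean()           # center the labels
    num = a.T @ yc
    den = np.sqrt((a ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=500).astype(float)
acts = rng.normal(size=(500, 100))
acts[:, 0] += 2 * y                       # plant one "uncertainty neuron"
corr = neuron_label_correlations(acts, y)
```

Neurons landing in the tails of the resulting histogram are the significantly correlated ones discussed above; the bulk of neurons cluster near zero.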
<details>
<summary>x6.png Details</summary>

Condensed figure description: three histograms arranged in a row (x-axis roughly $-0.25$ to $0.25$, y-axis counts from 0 to 120), one per model. LLaMA2-7B (blue) and LLaMA3-8B (red) are unimodal and tightly peaked near zero, with peak counts of about 105 and 115 respectively; Gemma-7B (green) is much flatter and wider, with peak counts of only about 50-55 and the greatest dispersion. All three distributions are roughly symmetric about zero.
</details>
Figure 5: The histograms of the pairwise correlations on the TriviaQA task between the neuron activations and the labels (whether the LLM's response is correct), where the neuron values are the last-token hidden activations of answers from the middle layer (upper) and the last layer (lower) of the three models respectively.
Figure 6 plots the activations of some example neurons, selected as those with the largest absolute correlations in Figure 5. More neurons from the last layer can be found in Figure 7. Each of these neurons, taken as an individual indicator, exhibits a different distributional pattern when the response is correct than when it is incorrect, and thus reflects the uncertainty of the LLM's responses.
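The neuron screening behind these figures, ranking hidden neurons by the absolute (point-biserial) correlation between their activations and the 0/1 correctness labels, can be sketched in numpy as follows (function and variable names are ours):

```python
import numpy as np

def top_correlated_neurons(acts, labels, k=4):
    """Rank neurons by |Pearson correlation| between their last-token
    hidden activations and the binary correctness labels.

    acts:   (n_samples, n_neurons) hidden activations
    labels: (n_samples,) 1 if the LLM's response is correct, else 0
    Returns the indices of the k most correlated neurons and the full
    correlation vector.
    """
    acts = np.asarray(acts, dtype=float)
    y = np.asarray(labels, dtype=float)
    a = acts - acts.mean(axis=0)          # center each neuron
    b = y - y.mean()                      # center the labels
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum()) + 1e-12
    corr = (a * b[:, None]).sum(axis=0) / denom
    order = np.argsort(-np.abs(corr))
    return order[:k], corr
```

Plotting the activation histograms of the returned neurons, split by label, reproduces the style of Figures 6 and 7.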
<details>
<summary>x7.png Details</summary>

Condensed figure description: a 3x4 grid of histograms (rows: LLaMA-2-7B, LLaMA-3-8B, Gemma-7B; columns: four selected neurons per model, e.g. the 3961-th, 394-th, 490-th, and 2635-th neurons for LLaMA-2-7B). Each panel overlays the activation distribution for correct responses ("true answer", blue) and incorrect responses ("false answer", red). In every panel the two distributions differ visibly but overlap substantially, so no single neuron perfectly discriminates correct from incorrect responses. The activation ranges also differ across models: roughly $-2$ to $6$ for LLaMA-2-7B, within about $\pm 1$ for LLaMA-3-8B, and within about $\pm 0.5$ (often $\pm 0.1$) for Gemma-7B.
</details>
Figure 6: Distribution of values from particular neurons of mid-layers on TriviaQA dataset.
<details>
<summary>x8.png Details</summary>

Condensed figure description: a 3x4 grid of histograms in the same format as x7.png, for last-layer neurons (e.g. the 2021-th, 149-th, 3556-th, and 2672-th neurons for LLaMA-2-7B). In every panel the "true answer" (blue) and "false answer" (red) activation distributions differ while overlapping substantially; the 3556-th neuron of LLaMA-2-7B shows the strongest separation between the two classes, with a sharp true-answer peak near $-30$. The activation centers vary widely across neurons and models, from about $-30$ to $+7$.
</details>
Figure 7: More distribution of values from specific neurons of last layers on the TriviaQA dataset. The plots are obtained in the same way as Figure 6.
### B.4 Proof of Proposition 4.1
The proof of Proposition 4.1 follows from the definition of $f^{*}$.
## Appendix C Calibration performance
In Section 4.1, we distinguish the two tasks of uncertainty estimation and uncertainty calibration. Throughout the paper, we have focused on improving performance on the task of uncertainty estimation, i.e., predicting when the LLM is uncertain about its response. Generally, a better uncertainty estimation model leads to better calibration performance. Indeed, the calibration (or recalibration) of the uncertainty estimation model can be reduced to the classic ML setting, which does not involve the LLM. Table 4 gives the calibration performance, and we see an advantage of our supervised methods over the benchmark methods, consistent with the AUROC performance in Table 1. We adopt the histogram binning method here because we find that temperature scaling and Platt scaling concentrate all predicted scores within a small range such as $[0.2,0.6]$. We also do not exclude the possibility that other calibration methods can give even better performance. The point here is that uncertainty estimation and uncertainty calibration are two closely related tasks. Note that (i) a better uncertainty estimation model leads to better calibration performance, and (ii) the LLMs are pretrained and not designed for these NLP tasks in the first place (see Section 4.2), so there is no uncertainty score readily available (unlike the predicted probabilities of image classifiers). We therefore emphasize the importance of an extra uncertainty estimation procedure, such as our supervised one, to extract the uncertainty information from inside the LLM.
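The histogram binning recalibration described above can be sketched as follows (function names are ours; the experiments use 20 equal-length bins over $[0,1]$, as stated in the Table 4 caption):

```python
import numpy as np

def histogram_binning(scores_cal, labels_cal, n_bins=20):
    """Fit a histogram-binning recalibrator: split [0, 1] into n_bins
    equal-length bins and map every score in a bin to the empirical
    accuracy of the calibration samples falling in that bin."""
    scores_cal = np.asarray(scores_cal, dtype=float)
    y = np.asarray(labels_cal, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores_cal, edges[1:-1]), 0, n_bins - 1)
    bin_acc = np.full(n_bins, 0.5)  # neutral fallback for empty bins
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            bin_acc[b] = y[mask].mean()

    def transform(scores):
        j = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
        return bin_acc[j]

    return transform
```

The returned `transform` replaces each raw uncertainty score (already scaled to $[0,1]$) with a calibrated probability of correctness.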
| Metric | Dataset | Model | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLL | TriviaQA | G-7B | 0.478 | 0.500 | 0.428 | 0.472 | 0.739 | 8.710 | 0.414 | 0.467 | 0.392 |
| | | L-7B | 1.155 | 0.551 | 0.575 | 0.600 | 1.481 | 21.119 | 0.338 | 0.580 | 0.388 |
| | | L-8B | 0.483 | 0.407 | 0.383 | 0.401 | 0.719 | 8.515 | 0.423 | 0.467 | 0.365 |
| | CoQA | G-7B | 0.778 | 0.474 | 0.469 | 0.476 | 0.632 | 8.106 | 0.474 | 0.497 | 0.457 |
| | | L-7B | 1.047 | 0.620 | 0.637 | 0.649 | 1.358 | 11.708 | 0.417 | 0.607 | 0.457 |
| | | L-8B | 0.823 | 0.502 | 0.508 | 0.499 | 0.762 | 8.007 | 0.551 | 0.535 | 0.507 |
| | WMT-14 | G-7B | 9.674 | 1.266 | 0.809 | 0.618 | 0.701 | 17.933 | 0.454 | 0.463 | 0.449 |
| | | L-7B | 1.204 | 1.150 | 0.718 | 0.809 | 0.796 | 16.913 | 0.553 | 0.622 | 0.583 |
| | | L-8B | 1.490 | 0.752 | 0.652 | 0.676 | 0.722 | 21.340 | 0.649 | 0.673 | 0.612 |
| ECE | TriviaQA | G-7B | 0.152 | 0.138 | 0.066 | 0.115 | 0.275 | 0.253 | 0.056 | 0.075 | 0.067 |
| | | L-7B | 0.437 | 0.068 | 0.048 | 0.146 | 0.188 | 0.616 | 0.043 | 0.087 | 0.049 |
| | | L-8B | 0.171 | 0.082 | 0.046 | 0.081 | 0.196 | 0.283 | 0.107 | 0.087 | 0.075 |
| | CoQA | G-7B | 0.356 | 0.054 | 0.112 | 0.064 | 0.221 | 0.237 | 0.121 | 0.129 | 0.113 |
| | | L-7B | 0.397 | 0.065 | 0.105 | 0.073 | 0.174 | 0.494 | 0.052 | 0.071 | 0.038 |
| | | L-8B | 0.339 | 0.031 | 0.071 | 0.033 | 0.196 | 0.312 | 0.156 | 0.110 | 0.122 |
| | WMT-14 | G-7B | 0.499 | 0.464 | 0.234 | 0.197 | 0.072 | 0.521 | 0.097 | 0.063 | 0.073 |
| | | L-7B | 0.164 | 0.389 | 0.065 | 0.269 | 0.127 | 0.491 | 0.045 | 0.090 | 0.101 |
| | | L-8B | 0.318 | 0.192 | 0.051 | 0.142 | 0.029 | 0.618 | 0.145 | 0.201 | 0.137 |
| Brier | TriviaQA | G-7B | 0.282 | 0.221 | 0.224 | 0.215 | 0.344 | 0.279 | 0.266 | 0.288 | 0.282 |
| | | L-7B | 0.431 | 0.241 | 0.271 | 0.259 | 0.322 | 0.645 | 0.334 | 0.322 | 0.315 |
| | | L-8B | 0.262 | 0.192 | 0.204 | 0.188 | 0.291 | 0.373 | 0.258 | 0.265 | 0.255 |
| | CoQA | G-7B | 0.318 | 0.174 | 0.188 | 0.171 | 0.232 | 0.241 | 0.207 | 0.218 | 0.212 |
| | | L-7B | 0.395 | 0.233 | 0.242 | 0.230 | 0.265 | 0.464 | 0.296 | 0.256 | 0.276 |
| | | L-8B | 0.338 | 0.197 | 0.201 | 0.191 | 0.255 | 0.359 | 0.258 | 0.242 | 0.248 |
| | WMT-14 | G-7B | 0.505 | 0.454 | 0.330 | 0.319 | 0.247 | 0.606 | 0.327 | 0.287 | 0.309 |
| | | L-7B | 0.313 | 0.413 | 0.271 | 0.334 | 0.275 | 0.502 | 0.296 | 0.277 | 0.288 |
| | | L-8B | 0.343 | 0.279 | 0.250 | 0.263 | 0.246 | 0.620 | 0.282 | 0.300 | 0.284 |
Table 4: Calibration performance on natural language generation tasks after histogram binning. The base models are from Table 1. The original uncertainty scores from the base models are first scaled into $[0,1]$ and then a histogram binning is performed with 20 bins of equal length.
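The three calibration metrics reported in Table 4 can be computed as follows (a minimal numpy sketch with our function name; ECE uses the same equal-length binning as the recalibration step):

```python
import numpy as np

def calibration_metrics(p, y, n_bins=20, eps=1e-12):
    """NLL, ECE, and Brier score for predicted correctness
    probabilities p in [0, 1] against binary labels y."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    brier = np.mean((p - y) ** 2)
    # Expected calibration error over n_bins equal-length bins:
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            ece += m.mean() * abs(p[m].mean() - y[m].mean())
    return nll, ece, brier
```

All three metrics are zero for a perfectly calibrated, perfectly accurate predictor and grow as confidence and accuracy diverge.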
## Appendix D Details for the Numerical Experiments
We ran all of our experiments on an AMD EPYC 7452 128-core processor with 4 $\times$ 48G NVIDIA A6000 GPUs.
### D.1 Dataset preparation
In the following we provide more information for the three tasks considered in our numerical experiments.
- Question answering. We follow Kuhn et al. (2023) and use the CoQA and TriviaQA (Joshi et al., 2017) datasets. The CoQA task requires the LLM to answer questions by understanding a provided passage, while TriviaQA requires the LLM to answer questions from its pre-training knowledge. We adopt Rouge-1 (Lin and Och, 2004a) as the scoring function $s(\cdot,\cdot)$ and label a response $\bm{y}_{i}$ as correct if $s(\bm{y}_{i},\bm{y}_{i,\text{true}})\geq 0.3$ and incorrect otherwise.
- Multiple choice. We consider the Massive Multitask Language Understanding (MMLU) dataset (Hendrycks et al., 2020), a collection of 15,858 questions covering 57 subjects spanning STEM, the humanities, the social sciences, and more. Due to the special structure of the dataset, both the generated output $\bm{y}_{i}$ and the correct answer $\bm{y}_{i,\text{true}}$ belong to $\{\text{A, B, C, D}\}$. Therefore, this task can also be regarded as a classification problem in which the LLM answers the question with one of the four candidate choices.
- Machine translation. We consider the WMT 2014 dataset (Bojar et al., 2014) for estimating the LLM's uncertainty on the machine translation task. The scoring function $s(\cdot,\cdot)$ is chosen to be the BLEU score (Papineni et al., 2002; Lin and Och, 2004b), and the generated answer $\bm{y}_{i}$ is labeled as correct if $s(\bm{y}_{i},\bm{y}_{i,\text{true}})>0.3$ and incorrect otherwise.
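The labeling rule for the question-answering tasks can be sketched as below. Rouge-1 is simplified here to a plain unigram F-measure (the experiments use a standard Rouge implementation), and the function names are ours:

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap F-measure: a simplified stand-in for Rouge-1."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    overlap = sum((Counter(c) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def label_response(y_pred, y_true, threshold=0.3):
    """Label a response correct (1) iff s(y, y_true) >= threshold."""
    return int(rouge1_f(y_pred, y_true) >= threshold)
```

These binary labels are what the supervised calibration model is trained to predict.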
Prompt dataset generation. For all the tasks studied in this paper, we adopt few-shot prompting for the LLM. Specifically, the prompt provides $r$ examples so that the LLM learns the format of the response, as illustrated below. For the question-answering task, we construct the prompts without reusing any question-answering sample from the original dataset: Prompt 1 includes the 1st to $r$-th question-answering samples as the examples and the $(r+1)$-th sample as the target question-answering pair for the LLM; Prompt 2 then uses the $(r+2)$-th to $(2r+1)$-th samples as the examples and the $(2r+2)$-th sample as the target. However, as the test datasets of MMLU and WMT used for evaluation are not sufficiently large, we generate their prompts in a convolution-like (sliding-window) manner: Prompt 2 includes the 2nd to $(r+1)$-th samples as the examples and the $(r+2)$-th sample as the target question-answering pair.
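The two prompt-construction schemes can be sketched as follows (function, parameter, and mode names are ours):

```python
def build_prompts(samples, r, mode="disjoint"):
    """Group few-shot prompts from an ordered list of QA samples.

    mode="disjoint": prompt k uses samples [k*(r+1), k*(r+1)+r) as the
      examples and sample k*(r+1)+r as the target, so no sample is
      reused (as for the question-answering tasks).
    mode="sliding": prompt k uses samples [k, k+r) as the examples and
      sample k+r as the target, reusing samples in a convolution-like
      manner (as for the smaller MMLU/WMT evaluation sets).
    Returns a list of (examples, target) pairs.
    """
    prompts = []
    if mode == "disjoint":
        step = r + 1
        for k in range(0, len(samples) - r, step):
            prompts.append((samples[k:k + r], samples[k + r]))
    else:
        for k in range(len(samples) - r):
            prompts.append((samples[k:k + r], samples[k + r]))
    return prompts
```

With $N$ samples, the disjoint scheme yields about $N/(r+1)$ prompts while the sliding scheme yields $N-r$, which is why the latter is used for the smaller datasets.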
Dataset split. After generating the prompt-answering dataset, we split it into two parts, one for training the calibration model and one for evaluation/testing. For the MMLU and WMT datasets, we take the dataset generated from the original validation/test split. For the question-answering task, as the answers in the original TriviaQA test set are not available, we take the first 2000 generated prompt-answering pairs from the training set as the test set and use the remainder for training.
Prompting format. Here we give the prompting templates used for the different tasks. We use few-shot prompting, and the templates can always be roughly divided into four parts: introduction (empty only for WMT), examples, question, and answer, where the examples are simply $r$ distinct question-answer pairs in the same form as the question and answer parts. We feed the model the template string, excluding the reference answer, as input.
**CoQA**
```
Reading the passage and answer given questions accordingly.
Passage: {a passage in CoQA}
Examples: {r distinct QA pairs related to the given passage}
Q: {a new question related to the given passage}
A: {reference answer}
```
**TriviaQA**
```
Answer the question as following examples.
Examples: {r distinct QA pairs}
Q: {a new question}
A: {reference answer}
```
**MMLU**
```
You would be given a multiple-choice question paired with 4 choices (A-D). Choose one of them using letter A, B, C, or D as the correct answer to the question.
Here are some examples: {r distinct QA pairs}
Now answer the question: {a new question}
A: {answer sentence A}
B: {answer sentence B}
C: {answer sentence C}
D: {answer sentence D}
Answer: {reference answer (a letter)}
```
**WMT**
```
{r distinct QA pairs}
Q: What is the English translation of the following sentence? {a French sentence}
A: {reference answer (an English sentence)}
```
### D.2 Details of the training procedure
For the three regimes of our supervised approach presented in Section 3.3, the details of the supervised training procedure are as follows:
Gb-S. For the natural language generation tasks (question answering and machine translation), we train a random forest model with the input features listed in Table 5 (20 features in total). For the multiple-choice task, as the answer consists of a single token from {A, B, C, D}, we take the output logits of these 4 tokens (denoted as $\alpha_{\text{A}}$, $\alpha_{\text{B}}$, $\alpha_{\text{C}}$, and $\alpha_{\text{D}}$) after feeding the question prompt $\bm{x}$ to the LLM. Then, we obtain the probability of each choice as follows:
$$
p_{\theta}(y|\bm{x})=\frac{\exp(\alpha_{y})}{\sum_{y^{\prime}\in\{\text{A},\text{B},\text{C},\text{D}\}}\exp(\alpha_{y^{\prime}})},\quad\forall y\in\{\text{A},\text{B},\text{C},\text{D}\}.
$$
We then use 5 features as the input to Gb-S: the entropy of this distribution and the four probability values sorted in descending order.
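These five multiple-choice features can be computed with a minimal sketch like the following (function name is ours):

```python
import numpy as np

def mc_features(logits):
    """Five Gb-S features for the multiple-choice task from the output
    logits of tokens A, B, C, D: the entropy of the softmax
    distribution plus the four probabilities sorted descending."""
    z = np.asarray(logits, dtype=float)
    p = np.exp(z - z.max())   # numerically stable softmax
    p /= p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return np.concatenate(([entropy], np.sort(p)[::-1]))
```

A near-uniform distribution over the four choices gives entropy close to $\log 4$, signaling high uncertainty; a peaked distribution gives entropy close to zero.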
| Feature | Over the answer tokens | Over the question tokens |
| --- | --- | --- |
| Max Ent | $\max_{j\in\{1,\dots,m\}}\ H(p_{\theta}(\cdot\vert\bm{x},\bm{y}_{1:j-1}))$ | $\max_{j\in\{1,\dots,n\}}\ H(p_{\theta}(\cdot\vert\bm{x}_{1:j-1}))$ |
| Min Ent | $\min_{j\in\{1,\dots,m\}}\ H(p_{\theta}(\cdot\vert\bm{x},\bm{y}_{1:j-1}))$ | $\min_{j\in\{1,\dots,n\}}\ H(p_{\theta}(\cdot\vert\bm{x}_{1:j-1}))$ |
| Avg Ent | $\frac{1}{m}\sum_{j=1}^{m}H(p_{\theta}(\cdot\vert\bm{x},\bm{y}_{1:j-1}))$ | $\frac{1}{n}\sum_{j=1}^{n}H(p_{\theta}(\cdot\vert\bm{x}_{1:j-1}))$ |
| Std Ent | $\sqrt{\frac{\sum_{j=1}^{m}\left(H(p_{\theta}(\cdot\vert\bm{x},\bm{y}_{1:j-1}))-\text{Avg Ent}\right)^{2}}{m-1}}$ | $\sqrt{\frac{\sum_{j=1}^{n}\left(H(p_{\theta}(\cdot\vert\bm{x}_{1:j-1}))-\text{Avg Ent}\right)^{2}}{n-1}}$ |
| Max Likelihood | $\max_{j\in\{1,\dots,m\}}\ -\log p_{\theta}(y_{j}\vert\bm{x},\bm{y}_{1:j-1})$ | $\max_{j\in\{1,\dots,n\}}\ -\log p_{\theta}(x_{j}\vert\bm{x}_{1:j-1})$ |
| Min Likelihood | $\min_{j\in\{1,\dots,m\}}\ -\log p_{\theta}(y_{j}\vert\bm{x},\bm{y}_{1:j-1})$ | $\min_{j\in\{1,\dots,n\}}\ -\log p_{\theta}(x_{j}\vert\bm{x}_{1:j-1})$ |
| Avg Likelihood | $\frac{1}{m}\sum_{j=1}^{m}-\log p_{\theta}(y_{j}\vert\bm{x},\bm{y}_{1:j-1})$ | $\frac{1}{n}\sum_{j=1}^{n}-\log p_{\theta}(x_{j}\vert\bm{x}_{1:j-1})$ |
| Std Likelihood | $\sqrt{\frac{\sum_{j=1}^{m}\left(-\log p_{\theta}(y_{j}\vert\bm{x},\bm{y}_{1:j-1})-\text{Avg Likelihood}\right)^{2}}{m-1}}$ | $\sqrt{\frac{\sum_{j=1}^{n}\left(-\log p_{\theta}(x_{j}\vert\bm{x}_{1:j-1})-\text{Avg Likelihood}\right)^{2}}{n-1}}$ |
| Avg Prob | $\frac{1}{m}\sum_{j=1}^{m}p_{\theta}(y_{j}\vert\bm{x},\bm{y}_{1:j-1})$ | $\frac{1}{n}\sum_{j=1}^{n}p_{\theta}(x_{j}\vert\bm{x}_{1:j-1})$ |
| Std Prob | $\sqrt{\frac{\sum_{j=1}^{m}\left(p_{\theta}(y_{j}\vert\bm{x},\bm{y}_{1:j-1})-\text{Avg Prob}\right)^{2}}{m-1}}$ | $\sqrt{\frac{\sum_{j=1}^{n}\left(p_{\theta}(x_{j}\vert\bm{x}_{1:j-1})-\text{Avg Prob}\right)^{2}}{n-1}}$ |
Table 5: Grey-box features used for the supervised task of uncertainty estimation for LLMs.
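Each column of Table 5 reduces a sequence of next-token distributions to ten summary statistics; applied once to the answer tokens and once to the question tokens, this yields the 20 grey-box features. A numpy sketch (function and variable names are ours):

```python
import numpy as np

def sequence_features(token_dists, token_ids):
    """The ten grey-box aggregates of Table 5 for one token sequence.

    token_dists: (T, V) next-token probability distributions
    token_ids:   (T,) the token actually realized at each step
    Returns [max/min/avg/std entropy, max/min/avg/std negative
    log-likelihood, avg/std probability] (std uses the T-1 denominator,
    matching the table).
    """
    P = np.asarray(token_dists, dtype=float)
    ent = -np.sum(P * np.log(P + 1e-12), axis=1)
    prob = P[np.arange(len(token_ids)), token_ids]
    nll = -np.log(prob + 1e-12)
    return np.array([
        ent.max(), ent.min(), ent.mean(), ent.std(ddof=1),
        nll.max(), nll.min(), nll.mean(), nll.std(ddof=1),
        prob.mean(), prob.std(ddof=1),
    ])
```

Concatenating `sequence_features` of the answer tokens with those of the question tokens gives the 20-dimensional Gb-S input for the natural language generation tasks.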
Wb-S. The dimension of a hidden layer of an LLM is typically high (e.g., 4096 for LLaMA2-7B), which may prevent the calibration model from capturing the uncertainty information in the activations, especially with limited training samples. Thus, before training the model, we first perform feature selection. We keep all the features used in Gb-S and select another 300 features (neural nodes): (i) we train a Lasso model on all the features and select the 100 neural nodes with the largest absolute coefficients; (ii) we select another 100 features with the largest mutual information between the neural node and the label (correct or not); (iii) we select another 100 features with the largest absolute Pearson correlation coefficient. After the feature selection, we train a random forest model to predict whether the response is correct based on the selected features.
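A simplified scikit-learn sketch of this selection pipeline (function name and Lasso penalty are our choices; the experiments take 100 features per criterion, and the union here is not deduplicated into exactly 300 features):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

def select_and_fit(X, y, k=100, seed=0):
    """Pick top-k features by |Lasso coefficient|, top-k by mutual
    information with the label, and top-k by |Pearson correlation|,
    then fit a random forest on their union."""
    y = np.asarray(y)
    lasso = Lasso(alpha=1e-3, max_iter=5000).fit(X, y)
    by_lasso = np.argsort(-np.abs(lasso.coef_))[:k]
    mi = mutual_info_classif(X, y, random_state=seed)
    by_mi = np.argsort(-mi)[:k]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    by_corr = np.argsort(-np.abs(corr))[:k]
    keep = np.unique(np.concatenate([by_lasso, by_mi, by_corr]))
    rf = RandomForestClassifier(n_estimators=50, random_state=seed)
    rf.fit(X[:, keep], y)
    return keep, rf
```

At prediction time, the same `keep` indices are applied to the test features before calling `rf.predict_proba`.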
In the experiment section of the main text, the features in the Wb-S for natural language generation tasks include (i) all the features used in the Gb-S, (ii) the hidden activations of the last token of the question from the middle layer (LLaMA2-7B or LLaMA3-8B: 16th layer; Gemma-7B: 14th layer), and (iii) the hidden activations of the last token of the answer from the middle layer. Therefore, in these natural language generation tasks, the dimension is 8212 for LLaMA2-7B/LLaMA3-8B and 6164 for Gemma-7B.
The features in the Wb-S for the multiple-choice task include (i) all the features used in the Gb-S and (ii) the hidden activations of the last token of the answer (letter A, B, C, or D) from the middle layer. The dimension is 4101 for LLaMA2-7B/LLaMA3-8B and 3077 for Gemma-7B.
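For readers reproducing the middle-layer features with the Hugging Face `transformers` stack, the extraction reduces to indexing the `hidden_states` tuple returned when a model is called with `output_hidden_states=True` (this call pattern is our assumption about the toolchain, not part of the paper); a minimal sketch:

```python
import numpy as np

def last_token_activation(hidden_states, layer):
    """Select the last-token activation of one layer from a `hidden_states`
    tuple as returned by a Hugging Face causal LM called with
    output_hidden_states=True: entry 0 is the embedding output and entry i
    is the output of transformer block i, each of shape
    (batch, seq_len, hidden_size). The text uses the middle layer, e.g.,
    layer 16 for LLaMA2-7B/LLaMA3-8B and layer 14 for Gemma-7B."""
    return hidden_states[layer][0, -1]
```

Since the helper only indexes, it works equally on `numpy` arrays and `torch.Tensor` entries.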
Notably, there are many choices of the hidden activations employed in the Wb-S. Besides what has been shown in Section B, we provide further discussion in Section E.
Bb-S. The idea behind building a supervised calibration model for a black-box LLM is to feed the question and the provided response into another open-source LLM and use that model's hidden layers and output distributions as features. All the features available in the Wb-S are therefore also available from the open-source LLM, and we simply take the corresponding features from the open-source LLM in the Bb-S. Hence, in the natural language generation tasks, the input dimension of the calibration model is 4116 for Gemma-2B (the hidden activations of the question and answer plus 20 entropy- and likelihood-related features, $2\times 2048+20$), 6164 for Gemma-7B, 8212 for LLaMA2-7B/LLaMA3-8B, and 10260 for LLaMA2-13B. In the multiple-choice task, the dimension is 2053 for Gemma-2B (the hidden activation of the answer plus the 5 entropy- and probability-related features used in the Gb-S), 3077 for Gemma-7B, 4101 for LLaMA2-7B/LLaMA3-8B, and 5125 for LLaMA2-13B.
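Assembling the Bb-S input is then just a concatenation; a minimal sketch (function name ours), where a surrogate hidden size of 2048 reproduces the $2\times 2048+20$ arithmetic in the text:

```python
import numpy as np

def assemble_bbs_features(q_act, a_act, grey_box):
    """Concatenate the surrogate open-source LLM's last-token hidden
    activations of the question and answer with its grey-box features."""
    return np.concatenate([np.asarray(q_act), np.asarray(a_act),
                           np.asarray(grey_box)])
```

With 2048-dimensional activations and 20 grey-box features this yields a 4116-dimensional vector; for the multiple-choice task only the answer activation and the 5 grey-box features are concatenated.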
For all these methods, we employ a random forest (Breiman, 2001), using the implementation from the scikit-learn package (Pedregosa et al., 2011), to estimate the uncertainty. The hyperparameters are set as [n_estimators=150, random_state=0, max_depth=8, verbose=2, max_features=45] if the number of selected features is at least 100, and [n_estimators=100, random_state=0, max_depth=4, verbose=2] otherwise.
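With scikit-learn, this model construction can be sketched as (helper name ours):

```python
from sklearn.ensemble import RandomForestClassifier

def make_calibration_model(n_features):
    """Random forest with the hyperparameters reported in the text;
    the branch depends on whether at least 100 features were selected."""
    if n_features >= 100:
        return RandomForestClassifier(n_estimators=150, random_state=0,
                                      max_depth=8, verbose=2, max_features=45)
    return RandomForestClassifier(n_estimators=100, random_state=0,
                                  max_depth=4, verbose=2)
```

Fitting on the selected features and the binary correctness labels, then reading `predict_proba`, gives the confidence score used throughout.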
## Appendix E Additional results and visualizations
In Section B, we show the advantage of utilizing the hidden activations of the answer from the middle layer of the LLM to estimate the uncertainty in Wb-S. In this section, we further discuss the impact of employing the hidden activations from the question in the Wb-S.
The motivation stems from the following observation. Within the transformer architecture, the hidden activation of the question's last token (the question's activation) is forwarded to produce the hidden activation of the answer's last token (the answer's activation), so the answer's activation already incorporates the question's information. Nevertheless, it has been discovered that concatenating the question's activation with the answer's activation offers additional insight into the answer's uncertainty (Duan et al., 2024). We would like to further investigate the effectiveness of incorporating the question's activation along with the answer's activation in our supervised setting.
We experiment with three feature combinations in our supervised setting: (i) Question: the last-token hidden activation of the question from the middle layer, together with the entropy- or probability-related features of the question (the 10 features listed in the right column of Table 5) for natural language generation tasks, or with all the Gb-S features otherwise; (ii) Answer: the last-token hidden activation of the answer from the middle layer, together with all the features used in Gb-S; (iii) Question-Answer: the last-token hidden activations of both the question and the answer from the middle layer, together with all the Gb-S features. We compare their performance with Gb-S in Figure 8 and present the following observations.
Question itself cannot capture enough uncertainty information. From Figure 8, we observe that Gb-S outperforms Question in nearly all of these tasks. This implies that features derived from the question alone cannot provide enough information about the uncertainty of the answer. This aligns with the inferior performance of the sample-based method (Kuhn et al., 2023) tested in the earlier sections, where the uncertainty score estimates the language model's uncertainty about the question. The result implies that the language model cannot capture the uncertainty from the question alone, without generating the answer.
Question's hidden activation does not add more uncertainty information. Again from Figure 8, by comparing the performance of Answer and Question-Answer, we find that including the question's activation has little impact on performance. This suggests that the uncertainty from the question has already been well encoded in the last-token activation of the answer.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Bar Chart: AUROC Performance of Different Feature Types Across Language Models and Tasks
### Overview
The image displays a set of three grouped bar charts comparing the AUROC (Area Under the Receiver Operating Characteristic curve) performance of four different feature types ("Gb-S", "Question", "Answer", "Question-Answer") extracted from three different large language models (Gemma-7B, LLaMA2-7B, LLaMA3-8B). The performance is evaluated across four distinct tasks or datasets: MMLU, TriviaQA, CoQA, and WMT-14.
### Components/Axes
* **Main Titles (Top of each subplot):**
* Left: "Features from Gemma-7B"
* Center: "Features from LLaMA2-7B"
* Right: "Features from LLaMA3-8B"
* **Y-Axis (Common to all subplots):**
* Label: "AUROC"
* Scale: Linear, ranging from 0.60 to 0.90, with major tick marks at 0.05 intervals (0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90).
* **X-Axis (Within each subplot):**
* Categories (from left to right): "MMLU", "TriviaQA", "CoQA", "WMT-14".
* **Legend (Centered at the bottom of the entire figure):**
* **Gb-S:** Represented by a white bar with a black outline.
* **Question:** Represented by a solid light green bar.
* **Answer:** Represented by a white bar with diagonal black hatching (\\).
* **Question-Answer:** Represented by a light green bar with diagonal black hatching (\\).
### Detailed Analysis
The analysis is segmented by subplot (model) and then by task category. Values are approximate visual estimates from the chart.
**1. Features from Gemma-7B (Left Subplot)**
* **MMLU:**
* Gb-S: ~0.78
* Question: ~0.76
* Answer: ~0.83
* Question-Answer: ~0.83
* *Trend:* Answer and Question-Answer features perform similarly and are highest, followed by Gb-S, then Question.
* **TriviaQA:**
* Gb-S: ~0.87
* Question: ~0.72
* Answer: ~0.88
* Question-Answer: ~0.88
* *Trend:* Answer and Question-Answer are highest and nearly identical. Gb-S is slightly lower. Question is the lowest by a significant margin.
* **CoQA:**
* Gb-S: ~0.74
* Question: ~0.62
* Answer: ~0.76
* Question-Answer: ~0.76
* *Trend:* Answer and Question-Answer are highest and equal. Gb-S is next. Question is the lowest.
* **WMT-14:**
* Gb-S: ~0.83
* Question: ~0.66
* Answer: ~0.85
* Question-Answer: ~0.85
* *Trend:* Answer and Question-Answer are highest and equal. Gb-S is slightly lower. Question is the lowest.
**2. Features from LLaMA2-7B (Center Subplot)**
* **MMLU:**
* Gb-S: ~0.70
* Question: ~0.69
* Answer: ~0.72
* Question-Answer: ~0.72
* *Trend:* Answer and Question-Answer are highest and equal. Gb-S and Question are very close and lower.
* **TriviaQA:**
* Gb-S: ~0.81
* Question: ~0.77
* Answer: ~0.90
* Question-Answer: ~0.90
* *Trend:* Answer and Question-Answer are highest and equal, reaching the top of the scale. Gb-S is next, followed by Question.
* **CoQA:**
* Gb-S: ~0.67
* Question: ~0.70
* Answer: ~0.81
* Question-Answer: ~0.81
* *Trend:* Answer and Question-Answer are highest and equal. Question is next, followed by Gb-S.
* **WMT-14:**
* Gb-S: ~0.73
* Question: ~0.67
* Answer: ~0.78
* Question-Answer: ~0.78
* *Trend:* Answer and Question-Answer are highest and equal. Gb-S is next, followed by Question.
**3. Features from LLaMA3-8B (Right Subplot)**
* **MMLU:**
* Gb-S: ~0.79
* Question: ~0.79
* Answer: ~0.83
* Question-Answer: ~0.83
* *Trend:* Answer and Question-Answer are highest and equal. Gb-S and Question are very close and lower.
* **TriviaQA:**
* Gb-S: ~0.86
* Question: ~0.76
* Answer: ~0.88
* Question-Answer: ~0.88
* *Trend:* Answer and Question-Answer are highest and equal. Gb-S is next, followed by Question.
* **CoQA:**
* Gb-S: ~0.74
* Question: ~0.68
* Answer: ~0.77
* Question-Answer: ~0.77
* *Trend:* Answer and Question-Answer are highest and equal. Gb-S is next, followed by Question.
* **WMT-14:**
* Gb-S: ~0.73
* Question: ~0.62
* Answer: ~0.75
* Question-Answer: ~0.75
* *Trend:* Answer and Question-Answer are highest and equal. Gb-S is next, followed by Question.
### Key Observations
1. **Consistent Superiority of Answer-Based Features:** Across all three models and all four tasks, the "Answer" and "Question-Answer" feature types consistently achieve the highest AUROC scores. Their performance is virtually identical in every single case.
2. **Performance of Gb-S and Question Features:** The "Gb-S" and "Question" features generally perform worse than the answer-inclusive features. The "Question" feature is frequently the lowest-performing, with a notable exception in LLaMA2-7B's CoQA task where it outperforms Gb-S.
3. **Task Difficulty Variation:** The absolute AUROC values vary by task. For example, performance on TriviaQA tends to be higher (often >0.85 for top features) compared to CoQA or WMT-14, suggesting differences in task difficulty or the suitability of the features for those tasks.
4. **Model Comparison:** While all models show the same general pattern, the absolute performance levels differ. For instance, LLaMA2-7B achieves the highest single score (~0.90 on TriviaQA with Answer/QA features), while its performance on MMLU is the lowest among the three models for that task.
### Interpretation
This chart investigates the efficacy of different textual feature types (derived from questions, answers, or both) for some downstream evaluation metric (measured by AUROC) across various language models and benchmarks.
The central finding is that **features incorporating the "Answer" component, either alone or combined with the "Question", are significantly more informative or predictive** than features based solely on the "Question" or the grey-box baseline "Gb-S". The identical performance of "Answer" and "Question-Answer" features suggests that adding the question to the answer provides no marginal benefit for this specific evaluation metric; the answer representation alone carries the critical signal.
This implies that for the tasks evaluated (MMLU, TriviaQA, CoQA, WMT-14), the model's generated or retrieved answer contains the most discriminative information. The question text, while necessary to prompt the model, does not contribute additional predictive power once the answer is available. The "Gb-S" features serve as a middle-ground baseline, often outperforming pure "Question" features but falling short of answer-based ones.
The consistency of this pattern across three distinct model architectures (Gemma, LLaMA2, LLaMA3) strengthens the conclusion that this is a robust phenomenon related to the nature of the tasks and the information content of answers versus questions, rather than an artifact of a specific model.
</details>
Figure 8: Performance comparison of using last-token middle layer hidden activations of the answer (Answer) or the concatenation of the question and answer (Question-Answer) as features in the Wb-S, where the features in Gb-S are also included in Wb-S. In the natural language generation tasks, the dimensions of Gb-S, Question, Answer, and Question-Answer for Gemma-7B are 20, 3082, 3092, and 6164, while for LLaMA2-7B or LLaMA3-8B they are 20, 4106, 4116, and 8212, respectively. In the MMLU task, for Gemma-7B they are 5, 3077, 3077, and 6149, while for LLaMA2-7B or LLaMA3-8B, they are 5, 4101, 4101, and 8197, respectively.
The middle layer is still better than the last layer. In Section B, Figure 3 shows that when using the hidden activation of the answer in the Wb-S, the middle layer of the LLM is a better choice than the last layer. The next question is: Does this conclusion still hold for using the concatenated hidden activations of the question and answer? We depict the experiment result in Figure 9, which is consistent with the conclusion drawn from Figure 3.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Bar Chart Comparison: AUROC Performance of Three Language Models Across Datasets
### Overview
The image displays three grouped bar charts arranged horizontally, comparing the AUROC (Area Under the Receiver Operating Characteristic curve) performance of features extracted from three different large language models (LLMs) on three distinct datasets. The models are Gemma-7B, LLaMA2-7B, and LLaMA3-8B. For each model, performance is measured using four different feature extraction strategies, represented by bars with distinct colors and patterns.
### Components/Axes
* **Chart Titles (Top):** "Features from Gemma-7B" (left), "Features from LLaMA2-7B" (center), "Features from LLaMA3-8B" (right).
* **Y-Axis (Left):** Labeled "AUROC". The scale runs from approximately 0.75 to 0.90, with major tick marks at 0.75, 0.80, 0.85, and 0.90.
* **X-Axis (Bottom of each subplot):** Lists three datasets: "TriviaQA", "CoQA", and "WMT-14".
* **Legend (Bottom of entire figure):** Positioned below the three charts. It defines four feature extraction methods:
* **Green bar with diagonal stripes (\\):** "Avg token, mid layer"
* **Red bar with diagonal stripes (\\):** "Avg token, last layer"
* **Green bar with dots (.):** "Last token, mid layer"
* **Red bar with dots (.):** "Last token, last layer"
### Detailed Analysis
**1. Features from Gemma-7B (Left Chart):**
* **TriviaQA:** All four methods perform similarly high, with AUROC values clustered around 0.88. The "Last token, mid layer" (green dots) appears marginally highest (~0.885), while "Last token, last layer" (red dots) is slightly lower (~0.875).
* **CoQA:** Performance is notably lower than for TriviaQA. Values range from ~0.75 to ~0.76. "Last token, mid layer" (green dots) is the highest (~0.76), while the other three methods are very close, around 0.75.
* **WMT-14:** Performance is intermediate. "Avg token, mid layer" (green stripes) is highest (~0.855). "Last token, mid layer" (green dots) is close behind (~0.85). "Avg token, last layer" (red stripes) and "Last token, last layer" (red dots) are slightly lower, around 0.845-0.85.
**2. Features from LLaMA2-7B (Center Chart):**
* **TriviaQA:** Shows the highest overall performance in the entire figure. All four methods are very close, with AUROC values near or at 0.90. "Avg token, last layer" (red stripes) and "Last token, mid layer" (green dots) appear to be at the peak (~0.90).
* **CoQA:** Performance is lower. "Last token, mid layer" (green dots) is the highest (~0.81). "Avg token, mid layer" (green stripes) is next (~0.80). "Last token, last layer" (red dots) is ~0.80, and "Avg token, last layer" (red stripes) is the lowest (~0.79).
* **WMT-14:** Performance is the lowest among the three datasets for this model. "Last token, mid layer" (green dots) is highest (~0.78). "Last token, last layer" (red dots) is close (~0.775). "Avg token, last layer" (red stripes) is ~0.76, and "Avg token, mid layer" (green stripes) is the lowest (~0.755).
**3. Features from LLaMA3-8B (Right Chart):**
* **TriviaQA:** Performance is high but slightly lower than LLaMA2-7B on the same task. All four methods are tightly clustered around 0.87-0.875.
* **CoQA:** Performance is the lowest among the three models for this dataset. "Last token, mid layer" (green dots) is highest (~0.77). "Last token, last layer" (red dots) is ~0.76. "Avg token, mid layer" (green stripes) and "Avg token, last layer" (red stripes) are both around 0.755.
* **WMT-14:** Performance is the lowest across all models and datasets. "Last token, mid layer" (green dots) is highest (~0.745). "Last token, last layer" (red dots) is ~0.74. "Avg token, mid layer" (green stripes) is ~0.74, and "Avg token, last layer" (red stripes) is the lowest (~0.73).
### Key Observations
1. **Dataset Difficulty:** Across all three models, **TriviaQA consistently yields the highest AUROC scores** (approx. 0.87-0.90), followed by **WMT-14** (approx. 0.74-0.855), with **CoQA generally being the most challenging** (approx. 0.73-0.81).
2. **Model Comparison:** **LLaMA2-7B** appears to achieve the peak performance on TriviaQA (~0.90). **Gemma-7B** shows strong and consistent performance on TriviaQA and WMT-14. **LLaMA3-8B** shows a more pronounced drop in performance on the CoQA and WMT-14 datasets compared to the other two models.
3. **Feature Extraction Strategy:** The **"Last token, mid layer" (green dots) strategy is frequently the top or near-top performer** across most model-dataset combinations (e.g., Gemma on CoQA/WMT-14, LLaMA2 on CoQA/WMT-14, LLaMA3 on all). Using the **last layer (red bars) often results in slightly lower performance** compared to using the mid-layer (green bars) for the same token strategy.
4. **Token Strategy:** There is no universal winner between "Avg token" and "Last token" strategies; their relative performance varies by model and dataset. However, the "Last token" strategies (dotted bars) show a slight edge in more instances.
### Interpretation
This chart evaluates how effectively internal representations (features) from different LLMs can distinguish between correct and incorrect outputs on question-answering (TriviaQA, CoQA) and translation (WMT-14) tasks. The AUROC metric quantifies this discriminative power.
The data suggests that:
* **Task-Specific Feature Quality:** The features extracted from these models are most discriminative for the factual recall task (TriviaQA) and least for the conversational QA task (CoQA). This could indicate that the models' internal states more cleanly encode factual correctness than the nuanced correctness required in conversational contexts.
* **Layer and Token Selection Matters:** The consistent, often superior performance of features from the **mid-layer** (especially using the last token) implies that the most useful signal for error detection may reside in intermediate processing stages, not necessarily the final output layer. This aligns with the "layerwise" understanding of LLMs, where different layers specialize in different types of processing.
* **Model Architecture/Training Impact:** The performance differences between models (e.g., LLaMA2-7B's peak on TriviaQA vs. LLaMA3-8B's lower scores on CoQA/WMT-14) highlight that model scale (7B vs 8B parameters) is not the sole determinant of feature quality for these tasks. Differences in training data, architecture, or fine-tuning likely contribute significantly.
* **Practical Implication:** For building a classifier or detector that uses LLM features (e.g., for detecting hallucinations or errors), this analysis indicates that **extracting features from the mid-layer using the last token representation is a robust starting point**. The choice of source model should be guided by the specific target task (e.g., LLaMA2-7B for trivia-like tasks).
</details>
Figure 9: Performance comparison of using question-answer concatenated hidden activations from different tokens and layers as features in the Wb-S method. Scores are normalized in [0,1], where a lower value indicates larger uncertainty. For Gemma-7B, the dimension of the Wb-S input is 6164 (3072 from the question, 3072 from the answer, and 20 from the grey-box features). For LLaMA2-7B/LLaMA3-8B, it is 8212.
Our method better characterizes the uncertainty. We find that the grey-box and white-box features enhance the ability to characterize the dataset, so that the distribution of the generated output's uncertainty score correlates better with the output's correctness. As shown in Figure 10, with black-box features the uncertainty-score distributions for true and false answers are poorly separated, and the true answers' distribution is even close to uniform. With grey-box and white-box features, the distributions of the uncertainty scores are more clearly separated between true and false answers. The results show that the supervised learning approach not only achieves better AUROC but also learns to better separate the distributions of the uncertainty scores.
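The link between distributional separation and AUROC can be illustrated with synthetic confidence scores (toy values, not the paper's data): fully separated score distributions give AUROC 1, while heavily overlapping ones stay near 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
correct = np.array([1] * 100 + [0] * 100)   # 1 = true answer, 0 = false answer

# Black-box-like scores: the two classes draw from the same distribution.
overlapping = rng.uniform(0.0, 1.0, size=200)

# White-box-like scores: high when correct, low when wrong, no overlap.
separated = np.concatenate([rng.uniform(0.6, 1.0, 100),
                            rng.uniform(0.0, 0.3, 100)])

auc_overlap = roc_auc_score(correct, overlapping)   # hovers around 0.5
auc_separated = roc_auc_score(correct, separated)   # exactly 1.0 (no overlap)
```

AUROC is exactly the probability that a randomly chosen true answer receives a higher score than a randomly chosen false one, which is why the separation visible in Figure 10 and the AUROC numbers move together.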
<details>
<summary>x11.png Details</summary>

### Visual Description
## Histograms: Distribution of Uncertainty Scores (US) for True vs. False Answers
### Overview
The image displays a set of four horizontally arranged histograms. Each histogram compares the frequency distribution of an uncertainty score (US) for two categories: "true answer" (blue) and "false answer" (red). The plots share a common y-axis representing the number of samples. The overall purpose is to visualize how the uncertainty score distributions differ between correct and incorrect answers across four distinct metrics.
### Components/Axes
* **Legend:** Located at the top center of the entire figure. It defines two categories:
* **true answer:** Represented by blue bars.
* **false answer:** Represented by red/salmon bars.
* Where the distributions overlap, the color appears as a dark purple/magenta.
* **Y-Axis (Common to all plots):**
* **Label:** "# Samples" (rotated vertically on the far left).
* **Scale:** Linear scale from 0 to 150, with major tick marks at 0, 25, 50, 75, 100, 125, and 150.
* **X-Axes (Individual for each subplot):** Each subplot has its own x-axis label and scale.
* **Scale:** All four x-axes are linear and range from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Labels (from left to right):**
1. "US of Entropy"
2. "US of Bb-S"
3. "US of Gb-S"
4. "US of Wb-S"
### Detailed Analysis
**1. US of Entropy (Leftmost Plot):**
* **False Answer (Red):** Shows a unimodal distribution peaking between approximately 0.2 and 0.3. The peak frequency is around 50-55 samples. The distribution tapers off towards both 0.0 and 1.0.
* **True Answer (Blue):** Shows a broader, more uniform distribution across the entire range from 0.0 to 1.0. It has a slight concentration between 0.2 and 0.6, with frequencies generally between 15-30 samples. There is no single sharp peak.
* **Overlap:** Significant overlap occurs between 0.1 and 0.5, where both distributions have moderate frequency.
**2. US of Bb-S (Second Plot):**
* **False Answer (Red):** Exhibits a sharp, high peak near 0.1, reaching approximately 85-90 samples. The distribution is heavily right-skewed, with very few samples above 0.4.
* **True Answer (Blue):** Shows a bimodal or rising distribution. It has a small presence at low values but increases steadily from about 0.4, reaching its highest point (approximately 40 samples) near 1.0.
* **Overlap:** Minimal overlap. The distributions are largely separated, with false answers dominating the low-score region (<0.3) and true answers dominating the high-score region (>0.6).
**3. US of Gb-S (Third Plot):**
* **False Answer (Red):** Similar to Bb-S, it has a sharp peak near 0.1 (approx. 70-75 samples) and is right-skewed.
* **True Answer (Blue):** Shows a distinct, sharp peak at the extreme high end (1.0), reaching approximately 80 samples. There is also a smaller, secondary cluster of samples between 0.4 and 0.8.
* **Overlap:** Very limited overlap, primarily below 0.3. This metric shows the strongest separation, with a clear cluster of true answers at the maximum score.
**4. US of Wb-S (Rightmost Plot):**
* **False Answer (Red):** Has the highest peak of all plots, reaching over 150 samples (the y-axis limit) at a very low score near 0.05. The distribution is extremely right-skewed.
* **True Answer (Blue):** Shows a rising trend similar to Bb-S, with frequencies increasing from about 0.4 and peaking between 0.8 and 1.0 at around 40-45 samples.
* **Overlap:** Almost no overlap. The false answer distribution is almost entirely confined to scores below 0.2, while the true answer distribution is concentrated above 0.6.
### Key Observations
1. **Separation Trend:** The degree of separation between the "true" and "false" answer distributions increases from left to right. "US of Entropy" shows the most overlap, while "US of Wb-S" and "US of Gb-S" show the clearest separation.
2. **False Answer Behavior:** Across all metrics, false answers consistently cluster at lower uncertainty scores (typically < 0.3). The peak is sharpest and highest for "US of Wb-S".
3. **True Answer Behavior:** True answers show more varied behavior. For "Entropy," they are spread out. For "Bb-S" and "Wb-S," they trend toward higher scores. For "Gb-S," they show a strong, specific peak at the maximum score of 1.0.
4. **Metric Sensitivity:** "Gb-S" and "Wb-S" appear to be the most discriminative metrics, as they produce the clearest bimodal separation between the two answer types.
### Interpretation
This visualization suggests that the four uncertainty score (US) metrics have varying effectiveness in distinguishing between correct ("true") and incorrect ("false") answers from a model or system.
* **"US of Entropy"** is a poor discriminator. The broad, overlapping distributions indicate that entropy-based uncertainty does not strongly correlate with answer correctness in this dataset. A model could be equally uncertain (have similar entropy) whether it is right or wrong.
* **"US of Bb-S", "US of Gb-S", and "US of Wb-S"** are much more effective. They demonstrate a clear pattern: **incorrect answers are associated with low uncertainty scores, while correct answers are associated with high uncertainty scores.** Since the scores are normalized so that a lower value indicates larger uncertainty (i.e., they act as confidence scores), this is the desired behavior: these metrics are uncertain when the answer is wrong and confident when it is right.
* The **"Gb-S" metric** is particularly notable for its sharp peak of true answers at US=1.0, suggesting it may perfectly capture a state of maximum uncertainty or a specific model behavior for a subset of correct answers.
* **Practical Implication:** If the goal is to flag potential errors, using a threshold on "US of Wb-S" or "US of Gb-S" (e.g., flagging answers with US < 0.2) would be highly effective, as it would capture most false answers while rejecting very few true ones. The "US of Entropy" would be ineffective for this task.
**Language Note:** All text in the image is in English.
</details>
Figure 10: Uncertainty scores of different methods on the MMLU dataset for answers provided by the Gemma-7B model, where scores are normalized in [0,1], and US is short for uncertainty score. False answer refers to the sample where the choice assigned with maximum probability by the LLM is false, while true answer represents the sample answered correctly.
## Appendix F Examples
In this section, we show some examples of the wrong answers the LLM generated and explore how different methods understand the LLM's uncertainty. The wrong answers are selected from the samples where the LLM makes wrong predictions.
Since we let the LLM output the greedy answer, which could be wrong, we expect an ideal uncertainty estimation model to output a high confidence score when the LLM generates a correct answer and a low confidence score when it outputs a wrong answer. Looking at different wrong answers generated by the LLM, we note that although our approach sometimes gives a high confidence score to a wrong answer, at other times it exhibits desirable properties, such as giving higher uncertainty scores to better answers and low confidence scores when the LLM does not know the answer.
Our illustrative examples are generated as follows: for questions where the LLM's greedy response is incorrect, we also extract the correct answer from the dataset and additional answers randomly generated by the LLM with lower probabilities than the greedy answer. Along with these answers, we compute the corresponding metrics and features so that we can observe how they behave for different outputs. We conduct this experiment on the test set of TriviaQA, in which both the questions and answers are short. We summarize the behaviors of our uncertainty estimation model as follows:
- Confidently support a wrong answer. The LLM is confident that the wrong greedy answer is true and assigns it a high confidence score. Moreover, it gives low uncertainty scores to the correct answers, suggesting a lack of knowledge about these questions. We give examples from LLaMA2-7B and Gemma-7B in Figures 11 and 12. Note that in both examples, our method assigns a low uncertainty score to the correct answer and a much higher uncertainty score to the wrong answer. In contrast, the unsupervised grey-box methods assign higher uncertainty scores to the correct answer.
- Confidently reject a wrong answer. We give examples from LLaMA2-7B and Gemma-7B in Figures 13 and 14. The uncertainty estimation model gives a higher score to the true answer or to answers that are better than the wrong answer. This means that for these questions, our model actually knows which answer is better and assigns uncertainty scores accordingly. In contrast, the unsupervised methods tend to assign much higher uncertainty scores to the greedy (wrong) answer.
- Unconfident about any answer. Due to a lack of knowledge, the LLM may not know the true answer. We show examples in Figures 15 and 16. In these examples, the model assigns almost the same uncertainty score to all of the generated answers, including the true one. In this scenario, the uncertainty estimation model is uncertain about the correctness of any answer. Interestingly, the unsupervised methods exhibit similar behavior, assigning nearly identical scores across the answers as well, albeit at much higher uncertainty levels. This differs from the previous two cases, where the unsupervised methods behaved differently from our uncertainty estimation model.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Data Table with Accompanying Text: Example of a Confidently Wrong Answer from LLaMA2-7B
### Overview
The image is a figure, likely from a research paper or technical report, illustrating an example of a language model (LLaMA2-7B) providing a "confidently wrong" answer to a factual question. It presents the question, a reference answer, several model-generated answers, and a table of associated confidence and similarity metrics for each answer.
### Components/Axes
The image is structured in three main vertical sections:
1. **Title (Top Center):** "An example of a confidently wrong answer (LM: LLaMA2-7B)"
2. **Question & Answer Block (Upper Section):** A beige, rounded rectangle containing:
* **Question:** "Who had a 70s No 1 hit with Billy, Don't Be A Hero?"
* **Ref answer:** "Bo Donaldson & The Heywoods" (displayed in blue text).
3. **Model Answers (Middle Section):** A list of answers generated by the model, each preceded by a small robot icon.
* **Greedy answer:** "Paper Lace" (displayed in red text).
* **Answer 1:** "Bo Donaldson"
* **Answer 2:** "Paperchaser"
* **Answer 3:** "Paper Moon"
4. **Data Table (Lower Section):** A table with 5 rows and 10 columns. The columns are:
* (Row Label Column)
* Rouge-1
* Max Prob
* Avg Prob
* Max Ent
* Avg Ent
* Gb-S
* Wb-S
* Bb-S
* SU
* Ask4-conf
### Detailed Analysis
**Table Data Transcription:**
The table provides quantitative metrics for each answer listed above. The rows correspond to the answers, and the columns to different evaluation metrics.
| Row Label | Rouge-1 | Max Prob | Avg Prob | Max Ent | Avg Ent | Gb-S | Wb-S | Bb-S | SU | Ask4-conf |
|----------------|---------|----------|----------|---------|---------|------|------|------|------|-----------|
| **Ref answer** | 1 | 0.13 | 0.94 | 0.82 | 0.94 | 0.21 | 0.31 | | | |
| **Greedy answer** | 0 | 0.79 | 0.99 | 0.86 | 0.94 | 0.82 | 0.83 | 0.72 | 0.31 | 0 |
| **Answer 1** | 0.67 | 0.13 | 0.9 | 0.82 | 0.9 | 0.1 | 0.25 | | | |
| **Answer 2** | 0 | 0 | 0.81 | 0.7 | 0.82 | 0.08 | 0.12 | | | |
| **Answer 3** | 0 | 0 | 0.82 | 0.86 | 0.89 | 0.1 | 0.2 | | | |
*Note: Empty cells in the table indicate no data was provided for that metric-answer combination.*
**Key Metric Observations:**
* **Rouge-1:** Measures n-gram overlap with the reference. The reference answer has a perfect score of 1. "Answer 1" ("Bo Donaldson") has a partial overlap (0.67). The "Greedy answer" and others have 0 overlap.
* **Probability (Max/Avg Prob):** The "Greedy answer" has the highest maximum probability (0.79) and average probability (0.99), indicating the model assigned very high confidence to this incorrect token sequence. The reference answer has a much lower max probability (0.13).
* **Entropy (Max/Avg Ent):** Entropy measures uncertainty. Values are relatively high across all answers (0.7 to 0.94), suggesting the model's internal state had significant uncertainty at the token level, even for the high-probability greedy answer.
* **Similarity Scores (Gb-S, Wb-S, Bb-S, SU):** These are likely various semantic similarity metrics. The "Greedy answer" scores highest on Gb-S (0.82) and Wb-S (0.83), suggesting it is semantically similar to the reference in some embedding space, despite being factually wrong. "Answer 1" scores much lower on these metrics.
* **Ask4-conf:** Only the "Greedy answer" has a value here (0), which may represent a specific confidence calibration metric.
### Key Observations
1. **Confident Error:** The "Greedy answer" ("Paper Lace") is factually incorrect but is generated with the highest model confidence (Max Prob 0.79, Avg Prob 0.99).
2. **Partial Correctness:** "Answer 1" ("Bo Donaldson") is partially correct (part of the reference answer) and has a moderate Rouge-1 score (0.67) but very low model confidence (Max Prob 0.13).
3. **Semantic Proximity of Wrong Answer:** The incorrect "Greedy answer" has high semantic similarity scores (Gb-S, Wb-S), indicating the model may have retrieved or generated a conceptually related but factually distinct entity ("Paper Lace" was another 70s band with a hit about a soldier).
4. **Metric Discrepancy:** There is a stark disconnect between the model's internal confidence metrics (high for the wrong answer) and factual accuracy (Rouge-1 of 0).
### Interpretation
This figure demonstrates a critical failure mode in language models: **confident hallucination**. The model (LLaMA2-7B) selects "Paper Lace" as its top (greedy) answer with extremely high probability, despite it being wrong. The data suggests the model's decoding process prioritizes a semantically plausible and high-probability token sequence over factual correctness.
The high semantic similarity scores for the wrong answer imply the model's internal representations place "Paper Lace" close to the correct answer "Bo Donaldson & The Heywoods" in vector space, likely due to shared context (1970s, music, Billboard hits). However, this proximity does not translate to factual accuracy. The low confidence for the partially correct "Answer 1" further shows the model fails to properly weight the correct factual components.
This example underscores the limitation of relying solely on raw model probability or even semantic similarity for factual reliability. It highlights the need for techniques like retrieval augmentation, fact-checking modules, or improved training to align model confidence with truthfulness. The figure serves as a diagnostic tool, showing that a model can be simultaneously "right" in its semantic neighborhood and "wrong" in its specific factual output.
</details>
Figure 11: An example of LLaMA2-7B assigning a confidently wrong answer in the TriviaQA dataset. Scores are normalized in $[0,1]$, where a lower value indicates larger uncertainty. Every uncertainty estimation method gives the greedy answer a higher score than the true answer, yet the greedy answer is incorrect. The UK band Paper Lace did release a version of "Billy, Don't Be A Hero" in 1974, the same year as the version by Bo Donaldson & The Heywoods, but it was Bo Donaldson & The Heywoods (a U.S. band) whose version topped the charts as a No. 1 hit.
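The captions note that all scores are normalized into $[0,1]$. The excerpt does not spell out the normalization scheme, but a simple min-max rescaling over each candidate set, as a hypothetical sketch, would look like:

```python
def minmax_normalize(scores):
    """Rescale a list of raw confidence scores to [0, 1].

    After rescaling, the least confident candidate maps to 0 and the most
    confident to 1, so scores from different methods become comparable.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:                # degenerate case: all candidates tied
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(minmax_normalize([2.0, 5.0, 3.5]))  # [0.0, 1.0, 0.5]
```

Note that min-max normalization only preserves the *ranking* within one candidate set; absolute values are not comparable across questions.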
<details>
<summary>x13.png Details</summary>

### Visual Description
## Screenshot: Example of a Confidently Wrong Language Model Answer
### Overview
This image is a figure, likely from a research paper or technical report, illustrating an example where a language model (LM: Gemma-7B) provides an incorrect answer with high confidence. It presents a question, a reference answer, several model-generated answers, and a table of associated confidence and similarity metrics.
### Components/Axes
The image is structured into three main regions:
1. **Header/Title:** "An example of a confidently wrong answer (LM: Gemma-7B)"
2. **Question & Answer Block:** A beige-colored box containing the question and reference answer, followed by three model-generated answers.
3. **Metrics Table:** A data table comparing various metrics across the reference answer and the model-generated answers.
**Textual Content (Transcribed):**
* **Title:** An example of a confidently wrong answer (LM: Gemma-7B)
* **Question:** Which sitcom starred Leonard Rossiter in the role of a supermarket manager?
* **Ref answer:** Tripper's Day
* **Greedy answer:** Rising Damp
* **Answer 1:** Rising Damp.
* **Answer 2:** The Rise and Fall of Reginald Perrin
**Table Structure:**
The table has 10 columns and 5 rows (including the header row).
* **Column Headers (Metrics):** Rouge-1, Max Prob, Avg Prob, Max Ent, Avg Ent, Gb-S, Wb-S, Bb-S, SU, Ask4-conf
* **Row Headers (Answer Types):** Ref answer, Greedy answer, Answer 1, Answer 2
### Detailed Analysis
**Table Data Reconstruction:**
| Answer Type | Rouge-1 | Max Prob | Avg Prob | Max Ent | Avg Ent | Gb-S | Wb-S | Bb-S | SU | Ask4-conf |
|---------------|---------|----------|----------|---------|---------|------|------|------|------|-----------|
| **Ref answer** | 1 | 0.00 | 0.66 | 0.70 | 0.74 | 0.14 | 0.15 | 0.24 | | |
| **Greedy answer** | 0 | 0.76 | 0.99 | 0.90 | 0.94 | 0.93 | 0.86 | 0.89 | 0.46 | 1 |
| **Answer 1** | 0 | 0.02 | 0.87 | 0.81 | 0.88 | 0.60 | 0.40 | 0.86 | | |
| **Answer 2** | 0 | 0.05 | 0.91 | 0.89 | 0.93 | 0.68 | 0.46 | 0.64 | | |
*Note: Empty cells in the original table are represented as blank.*
### Key Observations
1. **Confidence vs. Accuracy Discrepancy:** The "Greedy answer" (Rising Damp) is incorrect (Rouge-1 = 0) but exhibits extremely high confidence scores: Max Prob (0.76), Avg Prob (0.99), and a perfect Ask4-conf score of 1.
2. **Reference Answer Profile:** The correct "Ref answer" (Tripper's Day) has a perfect Rouge-1 score of 1 but notably low probability scores (Max Prob = 0.00, Avg Prob = 0.66) and low similarity scores (Gb-S, Wb-S, Bb-S).
3. **Alternative Answers:** "Answer 1" and "Answer 2" are also incorrect (Rouge-1 = 0). They show high average probabilities (0.87, 0.91) and entropy scores, but lower maximum probabilities compared to the greedy answer.
4. **Metric Patterns:** For the incorrect answers, high Avg Prob and Avg Ent generally correlate with higher similarity scores (Gb-S, Wb-S, Bb-S). The "Greedy answer" leads in nearly all confidence and similarity metrics except Rouge-1.
### Interpretation
This figure serves as a clear case study of a failure mode in language models: generating plausible but factually incorrect outputs with high internal confidence.
* **What the data demonstrates:** The model (Gemma-7B) assigns very high probability to the token sequence "Rising Damp," despite it being the wrong answer to the factual question. The reference answer, while correct, receives low probability from the model, suggesting the model's internal knowledge or scoring is misaligned with factual truth for this instance.
* **Relationship between elements:** The table quantifies the model's misplaced confidence. Metrics like `Max Prob`, `Avg Prob`, and `Ask4-conf` are high for the wrong answer, while the correct answer scores low on these. The `Rouge-1` metric, which measures n-gram overlap with the reference, correctly identifies the greedy answer as wrong (score 0) and the reference as correct (score 1).
* **Notable anomaly:** The most striking anomaly is the `Ask4-conf` value of **1** for the "Greedy answer." This suggests that when asked to express confidence, the model was maximally confident in its incorrect response. This highlights a critical challenge in AI safety and reliability: a model can be both wrong and certain about it.
* **Underlying implication:** The figure argues for the necessity of metrics beyond simple probability or confidence scores (like `Rouge-1` or external fact-checking) to evaluate model outputs, especially for factual queries. It visually underscores the problem of "hallucination" or confident confabulation in LLMs.
</details>
Figure 12: An example in which Gemma-7B assigns a high confidence score to a wrong answer. Leonard Rossiter starred in "Rising Damp" as a landlord, not as a supermarket manager.
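The Ask4-conf column in these tables corresponds to asking the model to verbalize its own confidence (in Figure 12 it is maximal for the wrong greedy answer). As an illustration, where both the prompt wording and the parsing below are our own assumptions rather than the paper's exact protocol, such a verbalized-confidence baseline can be sketched as:

```python
import re

def ask4conf_prompt(question, answer):
    """Build a verbalized-confidence prompt: ask the LM to rate its own answer."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "How confident are you that the proposed answer is correct? "
        "Reply with a single number between 0 and 1."
    )

def parse_confidence(reply):
    """Extract the first number in [0, 1] from the LM's free-text reply, if any."""
    for token in re.findall(r"\d*\.?\d+", reply):
        value = float(token)
        if 0.0 <= value <= 1.0:
            return value
    return None

print(parse_confidence("I would say 0.85, roughly."))  # 0.85
```

As the figure shows, this self-reported number can be maximally high even when the answer is wrong, which is exactly why it is treated as one baseline among several rather than a reliable signal on its own.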
<details>
<summary>x14.png Details</summary>

### Visual Description
## Technical Document Screenshot: Language Model Answer Evaluation Example
### Overview
This image is a screenshot from a technical document or research paper. It presents an example of how a Language Model (LM), specifically LLaMA2-7B, evaluates different answers to a factual question. The image consists of two main parts: a textual example of a question and multiple candidate answers, followed by a data table comparing various evaluation metrics for those answers.
### Components/Axes
The image is structured into two primary regions:
1. **Header/Example Region (Top):** Contains the title and a question-answer example.
2. **Data Table Region (Bottom):** A table with numerical metrics.
**Header/Example Region Details:**
* **Title:** "An example that the LM identifies the better answer (LM: LLaMA2-7B)"
* **Question:** "Which musical featured the songs A Secretary is Not A Toy, and The Company Way?"
* **Reference Answer (Ref answer):** "How to Succeed in Business Without Really Trying" (displayed in blue text).
* **Greedy Answer:** "The Pajama Game" (displayed in red text).
* **Answer 1:** "How to Succeed In Business Without Really Trying"
* **Answer 2:** "The Company Way"
* **Icons:** Small robot icons are placed next to "Greedy answer", "Answer 1", and "Answer 2".
**Data Table Structure:**
The table has 10 columns and 4 data rows.
* **Column Headers (Metrics):**
1. Rouge-1
2. Max Prob
3. Avg Prob
4. Max Ent
5. Avg Ent
6. Gb-S
7. Wb-S
8. Bb-S
9. SU
10. Ask4-conf
* **Row Labels (Answer Types):**
1. Ref answer
2. Greedy answer
3. Answer 1
4. Answer 2
### Detailed Analysis / Content Details
**Transcription of the Question & Answer Example:**
* **Question:** Which musical featured the songs A Secretary is Not A Toy, and The Company Way?
* **Ref answer:** How to Succeed in Business Without Really Trying
* **Greedy answer:** The Pajama Game
* **Answer 1:** How to Succeed In Business Without Really Trying
* **Answer 2:** The Company Way
**Data Table Reconstruction:**
The following table lists the exact numerical values as they appear in the image. Empty cells indicate no data was provided for that metric-answer combination.
| Answer Type | Rouge-1 | Max Prob | Avg Prob | Max Ent | Avg Ent | Gb-S | Wb-S | Bb-S | SU | Ask4-conf |
|---------------|---------|----------|----------|---------|---------|-------|-------|-------|-------|-----------|
| **Ref answer** | 1 | 0.12 | 0.96 | 0.43 | 0.93 | 0.23 | 0.33 | | | |
| **Greedy answer** | 0 | 0.12 | 0.9 | 0.37 | 0.82 | 0.09 | 0.14 | 0.33 | 0.08 | 0 |
| **Answer 1** | 1 | 0.08 | 0.93 | 0.43 | 0.94 | 0.14 | 0.22 | | | |
| **Answer 2** | 0 | 0.01 | 0.78 | 0.37 | 0.6 | 0.08 | 0.13 | | | |
### Key Observations
1. **Answer Correctness:** The "Ref answer" and "Answer 1" are factually correct (the musical is *How to Succeed in Business Without Really Trying*). The "Greedy answer" (*The Pajama Game*) is incorrect. "Answer 2" (*The Company Way*) is a song title from the correct musical, not the musical itself.
2. **Metric Correlation with Correctness:**
* **Rouge-1:** Correct answers (Ref, Answer 1) score 1. Incorrect answers (Greedy, Answer 2) score 0.
* **Probability Metrics (Max/Avg Prob):** The correct "Ref answer" has the highest Avg Prob (0.96). The incorrect "Greedy answer" has a relatively high Max Prob (0.12, tied with Ref) but lower Avg Prob (0.9). "Answer 2" has the lowest probabilities.
* **Entropy Metrics (Max/Avg Ent):** Correct answers have higher entropy values (0.43/0.93-0.94) compared to incorrect ones (0.37/0.6-0.82), suggesting the model is less certain about incorrect outputs.
* **Specialized Scores (Gb-S, Wb-S, etc.):** The "Ref answer" scores highest on Gb-S (0.23) and Wb-S (0.33). The "Greedy answer" has unique, non-zero values for Bb-S (0.33), SU (0.08), and Ask4-conf (0), which are absent for other answers.
3. **Data Completeness:** The metrics Bb-S, SU, and Ask4-conf are only reported for the "Greedy answer".
### Interpretation
This image serves as a qualitative example to illustrate how a language model's internal metrics can be used to distinguish between better and worse answers, even when the model itself might generate an incorrect answer via greedy decoding.
* **What it demonstrates:** The table shows that the "better" answers (Ref and Answer 1) are characterized by high **Rouge-1** (lexical overlap with a ground truth), high **average probability** (the model assigns high likelihood to the sequence), and high **entropy** (indicating the model's confidence is distributed, not peaked on a single wrong token). The incorrect "Greedy answer" fails on Rouge-1 and has lower average probability and entropy.
* **Relationship between elements:** The example sets up a clear contrast. The textual part shows the *output* (answers), while the table quantifies the model's *internal state* when generating those outputs. It argues that metrics like Avg Prob and Avg Ent can serve as proxies for answer quality, potentially for reranking or filtering generated text.
* **Notable Anomalies/Insights:** The most striking insight is that the model's greedy search (which picks the most likely next token at each step) produced an incorrect answer ("The Pajama Game"), while other sampled answers (Answer 1) were correct. This highlights a known limitation of greedy decoding. Furthermore, the specialized scores (Gb-S, Wb-S, Bb-S, SU, Ask4-conf) appear to be diagnostic tools that provide different signals for different answer types, with the "Greedy answer" triggering unique values in the last three columns. This suggests these metrics might be designed to detect specific failure modes or characteristics of generated text.
</details>
Figure 13: An example in which LLaMA2-7B successfully identifies the better answer (by assigning it a higher score). Scores are normalized in [0,1], where a lower value indicates larger uncertainty.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Technical Document Extraction: Language Model Answer Evaluation Example
### Overview
The image is a technical figure demonstrating how a language model (LM) identifies a better answer to a factual question. It presents a question, a reference answer, three alternative model-generated answers, and a table of quantitative metrics comparing them. The example uses the LM "Gemma-7B".
### Components/Axes
The image is structured into two main sections:
1. **Upper Section (Question & Answers):** Contains the question, reference answer, and three alternative answers.
2. **Lower Section (Metrics Table):** A data table comparing the answers across multiple evaluation metrics.
**Textual Content - Upper Section:**
* **Title:** "An example that the LM identifies the better answer (LM: Gemma-7B)"
* **Question Box (Beige Background):**
* **Question:** "The behavior of sound in rooms and concert halls is a separate science. what is its name?"
* **Ref answer:** "Acoustics" (displayed in blue text)
* **Alternative Answers (each preceded by a small robot icon):**
* **Greedy answer:** "Acoustical" (displayed in red text)
* **Answer 1:** "Acoustical Engineering"
* **Answer 2:** "Acoustiics"
### Detailed Analysis
**Table Data Reconstruction:**
The following table lists the exact numerical values for each answer across all metrics.
| Answer | Rouge-1 | Max Prob | Avg Prob | Max Ent | Avg Ent | Gb-S | Wb-S | Bb-S | SU | Ask4-conf |
|---------------|---------|----------|----------|---------|---------|------|------|------|------|-----------|
| **Ref answer** | 1 | 0.45 | 0.96 | 0.86 | 0.88 | 0.64 | 0.73 | 0.93 | | |
| **Greedy answer** | 0 | 0.41 | 0.95 | 0.79 | 0.84 | 0.50 | 0.51 | 0.29 | 0.28 | 1 |
| **Answer 1** | 0 | 0.28 | 0.94 | 0.79 | 0.83 | 0.39 | 0.44 | 0.33 | | |
| **Answer 2** | 0 | 0.04 | 0.86 | 0.69 | 0.80 | 0.16 | 0.25 | 0.39 | | |
**Note on Empty Cells:** The cells for "SU" and "Ask4-conf" are empty for "Ref answer", "Answer 1", and "Answer 2". Only the "Greedy answer" has values in these columns.
### Key Observations
1. **Reference Answer Superiority:** The "Ref answer" ("Acoustics") scores a perfect 1.0 on Rouge-1, indicating an exact lexical match with the expected answer. It also has the highest scores in Max Ent (0.86), Avg Ent (0.88), Gb-S (0.64), Wb-S (0.73), and Bb-S (0.93).
2. **Greedy Answer Characteristics:** The "Greedy answer" ("Acoustical") has a Rouge-1 score of 0, meaning it does not match the reference word. It is the only answer with values for "SU" (0.28) and "Ask4-conf" (1). Its scores are generally lower than the reference answer but higher than the other alternatives in most probability and entropy metrics.
3. **Performance of Other Answers:** "Answer 1" and "Answer 2" have Rouge-1 scores of 0. "Answer 2" ("Acoustiics") has notably low scores, particularly in Max Prob (0.04) and Gb-S (0.16), suggesting the model assigns it very low probability and confidence.
4. **Metric Trends:** Across all answers, the "Avg Prob" (Average Probability) metric remains relatively high (0.86 to 0.96), while metrics like "Gb-S", "Wb-S", and "Bb-S" show greater variance, effectively differentiating the quality of the answers.
### Interpretation
This figure illustrates a method for evaluating and comparing the quality of different text generations from a language model. The data suggests that the model (Gemma-7B) can internally distinguish the correct reference answer ("Acoustics") from incorrect or less precise alternatives ("Acoustical", "Acoustical Engineering", "Acoustiics").
The **key takeaway** is that while the model's "greedy" decoding (which selects the single most probable next token) produced a suboptimal answer ("Acoustical"), the underlying probability and entropy metrics (Max Prob, Avg Ent, etc.) assigned higher values to the true reference answer. This discrepancy highlights a potential limitation of simple greedy decoding and supports the use of more sophisticated decoding strategies or scoring metrics (like those in the table) to identify the best response from a set of candidates. The "Ask4-conf" value of 1 for the greedy answer might indicate a specific confidence query or flag associated with that generation path. The empty cells for other answers in the last two columns suggest those metrics were not computed or are not applicable to them in this evaluation framework.
</details>
Figure 14: An example in which Gemma-7B successfully identifies the better answer (by assigning it a higher score). Scores are normalized in [0,1], where a lower value indicates larger uncertainty.
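Throughout these examples, Rouge-1 serves as the correctness label: it measures unigram overlap between a generated answer and the reference, so an exact match scores 1 and a disjoint answer scores 0. A minimal sketch using plain whitespace tokenization (real ROUGE implementations normalize and tokenize more carefully, so exact values can differ slightly):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between a candidate and a reference string."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared unigram at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("Bo Donaldson", "Bo Donaldson & The Heywoods"))  # partial overlap
print(rouge1_f1("Paper Lace", "Bo Donaldson & The Heywoods"))    # no overlap
```

This explains the partial score for "Bo Donaldson" in Figure 11: it recovers some but not all of the reference unigrams, giving perfect precision but reduced recall.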
<details>
<summary>x16.png Details</summary>

### Visual Description
## Screenshot: Language Model Failure Example
### Overview
The image is a figure, likely from a research paper or technical report, illustrating an instance where a Large Language Model (LLM) fails to answer a factual question correctly. It presents a specific question, the correct reference answer, the model's incorrect "greedy" answer, two alternative incorrect answers, and a table of associated confidence and evaluation metrics.
### Components/Axes
The image is structured in two main sections within a rounded-corner frame:
1. **Top Section (Question & Answers):** Contains the question, reference answer, and three model-generated answers.
2. **Bottom Section (Metrics Table):** A data table with 10 columns and 4 rows of data.
**Textual Content (Top Section):**
* **Title:** "An example that the LM does not know the answer (LM: LLaMA2-7B)"
* **Question:** "Who played Sandy Richardson in the British tv series 'Crossroads'?"
* **Ref answer:** "Roger Tonge" (displayed in blue text)
* **Greedy answer:** "Noel Clarke" (displayed in red text)
* **Answer 1:** "Mike Pratt"
* **Answer 2:** "Lucy Carless"
**Metrics Table Structure:**
* **Columns (Headers):** Rouge-1, Max Prob, Avg Prob, Max Ent, Avg Ent, Gb-S, Wb-S, Bb-S, SU, Ask4-conf
* **Rows (Labels):** Ref answer, Greedy answer, Answer 1, Answer 2
### Detailed Analysis
**Table Data Transcription:**
The table contains numerical values for various metrics associated with each answer. Empty cells are denoted by a blank space.
| Row Label | Rouge-1 | Max Prob | Avg Prob | Max Ent | Avg Ent | Gb-S | Wb-S | Bb-S | SU | Ask4-conf |
|----------------|---------|----------|----------|---------|---------|------|------|------|----|-----------|
| **Ref answer** | 1 | 0.01 | 0.78 | 0.28 | 0.71 | 0.08 | 0.09 | | | |
| **Greedy answer** | 0 | 0.16 | 0.89 | 0.28 | 0.75 | 0.08 | 0.09 | 0.23 | 0 | 0 |
| **Answer 1** | 0 | 0.01 | 0.82 | 0.28 | 0.73 | 0.08 | 0.09 | | | |
| **Answer 2** | 0 | 0 | 0.71 | 0.28 | 0.63 | 0.08 | 0.08 | | | |
**Key Metric Observations:**
* **Rouge-1:** Only the reference answer has a score of 1, indicating a perfect match with the ground truth. All model answers score 0.
* **Max Prob (Maximum Probability):** The "Greedy answer" has the highest value (0.16), suggesting the model assigned its highest token probability to this incorrect sequence. The reference answer has a very low max probability (0.01).
* **Avg Prob (Average Probability):** The "Greedy answer" also has the highest average probability (0.89), indicating the model was generally confident in its tokens for this incorrect answer.
* **Entropy (Max Ent, Avg Ent):** Entropy values are relatively consistent across answers, with the reference answer having the lowest average entropy (0.71), suggesting slightly less uncertainty in its token generation compared to the incorrect answers.
* **Specialized Metrics (Gb-S, Wb-S, Bb-S, SU, Ask4-conf):** These appear to be domain-specific confidence or similarity scores. Notably, only the "Greedy answer" has values for Bb-S (0.23), SU (0), and Ask4-conf (0).
### Key Observations
1. **Model Confidence vs. Correctness:** The model's "Greedy answer" (its most likely output) is incorrect. Crucially, this incorrect answer is generated with higher internal probability metrics (Max Prob, Avg Prob) than the correct reference answer.
2. **Complete Failure on Factual Recall:** All three model-generated answers are factually incorrect, as shown by the Rouge-1 score of 0.
3. **Metric Discrepancy:** The table highlights a disconnect between the model's internal confidence signals (high probabilities) and factual accuracy. The model is confidently wrong.
4. **Data Completeness:** The "Greedy answer" row is the only one populated with values for all 10 metrics, suggesting it is the primary focus of the analysis.
### Interpretation
This figure serves as a diagnostic case study in LLM failure modes, specifically for factual recall. It demonstrates that a model (LLaMA2-7B in this instance) can generate an incorrect answer with high internal confidence, as measured by token probabilities. The high `Avg Prob` (0.89) for the wrong answer versus the low `Avg Prob` (0.78) for the correct one is a critical finding. It suggests that the model's probability distribution is not a reliable indicator of factual correctness for this out-of-knowledge question.
The inclusion of specialized metrics like `Bb-S`, `SU`, and `Ask4-conf` (likely standing for something like "Ask for confidence") only for the greedy answer implies these are being evaluated as potential signals for detecting such failures. Their low or zero values here might indicate they are not triggering a "low confidence" flag, which is itself a problem.
In essence, the image argues that relying solely on a model's greedy decoding output or its raw probability scores is insufficient for guaranteeing factual accuracy, especially when the model lacks the knowledge. It underscores the need for external verification, retrieval-augmented generation, or more sophisticated uncertainty quantification methods.
</details>
Figure 15: An example in which LLaMA2-7B does not know the true answer. Scores are normalized in [0,1], where a lower value indicates larger uncertainty. The LM does not know the true answer and attempts to guess it, generating different names with low confidence scores; the score remains low even when the LM is presented with the true answer.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Technical Document Example: Language Model Uncertainty Estimation Failure
### Overview
The image is a technical slide or figure titled "An example of the failure in estimating the uncertainty (LM: Gemma-7B)". It presents a specific question posed to a language model (Gemma-7B), the reference answer, the model's greedy (most likely) answer, and two alternative sampled answers. Below this, a table provides various quantitative metrics for each of these four answer candidates. The purpose is to illustrate a case where the model's internal uncertainty metrics do not align with the factual correctness of its outputs.
### Components/Axes
The image is structured in two main sections:
1. **Header/Question Section (Top, within a beige rounded rectangle):**
* **Question:** "What is the name of the colliery in the 1939 film 'The Stars Look Down'?"
* **Ref answer:** "Neptune Colliery" (displayed in blue text).
* **Greedy answer:** "The Black Diamond" (displayed in red text, preceded by a small robot icon).
* **Answer 1:** "Oakwood Colliery" (preceded by a small robot icon).
* **Answer 2:** "Northmoor Colliery" (preceded by a small robot icon).
2. **Data Table (Bottom):**
* **Rows:** Correspond to the four answer candidates: "Ref answer", "Greedy answer", "Answer 1", "Answer 2".
* **Columns (Metrics):** The table has 10 columns with the following headers:
* `Rouge-1`
* `Max Prob`
* `Avg Prob`
* `Max Ent`
* `Avg Ent`
* `Gb-S`
* `Wb-S`
* `Bb-S`
* `SU`
* `Ask4-conf`
### Detailed Analysis
The table contains the following numerical data for each answer candidate:
| Answer Candidate | Rouge-1 | Max Prob | Avg Prob | Max Ent | Avg Ent | Gb-S | Wb-S | Bb-S | SU | Ask4-conf |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Ref answer** | 1 | 0 | 0.62 | 0.19 | 0.65 | 0.10 | 0.13 | 0.23 | | |
| **Greedy answer** | 0 | 0.02 | 0.72 | 0.18 | 0.20 | 0.10 | 0.10 | 0.12 | 0 | 1 |
| **Answer 1** | 0 | 0 | 0.73 | 0.18 | 0.57 | 0.10 | 0.11 | 0.18 | | |
| **Answer 2** | 0 | 0 | 0.73 | 0.18 | 0.53 | 0.10 | 0.12 | 0.19 | | |
**Note:** The cells for `SU` and `Ask4-conf` are empty for the "Ref answer", "Answer 1", and "Answer 2" rows.
### Key Observations
1. **Correctness vs. Model Confidence:** The reference answer ("Neptune Colliery") is correct (Rouge-1 = 1). The model's greedy answer ("The Black Diamond") is incorrect (Rouge-1 = 0).
2. **Probability Metrics:** The incorrect greedy answer has a higher `Avg Prob` (0.72) than the correct reference answer (0.62). The two other incorrect sampled answers ("Answer 1" and "Answer 2") have the highest `Avg Prob` (0.73).
3. **Entropy Metrics:** The correct reference answer has the highest `Avg Ent` (0.65), indicating higher model uncertainty for the correct token sequence. The incorrect greedy answer has a much lower `Avg Ent` (0.20), suggesting the model is more certain about its incorrect output.
4. **Special Metrics:** Only the greedy answer has values for `SU` (0) and `Ask4-conf` (1). The `Ask4-conf` value of 1 suggests the model would express high confidence if asked about this answer.
5. **Similarity Scores (Gb-S, Wb-S, Bb-S):** These scores are relatively low and similar across all answers, with the reference answer having slightly higher values in `Wb-S` and `Bb-S`.
### Interpretation
This example demonstrates a critical failure mode in language model uncertainty estimation. The model (Gemma-7B) exhibits **overconfidence in an incorrect answer**.
* **The Core Problem:** The model assigns higher average probability (`Avg Prob`) and lower average entropy (`Avg Ent`) to its incorrect greedy answer compared to the correct reference answer. This is the opposite of the desired behavior, where correct answers should be associated with higher model confidence (higher probability, lower entropy).
* **Implication for Reliability:** This misalignment means that using the model's own probability or entropy scores as a proxy for answer correctness or uncertainty is unreliable in this case. A user or system relying on these metrics would be misled into trusting the wrong answer.
* **Broader Significance:** The slide highlights a key challenge in AI safety and reliability: ensuring that a model's internal confidence signals are well-calibrated with factual accuracy. When models are confidently wrong, it becomes difficult to build systems that can automatically flag uncertain or potentially incorrect outputs for human review. This failure underscores the need for better uncertainty quantification methods that are robust to such cases.
</details>
Figure 16: An example in which Gemma-7B does not know the true answer. Scores are normalized in [0,1], where a lower value indicates larger uncertainty.
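The Max Ent and Avg Ent columns in these tables are entropy-based measures of token-level uncertainty. As a sketch of the underlying quantity, before the normalization to [0,1] used in the figures (where the direction is flipped so that lower values indicate larger uncertainty), the average predictive entropy over an answer's token positions can be computed as:

```python
import math

def avg_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over an answer's per-token
    predictive distributions; higher raw entropy means more uncertainty."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return sum(entropy(d) for d in token_distributions) / len(token_distributions)

peaked = [[0.97, 0.01, 0.01, 0.01]] * 3  # probability concentrated on one token
flat = [[0.25, 0.25, 0.25, 0.25]] * 3    # probability spread evenly
print(avg_token_entropy(peaked) < avg_token_entropy(flat))  # True
```

As Figures 15 and 16 illustrate, this signal alone can mislead: a model may decode a wrong answer with low entropy (high apparent confidence) when it simply lacks the relevant knowledge.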