# Cycles of Thought: Measuring LLM Confidence through Stable Explanations
**Authors**:
- Evan Becker (Department of Computer Science)
- Stefano Soatto (Department of Computer Science)
(October 16, 2025)
Abstract
In many critical machine learning (ML) applications it is essential for a model to indicate when it is uncertain about a prediction. While large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, their overconfidence in incorrect responses is still a well-documented failure mode. Traditional methods for ML uncertainty quantification can be difficult to directly adapt to LLMs due to the computational cost of implementation and closed-source nature of many models. A variety of black-box methods have recently been proposed, but these often rely on heuristics such as self-verbalized confidence. We instead propose a framework for measuring an LLM’s uncertainty with respect to the distribution of generated explanations for an answer. While utilizing explanations is not a new idea in and of itself, by interpreting each possible model+explanation pair as a test-time classifier we can calculate a posterior answer distribution over the most likely of these classifiers. We demonstrate how a specific instance of this framework using explanation entailment as our classifier likelihood improves confidence score metrics (in particular AURC and AUROC) over baselines across five different datasets. We believe these results indicate that our framework is a promising way of quantifying uncertainty in LLMs.
1 Introduction
Large language models (LLMs) are known to at times confidently provide wrong answers, which can greatly mislead non-expert users of the model [46, 7]. In some cases an LLM may even ‘hallucinate’ facts altogether [45, 50]. Although scaling generally improves factual accuracy, past work has shown that even the largest models can give incorrect answers to certain types of questions [29].
To prevent these misleading scenarios, one intuitive approach is to have the model also report its confidence (or uncertainty) in the accuracy of its own response. This task, known as uncertainty quantification, has a vast associated literature [1, 15]. In its most naive form, this can entail taking the softmax of prediction logits to calculate a ‘distribution’ over answers. However in most cases there is no guarantee that this metric should correspond to the actual probability of correctness on a new datum. Empirically this mismatch has been demonstrated for LLM token logits [26, 2].
One might instead hope that by probing the model (e.g. through its weights or activations) one could infer a measure of confidence that somehow aligns with our expectations. However, full access to a large language model is often infeasible due to a combination of proprietary restrictions and computational expense. Recently a range of ‘black-box’ approaches have been proposed that avoid the need for access to internal model information [24, 46, 36]. These approaches typically rely on custom prompting strategies to elicit self-verbalized (linguistic) confidence or generate multiple variations of a response (consistency). While empirically promising, these methods are heuristic and still return overconfident responses in many cases.
We reason that the main issue with existing uncertainty quantification methods for LLMs stems from the underlying inductive assumption that test and training data are sampled from the same distribution. Unfortunately, this is rarely the case, meaning any uncertainty quantification strategy that is well-calibrated on one dataset is not guaranteed to be calibrated on new test data. However, an LLM offers a unique opportunity to adjust its decision boundary transductively at test-time via intermediate generated text (explanations). While inserting random text would likely lead to a high-entropy decision distribution, adding relevant facts or logical step-by-step reasoning serves to ‘stabilize’ the sampled answers around an isolated minimum. Indeed, prompts inducing chain of thought (CoT) reasoning have already been shown to improve model accuracy in this manner [44]. However, more recent work has shown that even CoT explanations can be biased and may not correspond with the correct answer [41]. If we could somehow distinguish between ‘stable’ and ‘unstable’ explanations then we would know to what extent to trust their corresponding answer distributions.
In this work we propose a method for generating confidence scores from the distribution of LLM-generated explanations for an answer. This method, which we call stable explanations confidence, can be thought of as computing the posterior predictive distribution by transductive marginalization over test-time classifiers. We illustrate the usefulness of these scores on two common uncertainty quantification tasks: calibration, in which we measure how close confidence is to empirical accuracy, and selective uncertainty, in which we determine how well the scores can discriminate between correct and incorrect predictions. We compare to other recently proposed methods across five datasets of different scope and complexity (CommonsenseQA, TruthfulQA, MedQA, MMLU Professional Law, MMLU Conceptual Physics) using two popular LLMs (GPT-3.5 [5] and GPT4 [2]). We find that our method on average outperforms baselines on the selective uncertainty task (measured via AUROC and AURC), particularly for more complex question-answering problems.
2 Related Work
In this section we first summarize the uncertainty quantification problem in machine learning. We then highlight key challenges in the natural language generation setting and the ‘confidence gap’ of existing LLM models. Lastly we discuss existing approaches for LLM uncertainty quantification and methods for their evaluation.
2.1 Uncertainty Quantification in Machine Learning
Defining and reasoning about uncertainty has been a long-standing problem in different disciplines including philosophy, statistics, and economics. Many formal representations with unique properties have been proposed (e.g. Dempster-Shafer belief functions, ranking functions, etc. [17]), but in the machine learning setting uncertainty quantification typically relies on the standard language of probability measures. For a classification task we can think of the sequential training data-label pairs $\mathcal{D}:=\{(x_{i},y_{i})\}_{i=1}^{N}$ as the model’s source of knowledge about the world. Given some test datum $x_{N+1}$, we would like the model to both make a prediction $\hat{y}_{N+1}$ and provide a ‘useful’ confidence score $r_{N+1}\in[0,1]$. Useful confidence scores allow a model to express its belief in the accuracy of a prediction, and are called well-calibrated if, on average, predictions made with confidence $r$ are correct close to $100\cdot r\%$ of the time. If the classification task also specifies cases for which it is better to return no prediction than a wrong one, we can imagine creating a selection rule that uses confidence scores to determine whether to trust the classifier’s prediction. We formalize these two related goals when discussing evaluation metrics in Section 4.1.
Uncertainty quantification methods differ from one another based on their assumptions about where uncertainty is coming from. Sources of uncertainty are traditionally categorized into two broad classes: epistemic uncertainty arising from the agent’s incomplete knowledge of the world, and aleatoric uncertainty inherent to the data generating process (e.g. the flip of a coin). In reality, definitions vary among the machine learning community [4] and most methods do not fit neatly into either category. In this work we discuss a few of the most common methods, organized by the underlying assumptions placed on the test data. We make this distinction because without such a fundamental assumption it is impossible to know anything about the test distribution from training data. For a full discussion and taxonomy of the numerous uncertainty quantification methods in machine learning, we refer the reader to survey papers such as [1, 15].
Related Training and Test Worlds.
Most uncertainty quantification methods rely on the fundamental assumption that the test data comes from the same distribution as the training set. Under this type of assumption, Bayesian approaches such as Bayesian Neural Networks (BNNs) are popular. BNNs measure epistemic uncertainty through a posterior on the learned weights, which can be reduced as more data is received [33, 23]. Another popular method is conformal prediction, which introduces a somewhat dual notion, the conformal set. Under a slightly weaker exchangeability assumption (i.e. that the joint distribution remains the same under permutations of the training and test data), the conformal set of predictions is guaranteed to contain the true label with error probability less than some $\epsilon$ [35]. Weaker predictive models result in larger conformal sets, so set size can be taken as an indicator of higher model uncertainty. Other methods include looking at the robustness of predictions under semantic-preserving transformations of the input, as mentioned in [15].
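To make the conformal-set idea concrete, here is a minimal split-conformal sketch for classification. This is a hedged illustration only, not the specific construction of [35]: the choice of nonconformity score ($1 - p(\text{true label})$) and the held-out calibration set are assumptions of this example.

```python
import math

def conformal_set(cal_scores, test_probs, epsilon=0.1):
    """Split conformal prediction sketch.

    cal_scores: nonconformity scores 1 - p(true label) computed on a
        held-out calibration set (assumed exchangeable with the test point).
    test_probs: the model's predictive distribution over labels for one
        test input.
    Returns the set of labels whose nonconformity falls at or below the
    conservative (1 - epsilon) empirical quantile of the calibration scores.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - epsilon))  # finite-sample corrected rank
    qhat = sorted(cal_scores)[min(k, n) - 1]
    return [label for label, p in enumerate(test_probs) if 1 - p <= qhat]
```

A confident model (one label with high probability) yields a small set, while a weak model that spreads probability across labels yields a large set, which is why set size can serve as a proxy for uncertainty.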
Different Training and Test Worlds.
Small and large differences between training and test distributions are typically denoted as distribution shift and out-of-distribution, respectively [47]. In this setting, methods like prior networks attempt to capture the specific notion of this distributional uncertainty through an additional prior over predictive distributions, trained explicitly with a dedicated loss objective [31].
2.2 Uncertainty Quantification in LLMs
Recently much attention has been devoted to measuring uncertainty specifically in LLMs [16, 20]. Since LLMs are generative models, uncertainty may be measured with respect to an infinite set of text sequences as opposed to a fixed number of classification labels [4]. Many works, however, use multiple choice question answering tasks to evaluate LLMs using standard classification methodologies [43, 24], and we will follow a similar approach in this work. Issues with using token logits directly to compute confidence are well known. Recent works [2, 24, 38] show that larger models are typically better calibrated on multiple choice datasets than smaller ones, but are still sensitive to question reformulations as well as typical RLHF training strategies. Another recent work [48] notes that language models fail to identify unanswerable questions at a higher rate than humans.
At a high level, existing techniques for LLM confidence elicitation can be classified as either white-box, requiring access to internal model weights and token probabilities, or black-box, using only samples from the model [16]. We choose to summarize inference time interventions below, as training time interventions are often computationally expensive and require strict inductive assumptions.
White-box Methods. Access to the last activation layer of the LLM (token logits) admits calculating token and token-sequence probabilities via the softmax function. One can incorporate text sequence probabilities to implement conformal prediction methods [27], or adjust them based on the semantic importance of individual tokens to improve calibration [13]. Surrogate models can also serve as an effective substitute when access to the original model is restricted [36]. Internal activations can also be observed to determine whether certain feature directions are more or less truthful [3, 6].
Black-box Methods. Black-box confidence typically uses one or both of the following approaches: Sample+aggregate methods involve analyzing the distributions of multiple responses sampled from the model [46]. Responses can be generated in a variety of ways, such as using chain-of-thought prompting [43], asking for multiple answers in a single response [40], or perturbing the question in-between samples [28]. Confidence can be found by observing the frequency with which answers occur, or by averaging over other metrics [8]. Self-evaluation methods use customized prompts in order for the model to generate its own confidence estimates in natural language [24]. These methods can also be augmented with chain-of-thought or other more complex reasoning steps [11]. Much effort has been put into analyzing how changes in prompt (e.g. by including few-shot examples) affects these confidences [52, 51].
3 Stable Explanations
Given a question, we would like to assign a confidence value to an answer based on how plausible its associated explanations are. Intuitively, humans are confident in an answer when likely explanations exist for it and no other answers have reasonable explanations. However, the space of explanations (variable-length token sequences) is infinite and hard to work with directly. To overcome this, we first approximate this distribution by sampling a set of explanations from the LLM conditioned on the question, and then reweight them based on their logical consistency with the question description. Afterwards we compute the degree to which the explanations support each answer. We can view these two steps as estimating the conditional likelihood of the explanation given the question, and the conditional answer distribution of the test-time model parameterized by this explanation. These two components allow us to compute a posterior predictive distribution in a Bayesian fashion. We formalize each step in the following subsections, and summarize the complete method in Algorithm 1.
Input: LLM $\phi$ , question $q$ and selected answer $a_{i}∈\mathcal{A}$ , explanation sample size $N$
Output: Confidence estimate $\hat{p}(a_{i}|q)$
for $n=1... N$ do
$e_{n}\sim\phi(\text{prompt}_{explain}(q))$
// sample explanations
$\rho_{n}←\phi(\text{prompt}_{entail}(q,e_{n}))$
// compute probability that $q\models e_{n}$
end for
$z←\sum_{n=1}^{N}\rho_{n}$
$\hat{p}(a_{i}|q)←\sum_{n=1}^{N}\frac{\rho_{n}}{z}\text{softmax}(\phi(q,e_{n}))_{i}$
// marginalize over explanations
return $\hat{p}(a_{i}|q)$
Algorithm 1 Stable Explanation Confidence Calculation
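Algorithm 1 can be sketched in Python as follows. The `llm` object and its `sample`, `entailment_prob`, and `answer_logits` methods are hypothetical stand-ins for the prompted calls $\phi(\text{prompt}_{explain}(q))$, $\phi(\text{prompt}_{entail}(q,e_{n}))$, and $\phi(q,e_{n})$, and the prompt string is illustrative only:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [v / z for v in exps]

def stable_explanation_confidence(llm, question, answer_index, n_samples=5):
    """Algorithm 1: estimate p(a_i | q) by marginalizing the test-time
    answer distribution over sampled explanations, each weighted by the
    probability that the question entails it."""
    weights, answer_probs = [], []
    for _ in range(n_samples):
        # e_n ~ phi(prompt_explain(q)): sample an explanation
        explanation = llm.sample(f"Explain the answer step by step: {question}")
        # rho_n = phi(prompt_entail(q, e_n)): probability that q |= e_n
        rho = llm.entailment_prob(question, explanation)
        # softmax(phi(q, e_n))_i: answer distribution of the
        # explanation-parameterized test-time classifier
        probs = softmax(llm.answer_logits(question, explanation))
        weights.append(rho)
        answer_probs.append(probs[answer_index])
    z = sum(weights)  # normalizer over entailment weights
    return sum(w * p for w, p in zip(weights, answer_probs)) / z
```

Any LLM client exposing these three operations (sampling, an entailment-probability prompt, and answer logits) could be dropped in for `llm`.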
Preliminaries.
Consider a multiple choice question $q:=\{x_{1},...,x_{t}\}=x^{t}$ consisting of a sequence of tokens in some alphabet $x_{j}\in\mathcal{A}$, and a set of possible answers $a\in S\subseteq\mathcal{A}$, which are also token sequences over the same alphabet. We designate $\phi$ as an LLM, which takes any variable-length token sequence as input and outputs a token logit vector of size $|\mathcal{A}|$. We use $\phi(s_{1},s_{2})$ to denote the direct concatenation of two token sequences in the LLM input, and $\phi(\text{prompt}(s))$ to denote adding prompt instructions to the input. Lastly, $s\sim\phi$ will be used to denote sampling a token sequence from the LLM.
3.1 Answer Likelihood Conditioned on Explanations
In its default configuration, providing a question to an LLM $\phi$ without further input can be used to find an answer:
$$
\displaystyle~\underset{S}{\rm argmax}~{\phi(q,\{~\})}=a \tag{1}
$$
One can also naively compute a ‘probability distribution’ over possible answers by taking the softmax of token logits produced by the model. We will denote this calculation as
$$
\displaystyle p_{\phi}(a|q):=\text{softmax}(\phi(q,\{~\}))_{i}, \tag{2}
$$
where $i$ denotes the logit index of $a$ . However, these default token probabilities have been shown to be miscalibrated and sensitive to variations in the input [24, 40]. Next, we formally say that explanations, like questions, are also variable length sequences of tokens $e∈\mathcal{A}^{\tau}$ located between question and answer. If the model generates these explanations (like in the chain-of-thought reasoning paradigm [44]) then the sequences can be thought of as a possible trajectory from the question to an answer. While the set of possible trajectories is infinite, we can group explanations into equivalence classes by noting that two semantically identical explanations must support the same answers [30, 37]. This notion leads us to the following idea: characterize the distribution of explanations by looking at the new answers they lead to.
$$
\displaystyle~\underset{S}{\rm argmax}~{\phi(q,e)}=a^{\prime} \tag{3}
$$
This idea is related to the semantic entropy method of [26], but here we use the next token distribution $p_{\phi}(a|q,e)$ instead of a pairwise explanation similarity to ‘cluster’ explanations. If we can enumerate all likely explanations, we can calculate the posterior answer probability as follows
$$
\displaystyle\hat{p}(a|q)=\sum_{e}p_{\phi}(a|e,q)p(e|q) \tag{4}
$$
A key detail omitted so far is how to efficiently approximate the distribution of all ‘stable’ explanations. We will see in the following subsection that this can be achieved using only the LLM $\phi$ .
3.2 Determining Likely Explanations
A naive method for estimating $\hat{p}(e|q)$ would be to sample explanations using a modified prompt (e.g. using a CoT ‘think step-by-step’ approach). Indeed, a number of consistency-based question-answering methods work by sampling and then aggregating explanations and answers in this manner [43, 8]. However, due to the way LLMs are trained, this distribution does not necessarily represent the probability that an explanation actually explains the data in the question [49, 41]. To combat this, we enforce logical consistency by checking the entailment probability of our sampled explanations ( $q\models e$ ), which can be approximated by using the LLM and a modified prompt $\phi_{entail}(q,e)$ [34]. We then reweight sampled explanations using this entailment probability:
$$
\hat{p}(e|q):=\frac{\phi_{ent.}(q,e)}{\sum_{e^{\prime}\in E}\phi_{ent.}(q,e^{\prime})} \tag{5}
$$
We reason that enforcing logical structure prevents trusting explanations that ‘overfit’ to the test datum. For example, while an explanation such as ‘the answer is always (a)’ is syntactically correct and may result in a confidently correct answer for our test question, it would prove a useless classifier on previous training data. While we use entailment probability in our main results, an exploration of alternative explanation plausibility calculations can be found in Section B.4.
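The normalization in Eq. (5) is simple to state in code (a minimal sketch; the uniform fallback when every entailment score is zero is our own assumption, not specified above):

```python
def reweight_explanations(entail_scores):
    """Eq. (5): turn raw entailment probabilities phi_ent(q, e) for the
    sampled explanations into a normalized distribution p(e|q)."""
    z = sum(entail_scores)
    if z == 0:
        # Assumed fallback: no explanation is entailed at all,
        # so return a uniform distribution.
        return [1.0 / len(entail_scores)] * len(entail_scores)
    return [s / z for s in entail_scores]
```

An explanation like ‘the answer is always (a)’ would receive a low entailment score and thus a small weight: `reweight_explanations([0.9, 0.8, 0.1])` downweights the third explanation to about 0.056.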
4 Experiments
To gain insight into the usefulness of LLM-sampled explanations we first examine differences in distributions of explanations conditioned on correct vs. incorrect answers (see Figure 1) and find that explanation entailment (Section 3.2) can help distinguish between the two. We then conduct a series of experiments to compare our proposed stable explanation confidence (Algorithm 1) with existing approaches across a set of five benchmark datasets and discuss our findings below.
4.1 Setup
Evaluation Method.
How do we know whether a proposed confidence metric is useful? In line with previous works [24, 46, 36, 40], there are typically two tasks on which uncertainty metrics are evaluated. The first is confidence calibration, where the goal is to produce confidence scores approximating the empirical probability that the model answers the question correctly. Expected calibration error (ECE) [32] attempts to estimate this using differences between the average confidence and accuracy for groups of similarly scored answers; however, ECE can be misleading (see Section 5). We still include this metric in our reports for ease of comparison with previous work. The second, related task is typically called selective uncertainty (also known as failure prediction). Here the goal is to create a binary classifier using confidence scores that predicts when the model should return ‘I don’t know’ instead of its original prediction. A variety of classifier metrics can be used, depending on how one chooses to penalize false positive (overconfident) and false negative (underconfident) predictions. In this work we use two of the most common metrics: area under the receiver operating characteristic curve (AUROC) [18], and area under the risk-coverage curve (AURC) [12]. Uninformative (i.e. completely random) confidence scores have a worst-case AUROC of $0.5$ and a worst-case AURC equal to the average model accuracy. The best possible value for both AUROC and AURC is $1.0$. We include formal definitions for each of these metrics in Appendix A.
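A minimal sketch of the two selective-uncertainty metrics follows. This is illustrative only (formal definitions are in Appendix A): AUROC is computed as the pairwise ranking probability, and the AURC-style quantity here averages selective accuracy over coverage levels, so that random scores land roughly at the average accuracy and the best value is 1, matching the conventions stated above.

```python
def auroc(confidences, correct):
    """AUROC as a ranking statistic: the probability that a correct
    prediction is assigned higher confidence than an incorrect one
    (ties count one half). Assumes both classes are present."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_coverage_auc(confidences, correct):
    """Risk-coverage style metric: sort predictions by descending
    confidence and average the running (selective) accuracy over all
    coverage levels. Uninformative scores land near overall accuracy;
    a good ranking pushes the value toward 1."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    hits, running = 0, 0.0
    for k, i in enumerate(order, start=1):
        hits += 1 if correct[i] else 0
        running += hits / k
    return running / len(order)
```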
Datasets and Models.
We evaluate our method using five standard question answering datasets covering a variety of reasoning tasks: CommonsenseQA (CSQA) [39], TruthfulQA [29], MedQA [21], MMLU Professional Law, and MMLU Conceptual Physics [19]. Besides covering a range of topics, these datasets also vary widely in complexity. As seen in Table 1, the average length of an MMLU Law question is almost ten times that of the average CSQA question. Shorter questions typically resemble more traditional classification tasks (e.g. ‘Something that has a long and sharp blade is a?’ from CSQA), while longer questions typically include descriptions of a specific scenario that require more complex reasoning. We test both methods and baselines on snapshots of two state-of-the-art models, GPT-3.5-turbo [5] and GPT-4-turbo [2]. Further data and model details can be found in Appendix B.
Compared Metrics.
We use four different baselines for comparison purposes. Token probabilities for each answer can be produced by taking the softmax over the model’s logit vector and are one of the most commonly used confidence metrics during model evaluation [2, 7]. Linguistic and Top-k methods both ask the model for a verbalized confidence estimate directly, the former prompting the model for a single answer and confidence estimate while the latter asks for the $k$-best guesses and associated confidences [40, 36]. Lastly, the self-consistency method samples multiple responses from the model and approximates confidence via the relative frequency of parsed answers. Here we use a particular variant of this method, CoT-Consistency [43], which uses a zero-shot chain-of-thought prompt to generate responses, and which has been shown to outperform the vanilla method [46]. We use prompts similar to those selected in previous work for comparison purposes, the details of which can be found in Section B.1.
| Dataset | Avg. Question Length | GPT-3.5 Accuracy | GPT-4 Accuracy |
| --- | --- | --- | --- |
| CSQA | 151 | 0.79 | 0.84 |
| TruthQA | 329 | 0.54 | 0.85 |
| MedQA | 916 | 0.59 | 0.82 |
| MMLU Law | 1233 | 0.46 | 0.64 |
| MMLU Physics | 172 | 0.57 | 0.92 |
Table 1: Average question length and accuracy for each of the datasets tested in this work. One can observe a weak correlation between question length and difficulty, as typically longer questions describe more complex scenarios and logical structure.
4.2 Likely Explanations Not Always Correct
We first illustrate how explanation likelihood, as measured via conditional token log probability, does not always correspond with the correctness of the supported answer. These results align with previous findings differentiating syntactically vs. semantically correct model responses [29, 26], and help motivate the use of entailment probability in our method. First recall that the length-normalized conditional log-likelihood for sequence $x^{t}$ given sequence $s$ is defined as
$$
\displaystyle LL(x^{t}|s):=\frac{1}{t}\sum_{i=1}^{t}\log(P_{\phi}(x_{i}|s,x_{1},x_{2},\dots,x_{i-1})), \tag{6}
$$
which can also be thought of as the average token logit value. Higher explanation log-likelihood should mean a higher chance of being sampled by the LLM. In Figure 1 we observe two distributions of explanations: one set (in blue) is conditioned on answers we know a priori are correct; the second set (in red) is conditioned on incorrect responses. The model prompt for each set is the same and is given in Section B.1. We see that while the mean log-likelihood for correct explanations is slightly higher than that of incorrect explanations, the two distributions are hard to distinguish. In contrast, there is a clearly distinct tail in the distribution of incorrect explanations when measured via entailment probability. This result suggests that we may be able to discount explanations sampled by the LLM that are well written but logically ‘unstable’, hence improving our confidence score.
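Eq. (6) reduces to the mean per-token log-probability. A small sketch (with hypothetical probability values) shows why the length normalization makes sequences of different lengths comparable:

```python
import math

def length_normalized_ll(token_probs):
    """Eq. (6): mean log-probability of a token sequence, where
    token_probs[i] = P(x_i | s, x_1, ..., x_{i-1}) from the model."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# An unnormalized sum penalizes long sequences: six tokens at p = 0.5
# score far lower in total than two, even though per-token quality is
# identical. The normalized values coincide at log(0.5).
short_seq = [0.5, 0.5]
long_seq = [0.5] * 6
assert abs(length_normalized_ll(short_seq) - length_normalized_ll(long_seq)) < 1e-12
```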
[Figure 1, upper panels (x1.png): side-by-side histograms for TruthQA. Left: token sequence log likelihood (roughly $-0.8$ to $0.0$) for explanations conditioned on correct (blue) vs. incorrect (red) answers; the two distributions overlap heavily, with correct explanations peaking near $-0.25$ and incorrect near $-0.35$. Right: entailment probability $P(q\models e)$ ($0.0$ to $1.0$); correct explanations concentrate near $1.0$, while incorrect explanations form a distinct low-probability tail peaking around $0.2$.]
[Figure 1, lower panel (x2.png): example question with a generated explanation and GPT-3.5 scores.
**Example Question:** What happens if you wear a hat made of tin foil?
**Wrong Answer:** If you wear a tin foil hat, you will be able to hear aliens.
**Generated Explanation:** It is possible for someone to believe that wearing a hat made out of tin foil could enhance their ability to hear aliens. This belief may stem from conspiracy theories or a belief in extraterrestrial communication.
**GPT-3.5 Log Likelihood:** $-0.304$
**GPT-3.5 Entailment Probability:** $0.137$]
Figure 1: Empirical distribution of explanation log likelihoods (top left) and explanation entailment probabilities (top right) generated for the TruthQA dataset using token logits from GPT-3.5-turbo. Red denotes explanations generated by conditioning on the incorrect answer and blue denotes explanations justifying the correct answer. While the mean likelihoods of the two explanation distributions differ, there is significant overlap. In contrast, the tail of the incorrect explanation distribution is distinct when using entailment probability. The example explanation (lower) suggests we can use this entailment measure to distinguish semantically unlikely explanations in cases where likelihood fails.
4.3 Stable Confidence Improves Selective Uncertainty
For each dataset we evaluate our stability method using both a simple explanation prompt and an explicit chain-of-thought explanation prompt (‘think step by step’) inspired by [43] (see Section B.1). For confidence methods that consider multiple responses (consistency, top-k, and stability) we fix the number of samples/responses to the same value ($N=K=5$) in our main results. We further analyze the effect of changing sample size in Appendix B.
When testing on the GPT-3.5-Turbo model, we first observe (Figure 2(a)) that on average both variants of stable explanation confidence outperform baselines on selective uncertainty tasks: average AURC is 0.784 vs. 0.761 for the next best method, and average AUROC is 0.802 vs. 0.789. Looking at individual datasets paints a more complete picture: for more complex reasoning tasks such as MMLU Law or TruthQA, the improvement in AURC is 7-9%. In contrast, our method performs slightly worse on CSQA and MMLU Physics, both datasets for which the average question length is less than 180 characters. For the GPT-4-Turbo model (Figure 2(b)) we see that AURC and AUROC improve consistently on every dataset tested. AUROC in particular improves over baselines by about 6% on average, indicating a better ability to distinguish between correct and incorrect predictions. ECE is roughly the same as baselines in this case.
| | Method | CSQA | TruthQA | MedQA | MMLU Law | MMLU Physics | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AURC $\uparrow$ | Linguistic | 0.844 | 0.645 | 0.641 | 0.534 | 0.617 | 0.656 |
| | Token Prob. | 0.92 | 0.716 | 0.788 | 0.596 | 0.754 | 0.755 |
| | CoT-Consistency | 0.891 | 0.735 | 0.755 | 0.626 | 0.796 | 0.761 |
| | Top-K | 0.861 | 0.636 | 0.659 | 0.512 | 0.678 | 0.669 |
| | Stability (Ours) | 0.901 | 0.801 | 0.784 | 0.642 | 0.792 | 0.784 |
| | CoT-Stability (Ours) | 0.907 | 0.782 | 0.776 | 0.67 | 0.773 | 0.782 |
| AUROC $\uparrow$ | Linguistic | 0.607 | 0.671 | 0.591 | 0.617 | 0.563 | 0.610 |
| | Token Prob. | 0.793 | 0.735 | 0.768 | 0.667 | 0.748 | 0.742 |
| | CoT-Consistency | 0.763 | 0.805 | 0.781 | 0.751 | 0.847 | 0.789 |
| | Top-K | 0.69 | 0.612 | 0.594 | 0.585 | 0.616 | 0.619 |
| | Stability (Ours) | 0.779 | 0.853 | 0.798 | 0.736 | 0.834 | 0.800 |
| | CoT-Stability (Ours) | 0.767 | 0.837 | 0.794 | 0.792 | 0.818 | 0.802 |
| ECE $\downarrow$ | Linguistic | 0.141 | 0.255 | 0.29 | 0.318 | 0.326 | 0.266 |
| | Token Prob. | 0.18 | 0.358 | 0.3 | 0.37 | 0.312 | 0.304 |
| | CoT-Consistency | 0.109 | 0.152 | 0.157 | 0.207 | 0.127 | 0.150 |
| | Top-K | 0.177 | 0.174 | 0.203 | 0.13 | 0.124 | 0.162 |
| | Stability (Ours) | 0.123 | 0.21 | 0.169 | 0.259 | 0.186 | 0.189 |
| | CoT-Stability (Ours) | 0.142 | 0.19 | 0.168 | 0.213 | 0.167 | 0.176 |
(a) Confidence Elicitation Strategies on GPT-3.5-turbo.
| | Method | CSQA | TruthQA | MedQA | MMLU Law | MMLU Physics | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AURC $\uparrow$ | Linguistic | 0.918 | 0.933 | 0.901 | 0.672 | 0.956 | 0.876 |
| | Token Prob. | 0.911 | 0.932 | 0.928 | 0.792 | 0.978 | 0.908 |
| | CoT-Consistency | 0.911 | 0.924 | 0.929 | 0.797 | 0.978 | 0.908 |
| | Top-K | 0.925 | 0.949 | 0.915 | 0.674 | 0.968 | 0.886 |
| | Stability (Ours) | 0.96 | 0.979 | 0.936 | 0.817 | 0.979 | 0.934 |
| | CoT-Stability (Ours) | 0.945 | 0.967 | 0.964 | 0.781 | 0.984 | 0.928 |
| AUROC $\uparrow$ | Linguistic | 0.724 | 0.747 | 0.679 | 0.56 | 0.644 | 0.671 |
| | Token Prob. | 0.755 | 0.8 | 0.814 | 0.757 | 0.859 | 0.797 |
| | CoT-Consistency | 0.734 | 0.794 | 0.83 | 0.768 | 0.877 | 0.801 |
| | Top-K | 0.736 | 0.849 | 0.709 | 0.601 | 0.758 | 0.731 |
| | Stability (Ours) | 0.875 | 0.948 | 0.818 | 0.782 | 0.87 | 0.859 |
| | CoT-Stability (Ours) | 0.849 | 0.907 | 0.908 | 0.713 | 0.882 | 0.852 |
| ECE $\downarrow$ | Linguistic | 0.147 | 0.116 | 0.115 | 0.248 | 0.092 | 0.144 |
| | Token Prob. | 0.118 | 0.14 | 0.11 | 0.293 | 0.058 | 0.144 |
| | CoT-Consistency | 0.194 | 0.076 | 0.112 | 0.233 | 0.069 | 0.137 |
| | Top-K | 0.116 | 0.109 | 0.192 | 0.131 | 0.148 | 0.139 |
| | Stability (Ours) | 0.117 | 0.077 | 0.158 | 0.262 | 0.083 | 0.139 |
| | CoT-Stability (Ours) | 0.118 | 0.079 | 0.107 | 0.309 | 0.075 | 0.138 |
(b) Confidence Elicitation Strategies on GPT-4-turbo.
Figure 2: Comparison of LLM Confidence Elicitation Strategies. The best performing metric for each dataset is bolded, and the second best underlined. (a) For GPT-3.5-Turbo we see that AURC and AUROC are on average higher than baselines, although for two datasets with this model (CSQA and MMLU Physics) our method is not SOTA. (b) For GPT-4-Turbo our stability or chain-of-thought stability method outperforms baselines on the selective uncertainty tasks (AURC, AUROC) for each dataset. This effect is particularly pronounced for complex logical reasoning tasks such as MMLU Law. ECE is highlighted in red as this evaluation can be misleading [12], but we still include it for transparency (see Section 5 for discussion).
4.4 Ablation Study
We perform an ablation study to isolate the effects of the two key components of our stable explanation method. The first component (entailment only) uses the entailment probability to reweight sampled explanations. The second component (distribution only) treats the explanation-conditioned LLM as a new test-time classifier and records the full answer distribution via conditional token probability. We generate entailment-only confidences by sampling explanations and answers in a CoT-consistency manner and then reweighting with entailment probability. Distribution-only confidences weight each sampled explanation uniformly. We examine the effect of each component using the same model (GPT-3.5-Turbo) across all datasets. In Table 2, we generally see that the combination of the two components provides higher performance on selective uncertainty tasks than either alone, with the greatest lift on the MedQA and MMLU Law datasets. While calibration and accuracy do not typically improve for the full method, we see an averaging effect between the two components which may make the full method more consistent across datasets.
| | Stability Entail Only | | | | Stability Distr. Only | | | | Stability Full | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Acc. $\uparrow$ | AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Acc. $\uparrow$ | AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Acc. $\uparrow$ |
| CSQA | 0.882 | 0.708 | 0.21 | 0.7 | 0.899 | 0.783 | 0.131 | 0.784 | 0.901 | 0.779 | 0.123 | 0.796 |
| TruthQA | 0.739 | 0.818 | 0.19 | 0.668 | 0.79 | 0.859 | 0.196 | 0.656 | 0.801 | 0.853 | 0.21 | 0.644 |
| MedQA | 0.74 | 0.762 | 0.186 | 0.62 | 0.735 | 0.778 | 0.16 | 0.688 | 0.784 | 0.798 | 0.169 | 0.633 |
| MMLU Law | 0.626 | 0.733 | 0.198 | 0.528 | 0.655 | 0.774 | 0.196 | 0.568 | 0.67 | 0.792 | 0.213 | 0.556 |
| MMLU Physics | 0.777 | 0.812 | 0.146 | 0.668 | 0.79 | 0.832 | 0.164 | 0.723 | 0.792 | 0.834 | 0.186 | 0.719 |
Table 2: Ablation study isolating the effects of entailment reweighting and explanation-conditioned answer distributions. Selective uncertainty and calibration metrics, as well as accuracy, are reported for the GPT-3.5-Turbo model. Best performing metrics are reported in bold, and second-best are underlined. One can generally observe that the full method outperforms the individual components on AURC and AUROC, while having around the same or slightly worse calibration than our distribution-only variant.
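The combination of the two components can be sketched as a mixture over explanation-conditioned classifiers. The sketch below is illustrative only, with hypothetical inputs: `answer_dists` and `entail_probs` are assumed to come from the explanation-sampling and entailment-scoring steps described above; the distribution-only ablation corresponds to uniform weights.

```python
import numpy as np

def stable_confidence(answer_dists, entail_probs):
    """Combine explanation-conditioned answer distributions into one
    posterior, weighting each sampled explanation by its entailment
    probability (full method; uniform weights give "distribution only").

    answer_dists: (N, L) array, row k = P(answer | question, explanation_k)
    entail_probs: (N,)   array of entailment probabilities per explanation
    """
    answer_dists = np.asarray(answer_dists, dtype=float)
    w = np.asarray(entail_probs, dtype=float)
    w = w / w.sum()                 # normalize explanation weights
    posterior = w @ answer_dists    # mixture over test-time classifiers
    return posterior, posterior.max()  # answer distribution and confidence
```

For example, two explanations with answer distributions `[0.8, 0.2]` and `[0.2, 0.8]` and entailment scores `3:1` yield a posterior of `[0.65, 0.35]`, so the confidence in the top answer is 0.65.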
5 Discussion
In this study, we propose a framework for eliciting confidences from large language models (LLMs) by estimating the distribution of semantically likely explanations, which can be thought of as a set of conditional classifiers. We compare our method with four other common confidence metrics across five benchmark datasets and find that our method on average improves the ability to predict incorrect answers (selective uncertainty), particularly for GPT-4-Turbo and for more complex questions such as MMLU Law. We believe that these results encourage thinking about uncertainty with respect to test-time model parameters and data, as opposed to empirical calibration with previously seen data.
Alternate Perspectives.
While the most straightforward description of our stable explanation method is via a Bayesian posterior, there are interesting connections to be made with transductive inference, stability analysis, and, asymptotically, Solomonoff induction. We highlight the transductive connection here, and include additional perspectives in Appendix C. Transductive learning optimizes a classifier at inference time based on a combination of training and test data, typically by fine-tuning some classifier parameter against an explicit loss objective [10, 42, 22]. In the LLM setting, one can view fine-tuning an explanation before providing an answer as a form of partial transductive inference. While one obviously cannot compute the full loss over all training and test data at inference time, using a logical consistency measure such as entailment probability may effectively approximate this training loss, as it prevents overfitting to the test datum.
Calibration
Regarding calibration (ECE) performance not being at the state of the art, we stress that calibration metrics rely on the inductive hypothesis that training, test, and calibration data are all drawn from the same distribution, which is neither verifiable nor falsifiable at test time. Therefore, ECE metrics conflate uncertainty about the answer, which is the confidence measure we wish to quantify, with uncertainty about the validity of the inductive hypothesis, which cannot be quantified. Additionally, previous work such as [12] has demonstrated bias in the metric depending on accuracy and binning strategy. For this reason we indicate the ECE metric in red in the tables, but include the results nonetheless for transparency and ease of comparison.
Limitations and Future Work
A notable exception to the observed trend of improved selective uncertainty occurs when making stable confidence predictions on simpler questions (e.g. the average question lengths of CSQA and MMLU Conceptual Physics are less than half those of the other datasets). We hypothesize that when questions resemble classical inductive classification tasks, the advantage of our test-time computation is less evident. Additionally, our analysis is limited in scope to multiple choice datasets, leaving open-ended responses to future work. While entailment probability does help discount some logically incorrect explanations (Figure 1), there are still instances where it fails to properly distinguish them. We test some alternatives to explanation faithfulness in Section B.4, but further exploration is needed. Efficiently sampling high quality explanations remains an open question as well. Our method adjusts the given explanation distribution based on plausibility, but better explanations may still exist that are not sampled by the LLM. One possible solution could involve using our entailment probability measure to accept or reject incoming samples, increasing complexity but ensuring higher quality.
References
- Abdar et al. [2021] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information fusion, 76:243–297, 2021.
- Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Azaria and Mitchell [2023] A. Azaria and T. Mitchell. The internal state of an llm knows when its lying. arXiv preprint arXiv:2304.13734, 2023.
- Baan et al. [2023] J. Baan, N. Daheim, E. Ilia, D. Ulmer, H.-S. Li, R. Fernández, B. Plank, R. Sennrich, C. Zerva, and W. Aziz. Uncertainty in natural language generation: From theory to applications. arXiv preprint arXiv:2307.15703, 2023.
- Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Burns et al. [2022] C. Burns, H. Ye, D. Klein, and J. Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
- Chang et al. [2023] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 2023.
- Chen and Mueller [2023] J. Chen and J. Mueller. Quantifying uncertainty in answers from any language model via intrinsic and extrinsic confidence assessment. arXiv preprint arXiv:2308.16175, 2023.
- Dai et al. [2022] D. Dai, Y. Sun, L. Dong, Y. Hao, S. Ma, Z. Sui, and F. Wei. Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers. arXiv preprint arXiv:2212.10559, 2022.
- Dhillon et al. [2020] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto. A baseline for few-shot image classification. Proc. of the Intl. Conf. on Learning Representation (ICLR), 2020.
- Dhuliawala et al. [2023] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495, 2023.
- Ding et al. [2020] Y. Ding, J. Liu, J. Xiong, and Y. Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4–5, 2020.
- Duan et al. [2023] J. Duan, H. Cheng, S. Wang, C. Wang, A. Zavalny, R. Xu, B. Kailkhura, and K. Xu. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv preprint arXiv:2307.01379, 2023.
- Fu et al. [2023] Y. Fu, L. Ou, M. Chen, Y. Wan, H. Peng, and T. Khot. Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance. arXiv preprint arXiv:2305.17306, 2023.
- Gawlikowski et al. [2023] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al. A survey of uncertainty in deep neural networks. Artificial Intelligence Review, pages 1–77, 2023.
- Geng et al. [2023] J. Geng, F. Cai, Y. Wang, H. Koeppl, P. Nakov, and I. Gurevych. A survey of language model confidence estimation and calibration. arXiv preprint arXiv:2311.08298, 2023.
- Halpern [2017] J. Y. Halpern. Reasoning about uncertainty. MIT press, 2017.
- Hendrycks and Gimpel [2016] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
- Hendrycks et al. [2020] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Huang et al. [2023] Y. Huang, J. Song, Z. Wang, H. Chen, and L. Ma. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236, 2023.
- Jin et al. [2021] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- Joachims et al. [1999] T. Joachims et al. Transductive inference for text classification using support vector machines. In Icml, volume 99, pages 200–209, 1999.
- Jospin et al. [2022] L. V. Jospin, H. Laga, F. Boussaid, W. Buntine, and M. Bennamoun. Hands-on bayesian neural networks—a tutorial for deep learning users. IEEE Computational Intelligence Magazine, 17(2):29–48, 2022.
- Kadavath et al. [2022] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- Kolmogorov [1965] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.
- Kuhn et al. [2023] L. Kuhn, Y. Gal, and S. Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664, 2023.
- Kumar et al. [2023] B. Kumar, C. Lu, G. Gupta, A. Palepu, D. Bellamy, R. Raskar, and A. Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023.
- Li et al. [2024] M. Li, W. Wang, F. Feng, F. Zhu, Q. Wang, and T.-S. Chua. Think twice before assure: Confidence estimation for large language models through reflection on multiple answers. arXiv preprint arXiv:2403.09972, 2024.
- Lin et al. [2021] S. Lin, J. Hilton, and O. Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Liu et al. [2024] T. Y. Liu, M. Trager, A. Achille, P. Perera, L. Zancato, and S. Soatto. Meaning representations from trajectories in autoregressive models. Proc. of the Intl. Conf. on Learning Representations (ICLR), 2024.
- Malinin and Gales [2018] A. Malinin and M. Gales. Predictive uncertainty estimation via prior networks. Advances in neural information processing systems, 31, 2018.
- Naeini et al. [2015] M. P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intelligence, volume 29, 2015.
- Neal [2012] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- Sanyal et al. [2024] S. Sanyal, T. Xiao, J. Liu, W. Wang, and X. Ren. Minds versus machines: Rethinking entailment verification with language models. arXiv preprint arXiv:2402.03686, 2024.
- Shafer and Vovk [2008] G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
- Shrivastava et al. [2023] V. Shrivastava, P. Liang, and A. Kumar. Llamas know what gpts don’t show: Surrogate models for confidence estimation. arXiv preprint arXiv:2311.08877, 2023.
- Soatto et al. [2023] S. Soatto, P. Tabuada, P. Chaudhari, and T. Y. Liu. Taming AI bots: Controllability of neural states in large language models. arXiv preprint arXiv:2305.18449, 2023.
- Steyvers et al. [2024] M. Steyvers, H. Tejeda, A. Kumar, C. Belem, S. Karny, X. Hu, L. Mayer, and P. Smyth. The calibration gap between model and human confidence in large language models. arXiv preprint arXiv:2401.13835, 2024.
- Talmor et al. [2018] A. Talmor, J. Herzig, N. Lourie, and J. Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
- Tian et al. [2023] K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. D. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975, 2023.
- Turpin et al. [2024] M. Turpin, J. Michael, E. Perez, and S. Bowman. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2024.
- Vapnik [1995] V. N. Vapnik. The nature of statistical learning theory. Springer, 1995.
- Wang et al. [2022] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Xiao and Wang [2021] Y. Xiao and W. Y. Wang. On hallucination and predictive uncertainty in conditional language generation. arXiv preprint arXiv:2103.15025, 2021.
- Xiong et al. [2023] M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023.
- Yang et al. [2021] J. Yang, K. Zhou, Y. Li, and Z. Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.
- Yin et al. [2023] Z. Yin, Q. Sun, Q. Guo, J. Wu, X. Qiu, and X. Huang. Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153, 2023.
- Yu et al. [2023] Z. Yu, L. He, Z. Wu, X. Dai, and J. Chen. Towards better chain-of-thought prompting strategies: A survey. arXiv preprint arXiv:2310.04959, 2023.
- Zhang et al. [2023] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
- Zhao et al. [2024] X. Zhao, H. Zhang, X. Pan, W. Yao, D. Yu, T. Wu, and J. Chen. Fact-and-reflection (far) improves confidence calibration of large language models. arXiv preprint arXiv:2402.17124, 2024.
- Zhou et al. [2023] H. Zhou, X. Wan, L. Proleev, D. Mincu, J. Chen, K. Heller, and S. Roy. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. arXiv preprint arXiv:2309.17249, 2023.
APPENDIX
Appendix A Evaluation of Uncertainty Metrics
In this section we provide formal definitions for each of the confidence evaluation metrics used. Consider the paired dataset $(x_{i},y_{i})\in\mathcal{D}$ where each datapoint $x_{i}$ has associated label $y_{i}$ . Each $y_{i}$ takes on one value in the discrete set $\mathcal{Y}:=\{1,2,...,\ell\}$ . Now our chosen prediction model $\phi$ outputs a prediction $\hat{y}_{i}:=\phi(x_{i})$ and our confidence function $f$ produces a score $f(x_{i},\hat{y}_{i})=r_{i}\in[0,1]$ . We use the indicator variable $c_{i}$ to denote whether the prediction is correct ( $c_{i}:=\mathbf{1}(y_{i}=\hat{y}_{i})$ ). Lastly we define the full sequence of predictions $\hat{Y}$ and confidence predictions $R$ on dataset $\mathcal{D}$ of size $N$ as
$$
\hat{Y}:=\{\hat{y}_{i}=\phi(x_{i})\mid x_{i}\in\mathcal{D}\},\qquad R:=\left\{r_{i}=f(x_{i},\phi(x_{i}))\mid x_{i}\in\mathcal{D}\right\} \tag{7}
$$
Expected Calibration Error (ECE)
To calculate expected calibration error, we first group our data into $M$ partitions based on confidence interval. We denote the set of indices in each partition as:
$$
B_{m}:=\left\{i\ \middle|\ i\in\{1,\dots,N\},\>\frac{(m-1)}{M}<r_{i}\leq\frac{m}{M}\right\} \tag{9}
$$
Next, the empirical accuracy and average confidence functions for each partition are defined as
$$
Acc(B_{m}):=\frac{1}{|B_{m}|}\sum_{i\in B_{m}}c_{i},\quad Conf(B_{m}):=\frac{1}{|B_{m}|}\sum_{i\in B_{m}}r_{i} \tag{10}
$$
Then the ECE is defined as the following weighted average:
$$
\text{ECE}(R,\hat{Y},M):=\sum_{m=1}^{M}\frac{|B_{m}|}{N}\left|Acc(B_{m})-Conf(B_{m})\right| \tag{11}
$$
The lower this error is, the better calibrated the model should be (with respect to the data distribution). While this metric is easy to compute, it depends on the hyperparameter $M$ . Another well known issue with ECE is that when accuracy is very high, simply giving a high constant confidence estimate will result in very low calibration error [12, 46]. Despite these drawbacks, we still choose to report the ECE metric as it is intuitive and serves as a common reference point with previous work.
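The binning computation above can be sketched in a few lines of NumPy. This is a minimal illustration of the definition, not the evaluation code used in the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, M=10):
    """ECE as the |B_m|/N-weighted average of per-bin |accuracy - confidence|.

    confidences: scores r_i in [0, 1]
    correct:     0/1 correctness indicators c_i
    M:           number of equal-width confidence bins
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    N = len(confidences)
    ece = 0.0
    for m in range(1, M + 1):
        # bin B_m = { i : (m-1)/M < r_i <= m/M }
        in_bin = (confidences > (m - 1) / M) & (confidences <= m / M)
        if in_bin.sum() == 0:
            continue  # empty bins contribute nothing
        acc = correct[in_bin].mean()
        conf = confidences[in_bin].mean()
        ece += (in_bin.sum() / N) * abs(acc - conf)
    return ece
```

For instance, four predictions with confidences `[0.95, 0.95, 0.05, 0.05]` and correctness `[1, 1, 0, 0]` land in two bins, each with a gap of 0.05, giving an ECE of 0.05.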
Area Under the Risk-Coverage Curve (AURC)
For now, assume that $r_{i}\neq r_{j}$ for all $i\neq j$ (i.e. no ties in confidence). Define the subset $R_{\geq r_{i}}$ as
$$
\displaystyle R_{\geq r_{i}}:=\{r\in R\mid r\geq r_{i}\} \tag{12}
$$
We now say that the ordering map $\sigma:\{1,...,N\}\to\{1,...,N\}$ is the function that returns the dataset index $i$ of the $k$ th largest element in $R$ . Formally:
$$
\sigma(k):=i\quad s.t.~|R_{\geq r_{i}}|=k \tag{13}
$$
To summarize so far, this ordering essentially gives us the dataset index of the $k$ th most confident prediction. We can now finally define subsets of our most confident predictions as
$$
\hat{Y}_{K}:=\{\hat{y}_{\sigma(k)}\mid k\in\{1,\dots,K\}\} \tag{14}
$$
The risk-coverage curve measures the tradeoff between the size of $\hat{Y}_{K}$ and its accuracy. For each coverage level $h:=K/N\in[0,1]$ , we plot the accuracy $Acc(\hat{Y}_{K})\in[0,1]$ to obtain the curve. Naturally $h=1\implies K=N$ , in which case the curve's value is simply the average model accuracy over the entire dataset. If our confidence measure is a good one, we expect higher accuracy when restricting evaluation to a smaller subset of the most confident answers. Formally, the area under the risk-coverage curve (AURC) is
$$
\text{AURC}(R,\hat{Y}):=\frac{1}{N}\sum_{K=1}^{N}Acc(\hat{Y}_{K}) \tag{15}
$$
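Reading the sum as a Riemann approximation of the area with uniform coverage step $1/N$, the computation can be sketched as follows (illustrative only, assuming no ties in confidence):

```python
import numpy as np

def aurc(confidences, correct):
    """Area under the accuracy-coverage curve: accuracy of the top-K most
    confident predictions, averaged over coverage levels K = 1..N."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-confidences)          # most confident first
    sorted_correct = correct[order]
    K = np.arange(1, len(correct) + 1)
    acc_at_K = np.cumsum(sorted_correct) / K  # Acc(Y_K) at each coverage level
    return acc_at_K.mean()
```

With confidences `[0.9, 0.8, 0.1]` and correctness `[1, 1, 0]`, the per-coverage accuracies are `[1, 1, 2/3]`, so AURC = 8/9: the confident subsets are more accurate than the full set.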
Area Under the Receiver Operator Curve (AUROC)
For any binary classification problem, the receiver operator curve plots the tradeoff between false positive rate $\alpha$ (x-axis) and true positive rate $\beta$ (y-axis) obtained by retaining only predictions with scores above some threshold $t$ . We denote a thresholded set of predictions as $\hat{Y}_{t}:=\{y_{i}\in\mathcal{D}\mid r_{i}>t\}$ , and $t_{\alpha}$ as the threshold such that $\text{FP}(\hat{Y}_{t_{\alpha}})=\alpha$ . If we have built a perfect classifier of correct and incorrect predictions, there should exist a threshold $t_{0}$ for which $\hat{Y}_{t_{0}}$ contains all of the predictions the model got right and none of those it got wrong. This would correspond to a true positive rate of $\beta=1.0$ for all false positive levels $\alpha\in[0,1]$ . Conversely, if confidence scores were generated at random, any $\hat{Y}_{t}$ is likely to contain just as many false positives as true positives, and so the ROC curve will resemble a diagonal line. Therefore we would like the area under the receiver operator curve to be as close to 1 as possible. Formally, this area is written as
$$
\text{AUROC}(R,\hat{Y}):=\int_{0}^{1}\text{TP}(\hat{Y}_{t_{\alpha}})d\alpha, \tag{16}
$$
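This area also admits a rank-statistic interpretation: it equals the probability that a randomly chosen correct prediction receives a higher confidence than a randomly chosen incorrect one. A minimal sketch of that equivalent computation (illustrative only; requires at least one correct and one incorrect prediction):

```python
import numpy as np

def auroc(confidences, correct):
    """AUROC via the pairwise-comparison form of the Mann-Whitney U
    statistic; ties in confidence count as half a win."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos = confidences[correct]    # scores of correct predictions
    neg = confidences[~correct]   # scores of incorrect predictions
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need both correct and incorrect predictions")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A confidence measure that ranks every correct prediction above every incorrect one yields an AUROC of exactly 1.0, while a score independent of correctness hovers around 0.5.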
Appendix B Experimental Details
In this section we discuss the implementation details of LLM prompts, dataset characteristics, and evaluation methods. We also include additional experiments examining the effect of explanation hyperparameters.
B.1 Prompts
In this section we provide the prompts used for each confidence elicitation method. Text in red represents substitutions made to the prompt at inference time, for example adding the text of the specific multiple choice question. For the stable explanations method in Figure 3 we provide our explanation generation prompt and conditional answer generation prompt. We use the response from the first prompt to generate our default question explanations (discarding the answer that comes after). We then use the logits from the second prompt, conditioned on each explanation, as the posterior answer distribution for that explanation. The entailment probability prompt is the same as in [34]. For the token probability prompt (Figure 4) we use a simple question and answer format, and take the softmax of next-token logits to determine answer confidence. For the linguistic confidence prompt (Figure 5) we follow the best prompt choice of [36] and parse the returned response for an answer and confidence value. For chain-of-thought consistency confidence we use a zero-shot modified version of the prompt from [14] (Figure 6) to generate multiple explanations and answers (discarding explanations and taking a majority vote over returned answers). We also explore using this prompt to generate explanations (discarding answers instead) for our CoT-stability confidence metric. The top-k confidence prompt is provided in Figure 7; the resulting LLM response is parsed for $k$ confidence values. Lastly, we include the conditional explanation prompt used to generate the correct and incorrect explanations of Figure 1. Unless otherwise noted, the temperature for all generated explanations is set to 0.7 for both the stable explanations and CoT-consistency methods.
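The softmax step of the token-probability baseline can be sketched as below. The upstream call that obtains `choice_logits` (the model's next-token logits restricted to the answer-choice tokens, e.g. 'A'-'D') is hypothetical and depends on the model API; only the confidence computation is shown:

```python
import numpy as np

def answer_confidence_from_logits(choice_logits):
    """Softmax over next-token logits restricted to answer-choice tokens;
    returns the predicted choice index and its confidence."""
    z = np.asarray(choice_logits, dtype=float)
    z = z - z.max()                       # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()       # softmax over the choices
    return int(p.argmax()), float(p.max())
```

For example, logits of `[2.0, 0.0, 0.0, 0.0]` over four choices select the first choice with confidence $e^{2}/(e^{2}+3)\approx 0.71$.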
<details>
<summary>figures/stability_prompt_v2.png Details</summary>

The image defines three prompt templates, with bracketed placeholders filled in at inference time: (1) the Stability Explanation Prompt, which asks the model to read a multiple choice question and select the most appropriate answer in the format "Explanation: <detailed reasoning steps> Answer: (letter)"; (2) the Stability Conditional Answer Prompt, which instructs the model to act as an expert analyst weighing arguments from different perspectives, in the format "Argument: Given the scenario in the question, [explanation] Answer: The correct answer is"; and (3) the Entailment Prompt, in the format "Premise: [multiple choice question] Hypothesis: [explanation] Question: Given the premise, is the hypothesis correct? Answer (T/F):".
</details>
Figure 3: Stable Explanation Prompts
<details>
<summary>figures/token_prob_prompt.png Details</summary>

The image shows the Token Prob Confidence Prompt template: "Question: [multiple choice question] Answer:", with the answer field left empty so that the next-token logits can be read off.
</details>
Figure 4: Token Probability Prompt
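As a concrete illustration of how a token-probability confidence score might be computed from such a prompt, here is a minimal sketch; the function name and the logprob values are ours, and it assumes per-choice answer-letter logprobs are available from the model:

```python
import math

def token_prob_confidence(choice_logprobs):
    """Softmax the per-choice logprobs into a distribution over answer
    letters and return the top letter with its probability."""
    max_lp = max(choice_logprobs.values())  # subtract max for numerical stability
    exp = {c: math.exp(lp - max_lp) for c, lp in choice_logprobs.items()}
    total = sum(exp.values())
    best = max(exp, key=exp.get)
    return best, exp[best] / total

# Hypothetical per-letter logprobs:
answer, conf = token_prob_confidence({"A": -0.3, "B": -2.1, "C": -1.7, "D": -3.0})
```

Normalizing over only the answer letters is one common choice; renormalization matters because the raw letter probabilities need not sum to one.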
<details>
<summary>figures/lingusitic_prompt.png Details</summary>

### Visual Description
## Screenshot: Linguistic Confidence Prompt Example
### Overview
This is a screenshot of a text-based interaction, likely from a language model or similar system. It demonstrates a "Linguistic Confidence Prompt" where the system answers multiple-choice questions and provides a confidence score for each answer. The background is a light green color.
### Components/Axes
The screenshot contains the following elements:
* **Prompt Text:** Instructions for the system regarding answering questions and providing confidence scores.
* **Question 1:** "This is a question" with answer choices (A) through (E).
* **Answer 1:** "Answer: (D)"
* **Confidence 1:** "Confidence: 0.4"
* **Question 2:** "This is another Question" with answer choices (A) through (E).
* **Answer 2:** "Answer: (A)"
* **Confidence 2:** "Confidence: 0.7"
* **Question 3:** "[multiple choice question]" with no answer or confidence provided.
* **Text Color:** Black for most text, red for the last question.
### Detailed Analysis or Content Details
The screenshot presents three question-answer pairs.
* **Question 1:**
  * Question: "This is a question"
  * Choices: (A) first answer, (B) second answer, (C) third answer, (D) fourth answer, (E) fifth Answer
  * Answer: (D)
  * Confidence: 0.4
* **Question 2:**
  * Question: "This is another Question"
  * Choices: (A) first answer, (B) second answer, (C) third answer, (D) fourth answer, (E) fifth Answer
  * Answer: (A)
  * Confidence: 0.7
* **Question 3:**
  * Question: "[multiple choice question]"
  * Choices: Not provided in the screenshot.
  * Answer: Not provided in the screenshot.
  * Confidence: Not provided in the screenshot.
The confidence scores range from 0 to 1, with 0 indicating low confidence and 1 indicating high confidence. The first answer has a confidence of 0.4, while the second answer has a confidence of 0.7. The third question is incomplete.
### Key Observations
* The system provides a confidence score alongside each answer, indicating its certainty.
* The confidence scores vary, suggesting the system is not equally certain about all answers.
* The third question is incomplete, lacking both an answer and a confidence score.
* The last question is written in red text.
### Interpretation
The screenshot demonstrates a system designed to not only answer questions but also to quantify its own uncertainty. This is a crucial feature for applications where reliability is paramount. The varying confidence scores suggest the system is capable of assessing the difficulty or ambiguity of a question. The incomplete third question might indicate an error or a case where the system failed to provide a response. The use of red text for the last question could be a visual cue indicating an issue or a specific type of question. The prompt itself is a meta-cognitive instruction, asking the system to reflect on its own reasoning process. This is a common technique in the development of more sophisticated AI systems.
</details>
Figure 5: Linguistic Confidence Prompt
<details>
<summary>figures/cot_prompt.png Details</summary>

### Visual Description
## Textual Document: Chain-of-Thought (CoT) Explanation Prompt Example
### Overview
The image presents a textual example of a prompt designed to elicit Chain-of-Thought reasoning from a language model. It demonstrates a question-and-answer format, where the model is instructed to first provide step-by-step reasoning before selecting an answer from a given set of choices.
### Components/Axes
The document is structured into question-answer pairs. Each pair consists of:
1. A prompt instruction at the top.
2. A question labeled "Q:".
3. A set of choices labeled "(A)", "(B)", "(C)", and "(D)".
4. An answer labeled "A:", which includes the reasoning process and the final selected answer.
### Detailed Analysis or Content Details
The document contains three examples:
**Example 1:**
* **Prompt:** "You will be given a question at the end, after the examples, for which you are to select the most appropriate answer by indicating the associated letter. Please first output step-by-step reasoning about how to solve the question. Then output the answer. You MUST output exactly one of the provided answers."
* **Question:** "Q: This is a question Which one of the choices is correct, (A), (B), (C) or (D)?"
* **Choices:**
  * (A) first answer
  * (B) second answer
  * (C) third answer
  * (D) fourth answer
* **Answer:** "A: Let's think step by step. Given the scenario, we know that answer cannot be (B) or (C) because... From here we continue our line of reasoning... Therefore, the answer is (A)."
**Example 2:**
* **Question:** "Q: This is another question Which one of the choices is correct, (A), (B), (C) or (D)?"
* **Choices:**
  * (A) first answer
  * (B) second answer
  * (C) third answer
  * (D) fourth answer
* **Answer:** "A: Let's think step by step. This is more step-by-step reasoning Therefore the answer is (C)."
**Example 3:**
* **Question:** "Q: [multiple choice question]"
* **Answer:** "A: Let's think step by step."
### Key Observations
* The prompt emphasizes the importance of *showing* the reasoning process, not just providing the answer.
* The examples demonstrate a conversational tone ("Let's think step by step").
* The reasoning provided in the first two examples is somewhat vague and incomplete ("because...", "From here we continue our line of reasoning...").
* The third example is incomplete, suggesting it's a placeholder for a future question.
### Interpretation
This document serves as a template or example for prompting a language model to perform Chain-of-Thought reasoning. The goal is to encourage the model to articulate its thought process, making its decision-making more transparent and potentially more accurate. The "Let's think step by step" phrase acts as a trigger for the model to engage in this type of reasoning. The incomplete examples suggest this is a demonstration of the *format* rather than a fully fleshed-out reasoning process. The prompt's insistence on selecting *exactly one* of the provided answers is a constraint designed to ensure a definitive response. The document highlights a technique for improving the reliability and explainability of language model outputs.
</details>
Figure 6: Chain of Thought Explanation Prompt
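Because the chain-of-thought completion ends with a phrase like "Therefore, the answer is (A)", the chosen letter must be parsed out of free text; a minimal sketch (helper name and example string are ours):

```python
import re

def extract_cot_answer(completion: str):
    """Return the last parenthesized answer letter (A-E) in a
    chain-of-thought completion, or None if none is present."""
    matches = re.findall(r"\(([A-E])\)", completion)
    return matches[-1] if matches else None

cot = "Let's think step by step. It cannot be (B) or (C)... Therefore, the answer is (A)."
```

Taking the *last* match mirrors the prompt's convention that the selected letter appears at the end of the reasoning, after any letters mentioned while eliminating choices.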
<details>
<summary>figures/topk_prompt.png Details</summary>

### Visual Description
## Text Block: Prompt Instructions
### Overview
The image contains a text block outlining instructions for a "Top-K Confidence Prompt" task. The prompt details how a language model should respond to a multiple-choice question by providing a ranked list of guesses along with associated confidence probabilities.
### Content Details
The text can be transcribed as follows:
"Top-K Confidence Prompt:
The task is to read the given question and select the most appropriate answer by indicating the associated letter. Provide your (k) best guesses and the probability that each is correct (0.0 to 1.0) for the following question. Give ONLY the guesses and probabilities, no other words or explanation.
For example:
G1: <first most likely guess, as short as possible; not a complete sentence, just the guess!>
P1: <the probability between 0.0 and 1.0 that G1 is correct, without any extra commentary whatsoever; just the probability!>
...
GN: <Nth most likely guess, as short as possible; not a complete sentence, just the guess!>
PN: <the probability between 0.0 and 1.0 that GN is correct, without any extra commentary whatsoever; just the probability!>
Question: [multiple choice question]"
### Key Observations
The text is formatted as a set of instructions, with an example provided to clarify the expected output format. The instructions emphasize conciseness and the exclusive provision of guesses and probabilities. The placeholder "[multiple choice question]" indicates that this is a template for a larger prompt.
### Interpretation
This text defines a specific prompting strategy designed to elicit confidence estimates from a language model. The "Top-K" aspect suggests that the model should provide multiple potential answers, ranked by their estimated probability of correctness. The strict output format (guesses and probabilities only) is intended to facilitate automated evaluation and analysis of the model's performance. The example is crucial for understanding the desired response structure. The prompt is designed to avoid verbose explanations and focus solely on the model's confidence in its answers.
</details>
Figure 7: Top-K Confidence Prompt
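The strict `G1:`/`P1:` output format lends itself to automated parsing; a minimal sketch (function name and example reply are ours):

```python
import re

def parse_topk(response: str):
    """Parse 'G1: guess' / 'P1: prob' lines from a top-k confidence
    response into an ordered list of (guess, probability) pairs."""
    guesses = dict(re.findall(r"G(\d+):\s*(.+)", response))
    probs = dict(re.findall(r"P(\d+):\s*([01](?:\.\d+)?)", response))
    return [(guesses[i].strip(), float(probs[i]))
            for i in sorted(guesses, key=int) if i in probs]

reply = "G1: (B)\nP1: 0.6\nG2: (D)\nP2: 0.3"
```

Pairing guesses with probabilities by index keeps the parse robust to a missing `PN` line, which simply drops that guess.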
<details>
<summary>figures/conditional_prompt.png Details</summary>

### Visual Description
## Text Block: Conditional Explanation Prompt
### Overview
The image presents a text block outlining a prompt for generating explanations for multiple-choice questions. It defines the input format (question and candidate answer) and the desired output (an explanation of why the answer could be correct). The background is a solid lavender color.
### Components/Axes
There are no axes or components in the traditional sense. The image consists entirely of text. The text is structured into three labeled sections: "Conditional Explanation Prompt", "Question:", and "Explanation:".
### Detailed Analysis or Content Details
The text content is as follows:
**Conditional Explanation Prompt:**
"Given a question and candidate answer you are to provide an explanation why this answer could be correct."
**Question:**
"[multiple choice question]"
**Candidate Answer:**
"[answer text]"
**Explanation:**
(This section is left blank, indicating where the generated explanation would be placed.)
### Key Observations
The text is formatted to resemble a prompt template. The bracketed placeholders "[multiple choice question]" and "[answer text]" indicate where user input would be inserted. The prompt is designed to elicit a justification for a given answer, rather than simply stating whether it is correct or incorrect.
### Interpretation
This image represents a prompt used in a system designed to evaluate or generate explanations for multiple-choice questions. It suggests a focus on *conditional* reasoning – explaining *why* an answer *could* be correct, rather than asserting its absolute truth. This is a common approach in educational settings and AI systems that aim to provide more nuanced feedback. The blank "Explanation:" section highlights the core task: generating a reasoned justification. The lavender background is purely aesthetic and doesn't contribute to the informational content. The prompt is a meta-instruction, describing how to respond to a specific type of input.
</details>
Figure 8: Conditional explanation prompt used to generate explanations in Figure 1
B.2 Dataset Details
We can observe in Table 3 that the QA datasets with longer questions are typically harder for the model to answer correctly. We also see that our method, like many other sample-and-aggregate answering methods, generally achieves higher accuracy than the baseline model [43]. This accuracy boost is less pronounced, however, for GPT-4.
For GPT-3.5-Turbo results we generate confidence scores for 250 questions per dataset (or the full dataset if smaller). Due to computational cost we use only 100 questions per dataset when testing GPT-4-Turbo. We use the validation splits for CSQA and TruthQA, and the test splits for MedQA and MMLU.
| Dataset | Avg. Question Length | GPT-3.5 Default | GPT-3.5 Stability | GPT-4 Default | GPT-4 Stability |
| --- | --- | --- | --- | --- | --- |
| CSQA | 151 | 0.79 | 0.80 | 0.84 | 0.88 |
| TruthQA | 329 | 0.54 | 0.64 | 0.85 | 0.91 |
| MedQA | 916 | 0.59 | 0.68 | 0.82 | 0.84 |
| MMLU Law | 1233 | 0.46 | 0.56 | 0.64 | 0.67 |
| MMLU Physics | 172 | 0.57 | 0.72 | 0.92 | 0.92 |
Table 3: Comparing accuracy for default model predictions vs. most confident stability predictions across benchmark datasets. One can observe a clear improvement in accuracy for both GPT-3.5 and GPT-4.
B.3 Evaluation Details
When evaluating confidence methods, it is important to note that performance implicitly depends on the prediction set $\hat{Y}$. For example, a metric may be well calibrated on correct answers but overconfident on incorrect ones, so the same confidence metric would evaluate as worse on a less accurate prediction set. For comparison purposes we therefore use the same set of default LLM predictions (temperature 0) for the GPT-3.5 and GPT-4 results.
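For reference, the risk-coverage curves reported here plot accuracy against coverage, so the AURC used in our tables is a higher-is-better area under the accuracy-coverage curve; a minimal sketch of that computation (function name is ours):

```python
import numpy as np

def risk_coverage_auc(confidences, correct):
    """Area under the accuracy-coverage curve: rank predictions by
    descending confidence and average the running accuracy as coverage
    grows from the single most-confident prediction to the full set."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    hits = np.asarray(correct, dtype=float)[order]
    running_acc = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(running_acc.mean())
```

A well-ordered confidence metric concentrates the errors at low confidence, so the running accuracy stays high for longer and the area grows.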
In order to break possible ties in confidence when evaluating AURC and AUROC, we follow the approach of [36] and add a small amount of Gaussian noise ($\sigma = 10^{-6}$) to each confidence score. We repeat this process $r=10$ times and take the average AURC and AUROC scores. We also follow common practice in previous works by using $M=10$ bins when calculating ECE [2].
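A minimal sketch of the tie-breaking and binned-ECE procedures described above (helper names are ours):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: bin predictions by confidence into
    equal-width bins, then take the coverage-weighted average of
    |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return err

def tie_broken_score(metric, confidences, correct, sigma=1e-6, repeats=10, seed=0):
    """Average a ranking metric over several runs, each adding tiny
    Gaussian noise to the confidences to break exact ties."""
    rng = np.random.default_rng(seed)
    conf = np.asarray(confidences, dtype=float)
    return float(np.mean([metric(conf + rng.normal(0.0, sigma, conf.shape), correct)
                          for _ in range(repeats)]))
```

The noise scale is far below any meaningful confidence gap, so it only randomizes the ordering of exactly tied scores.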
We use OpenAI’s gpt-3.5-turbo-1106 snapshot for GPT-3.5 experiments and gpt-4-1106-preview snapshot for GPT-4. Generating and evaluating confidence scores for each method on one dataset takes on the order of an hour for GPT-3.5-Turbo, and two hours for GPT-4-Turbo using OpenAI’s API.
<details>
<summary>figures/example_curves.png Details</summary>

### Visual Description
## Charts: Risk Coverage and Receiver Operator Curves
### Overview
The image presents two charts side-by-side. The left chart is a "Risk Coverage Curve" and the right chart is a "Receiver Operator Curve" (ROC). Both charts evaluate the performance of different methods (linguistic, tokenprob, consistency, topk, stability) against a baseline, likely in a medical context given the "MEDQA" title. Both charts share a similar color scheme for the methods being evaluated.
### Components/Axes
**Left Chart (Risk Coverage Curve):**
* **Title:** Risk Coverage Curve ↑ (The arrow indicates increasing is better)
* **X-axis:** Coverage (ranging from 0.0 to 1.0)
* **Y-axis:** Accuracy (ranging from 0.0 to 1.0)
* **Legend (top-right):**
* linguistic (AUROC: 0.9007) - Yellow
* tokenprob (AUROC: 0.9214) - Blue
* consistency (AUROC: 0.9267) - Green
* topk (AUROC: 0.9136) - Red
* stability (AUROC: 0.9638) - Purple
* baseline accuracy (0.82) - Black dashed line
**Right Chart (Receiver Operator Curve):**
* **Title:** Receiver Operator Curve ↑ (The arrow indicates increasing is better)
* **X-axis:** False Positive Rate (ranging from 0.0 to 1.0)
* **Y-axis:** True Positive Rate (ranging from 0.0 to 1.0)
* **Legend (top-right):**
* linguistic (AUROC: 0.6704) - Yellow
* tokenprob (AUROC: 0.8149) - Blue
* consistency (AUROC: 0.8286) - Green
* topk (AUROC: 0.7101) - Red
* stability (AUROC: 0.9078) - Purple
* random - Black dashed line
### Detailed Analysis or Content Details
**Left Chart (Risk Coverage Curve):**
* The 'stability' line (purple) shows the highest accuracy across all coverage values, consistently above 0.9. It slopes downward slightly as coverage increases.
* The 'consistency' line (green) is also high, generally above 0.9, and similar to 'stability' in its trend.
* 'tokenprob' (blue) and 'topk' (red) lines are slightly lower, hovering around 0.85-0.95.
* 'linguistic' (yellow) is the lowest of the methods, starting around 0.8 and decreasing more rapidly with increasing coverage.
* The baseline accuracy (black dashed) is a horizontal line at approximately 0.82. All methods outperform the baseline.
**Right Chart (Receiver Operator Curve):**
* The 'stability' line (purple) demonstrates the best performance, curving sharply upwards and reaching a True Positive Rate close to 1.0 with a False Positive Rate below 0.2.
* 'consistency' (green) and 'tokenprob' (blue) show moderate performance, with curves that are less steep than 'stability' but still above the 'random' line.
* 'linguistic' (yellow) and 'topk' (red) have the lowest performance, with curves that are closer to the 'random' line.
* The 'random' line (black dashed) represents the performance of a random classifier, serving as a benchmark.
### Key Observations
* 'Stability' consistently outperforms all other methods on both charts, indicating it is the most robust approach.
* The Risk Coverage Curve shows that higher coverage generally comes at the cost of accuracy, particularly for the 'linguistic' method.
* The ROC curve highlights the ability of each method to discriminate between true positives and false positives.
* The AUROC (Area Under the Receiver Operating Characteristic curve) values provided in the legend quantify the overall performance of each method. Higher AUROC values indicate better performance.
### Interpretation
These charts evaluate the effectiveness of different methods for identifying risks or making predictions in a medical question answering (MEDQA) context. The "Risk Coverage Curve" assesses how well each method can identify a broad range of risks while maintaining accuracy. The "Receiver Operator Curve" evaluates the trade-off between sensitivity (True Positive Rate) and specificity (1 - False Positive Rate).
The consistent superior performance of the 'stability' method suggests that it is the most reliable approach for this task. It achieves high accuracy across a wide range of coverage levels and effectively discriminates between true and false positives. The lower performance of the 'linguistic' method indicates that relying solely on linguistic features may not be sufficient for accurate risk assessment. The baseline accuracy provides a point of reference, and all methods demonstrate improvement over random chance. The difference in AUROC values between the methods is significant, indicating varying degrees of predictive power. The charts suggest that a combination of methods, potentially weighted towards 'stability', could yield the best overall performance.
</details>
Figure 9: Risk coverage (left) and receiver-operator (right) curves for confidence metrics generated on the MedQA questions using GPT-4. Our stability method outperforms others on this dataset as evidenced by larger area under the curves. We can also observe that questions with confidences in the top 50% were all correct.
B.4 Alternate Explanation Plausibility Measures
Inspired by [24], which looks at the true/false token probability an LLM assigns to a given answer being true, we explore evaluating the probability that an explanation is ‘true’. To do this, we provide the model with both question and explanation and ask: ‘Is this the most likely explanation? (T/F)’. We also try asking the question ‘Does the explanation completely describe the question? (T/F)’. We then repeat the experiment in Section 4.2, examining distributions of explanations measured via these probabilities. We find in Figure 10 that these measures fail to properly distinguish between different explanations.
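Each T/F probe reduces to renormalizing the model's probabilities for the ‘T’ and ‘F’ continuations; a minimal sketch (function name is ours):

```python
import math

def true_prob(logprob_t: float, logprob_f: float) -> float:
    """Renormalize the model's probabilities for the 'T' and 'F'
    continuations, returning the share assigned to 'T'."""
    pt, pf = math.exp(logprob_t), math.exp(logprob_f)
    return pt / (pt + pf)
```

Restricting the normalization to just these two tokens discards any probability mass the model places on other continuations.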
<details>
<summary>figures/mmlu_law_explanation_p_likely.png Details</summary>

### Visual Description
## Charts: MMLU Law Probability Distributions
### Overview
The image presents two density plots, side-by-side, comparing the distributions of probabilities for "correct" and "incorrect" answers on the MMLU (Massive Multitask Language Understanding) Law benchmark. The left plot shows the distribution of P(Most Likely), while the right plot shows the distribution of P(Complete). Both plots use density as the y-axis and probability as the x-axis.
### Components/Axes
Each chart shares the following components:
* **Title:** "MMLU Law P(Most Likely)" (left) and "MMLU Law P(Complete)" (right) positioned at the top-center.
* **X-axis Label:** "P(Most Likely)" (left) and "P(Complete)" (right)
* **Y-axis Label:** "Density"
* **X-axis Scale:** Ranges from 0.0 to 1.0, with tick marks at 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Y-axis Scale:** Ranges from 0.0 to approximately 15.0, with tick marks at 0, 5, 10, and 15.
* **Legend:** Located in the top-left corner of each chart.
* "correct" - represented by a dark blue line.
* "incorrect" - represented by a dark red line.
### Detailed Analysis
**Chart 1: MMLU Law P(Most Likely)**
* **Correct (Blue Line):** The density distribution for correct answers shows a strong peak very close to 1.0. The line rises sharply from approximately 0.85 to 1.0, reaching a maximum density of approximately 14 at a probability of 1.0. There is minimal density for probabilities below 0.85.
* **Incorrect (Red Line):** The density distribution for incorrect answers also peaks near 1.0, but is more spread out than the "correct" distribution. The peak is around 0.95, with a maximum density of approximately 12. There is a noticeable tail extending from approximately 0.7 to 0.85, indicating some incorrect answers have lower probabilities.
**Chart 2: MMLU Law P(Complete)**
* **Correct (Blue Line):** The density distribution for correct answers is highly concentrated near 1.0. The line rises sharply from approximately 0.85 to 1.0, reaching a maximum density of approximately 15 at a probability of 1.0. There is very little density for probabilities below 0.85.
* **Incorrect (Red Line):** The density distribution for incorrect answers is also centered around 1.0, but is broader and has a more pronounced tail than the "correct" distribution. The peak is around 0.95, with a maximum density of approximately 13. There is a noticeable tail extending from approximately 0.7 to 0.9, indicating some incorrect answers have lower probabilities.
### Key Observations
* In both charts, the density is significantly higher for probabilities close to 1.0 for both correct and incorrect answers.
* The "correct" answers tend to have a sharper, more concentrated distribution around 1.0 compared to "incorrect" answers.
* The "incorrect" answers exhibit a wider distribution and a longer tail towards lower probabilities, suggesting that incorrect answers are more likely to have lower confidence scores (as indicated by the probability).
* The distributions for both "correct" and "incorrect" answers are heavily skewed towards higher probabilities.
### Interpretation
These charts demonstrate the relationship between the predicted probability (P(Most Likely) and P(Complete)) and the correctness of the answers on the MMLU Law benchmark. The data suggests that higher predicted probabilities are generally associated with correct answers, but this association is not perfect.
The sharper distribution of probabilities for correct answers indicates that the model is more confident in its correct predictions. The broader distribution and tail for incorrect answers suggest that the model sometimes assigns high probabilities to incorrect answers, or that incorrect answers can have a wider range of confidence scores.
The fact that both distributions are heavily skewed towards 1.0 suggests that the model generally assigns high probabilities to its predictions, even when they are incorrect. This could indicate a tendency for overconfidence.
The difference between P(Most Likely) and P(Complete) could be related to the completeness of the answer. P(Complete) might be a more robust metric, as it considers the entire answer, while P(Most Likely) only considers the most probable token. The similarity in the distributions suggests that both metrics provide similar information about the model's confidence and accuracy.
</details>
Figure 10: Empirical distribution of MMLU explanations when measured via GPT-3.5 probability of being ‘most-likely explanation’ (left) and probability of ‘completely describing’ the question (right). One can see that true (blue) and false (red) answer-conditioned explanations are difficult to distinguish.
B.5 Sensitivity to Explanation Prompting
Our stable explanation method reweights explanations based on entailment probability, but if the quality of the sampled explanations is poor to begin with, the resulting distribution will still be inaccurate. Here we discuss the effect of instructing the LLM to generate explanations before or after an answer (i.e., the order of ‘explanation’ and ‘answer’ in the stability explanation prompt in Figure 3). We observe in Table 4 that generating explanations before the answer clearly results in higher-quality explanations, as evidenced by improved performance on selective uncertainty and calibration tasks.
| Dataset | Pre-Answer (Default) AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Post-Answer AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| CSQA | 0.901 | 0.779 | 0.123 | 0.866 | 0.731 | 0.201 |
| TruthQA | 0.801 | 0.853 | 0.21 | 0.792 | 0.839 | 0.254 |
| MedQA | 0.784 | 0.798 | 0.169 | 0.743 | 0.743 | 0.251 |
| MMLU Law | 0.642 | 0.736 | 0.259 | 0.629 | 0.706 | 0.289 |
| MMLU Physics | 0.792 | 0.834 | 0.186 | 0.779 | 0.811 | 0.252 |
Table 4: Comparing stability confidence performance using explanations generated before and after an answer for GPT-3.5. One can clearly observe that explanations generated before the answer (i.e. in chain-of-thought fashion) outperform those generated afterwards across all performance metrics.
B.6 Varying Sample Size
In this section we briefly analyze the effect that the number of sampled explanations has on our confidence metric. In Figure 11 we observe that selective uncertainty performance (AURC and AUROC) saturates quickly for simpler question answering tasks such as CommonsenseQA. On the other hand, the MedQA and MMLU Law datasets both demonstrate steady performance gains up to $M=5$ samples per question. Calibration error gradually decreases for all datasets examined.
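As a simplified illustration of how $M$ sampled explanations might be aggregated into a single confidence score, the sketch below weights each sampled answer by its explanation's entailment probability and normalizes; this is our own schematic of an entailment-weighted posterior, not the paper's exact implementation:

```python
from collections import defaultdict

def stability_confidence(samples):
    """Aggregate (answer, entailment_probability) pairs from M sampled
    explanations: each answer accumulates its explanations' entailment
    mass, the masses are normalized into a posterior over answers, and
    the confidence is the posterior mass of the top answer."""
    mass = defaultdict(float)
    for answer, p_entail in samples:
        mass[answer] += p_entail
    total = sum(mass.values())
    posterior = {a: m / total for a, m in mass.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior[best]

# Hypothetical M=5 samples (answer letter, entailment probability):
answer, conf = stability_confidence(
    [("A", 0.9), ("A", 0.8), ("B", 0.3), ("A", 0.7), ("C", 0.2)])
```

Under this view, increasing $M$ simply refines the Monte Carlo estimate of the posterior, which is consistent with the saturation behavior seen in Figure 11.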
<details>
<summary>figures/aurc_vs_explanations.png Details</summary>

### Visual Description
## Line Chart: AURC vs. Number Explanations
### Overview
This line chart depicts the relationship between the Area Under the Receiver Operating Characteristic curve (AURC) and the number of explanations provided, for five different question answering datasets: CSQA, TruthQA, MedQA, MMLU Law, and MMLU Physics. The x-axis represents the number of explanations (ranging from 0 to 6), and the y-axis represents the AURC score (ranging from 0 to 1).
### Components/Axes
* **Title:** AURC vs. Number Explanations
* **X-axis Label:** Number Explanations (Scale: 0, 1, 2, 3, 4, 5, 6)
* **Y-axis Label:** AURC (Scale: 0, 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1)
* **Legend:** Located at the bottom-center of the chart.
* CSQA (Blue Line)
* TruthQA (Red Line)
* MedQA (Gray Line)
* MMLU Law (Orange Line)
* MMLU Physics (Green Line)
### Detailed Analysis
Here's a breakdown of each data series and their trends:
* **CSQA (Blue Line):** The line is relatively flat, starting at approximately 0.91 at x=1 and remaining around 0.90 throughout the range of explanations, with a slight upward trend.
* (1, 0.91), (2, 0.90), (3, 0.90), (5, 0.90)
* **TruthQA (Red Line):** The line shows a slight upward trend from x=1 to x=3, then plateaus.
* (1, 0.78), (2, 0.79), (3, 0.80), (5, 0.80)
* **MedQA (Gray Line):** The line starts at approximately 0.74 at x=1, increases to around 0.77 at x=3, and then decreases slightly to around 0.76 at x=5.
* (1, 0.74), (2, 0.75), (3, 0.77), (5, 0.76)
* **MMLU Law (Orange Line):** The line rises from approximately 0.60 at x=1 to about 0.63 at x=2, then declines to around 0.58 at x=5, ending lower than it started.
* (1, 0.60), (2, 0.63), (3, 0.62), (5, 0.58)
* **MMLU Physics (Green Line):** The line starts at approximately 0.79 at x=1, decreases to around 0.76 at x=3, and then increases slightly to around 0.78 at x=5.
* (1, 0.79), (2, 0.77), (3, 0.76), (5, 0.78)
### Key Observations
* CSQA consistently exhibits the highest AURC scores across all numbers of explanations.
* MMLU Law ends with a lower AURC than it started with, declining after an initial rise as the number of explanations increases.
* TruthQA and MedQA show relatively stable AURC scores with minor fluctuations.
* MMLU Physics shows a more complex pattern, with an initial decrease followed by a slight increase.
### Interpretation
The chart suggests that providing more explanations does not necessarily improve the AURC score for all datasets. For CSQA, the AURC remains consistently high regardless of the number of explanations, indicating that the model already performs well on this dataset. The decreasing trend for MMLU Law suggests that adding explanations might be detrimental to performance on this dataset, potentially due to the explanations being misleading or irrelevant. The relatively stable performance of TruthQA and MedQA suggests that explanations have a limited impact on these datasets. The fluctuating performance of MMLU Physics could indicate a more nuanced relationship between explanations and performance, where the quality or relevance of the explanations plays a crucial role.
The differences in trends across datasets highlight the importance of tailoring explanation strategies to the specific characteristics of each dataset. It is possible that some datasets benefit from explanations, while others do not, or even suffer from them. Further investigation is needed to understand why these differences exist and to develop more effective explanation methods. The chart also suggests that simply providing more explanations is not a guaranteed way to improve model performance, and that the quality and relevance of the explanations are critical factors.
</details>
(a) AURC vs. Number of Explanations for the stable explanations confidence metric.
<details>
<summary>figures/auroc_vs_explanations.png Details</summary>

### Visual Description
## Line Chart: AUROC vs. Number Explanations
### Overview
This line chart depicts the relationship between the Area Under the Receiver Operating Characteristic curve (AUROC) and the number of explanations provided, for five different question answering datasets. The x-axis represents the number of explanations, ranging from 0 to 6. The y-axis represents the AUROC score, ranging from 0.6 to 1.0. Five distinct lines represent the AUROC scores for each dataset as the number of explanations increases.
### Components/Axes
* **Title:** AUROC vs. Number Explanations
* **X-axis Label:** Number Explanations (Scale: 0, 1, 2, 3, 4, 5, 6)
* **Y-axis Label:** AUROC (Scale: 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0)
* **Legend:** Located at the bottom-center of the chart.
* CSQA (Blue)
* TruthQA (Gray)
* MedQA (Light Gray)
* MMLU Law (Orange)
* MMLU Physics (Teal)
### Detailed Analysis
The chart displays five lines, each representing a different dataset's AUROC score as the number of explanations increases.
* **CSQA (Blue):** The line starts at approximately 0.73 at 1 explanation, rises to approximately 0.78 at 3 explanations, and then plateaus around 0.78-0.80 for 4 and 5 explanations.
* **TruthQA (Gray):** The line begins at approximately 0.75 at 1 explanation, increases to approximately 0.79 at 3 explanations, and then decreases slightly to approximately 0.78 at 5 explanations.
* **MedQA (Light Gray):** The line starts at approximately 0.78 at 1 explanation, increases to approximately 0.81 at 3 explanations, and then remains relatively stable around 0.80-0.82 for 4 and 5 explanations.
* **MMLU Law (Orange):** The line starts at approximately 0.69 at 1 explanation, increases steadily to approximately 0.85 at 5 explanations. This line shows the most significant positive trend.
* **MMLU Physics (Teal):** The line begins at approximately 0.79 at 1 explanation, decreases slightly to approximately 0.78 at 3 explanations, and then increases to approximately 0.80 at 5 explanations.
### Key Observations
* MMLU Law demonstrates the most substantial improvement in AUROC score as the number of explanations increases.
* CSQA, TruthQA, and MedQA show relatively stable AUROC scores after 3 explanations.
* MMLU Physics exhibits a slight initial decrease in AUROC score before stabilizing and increasing.
* Most datasets show an initial increase in AUROC score with the addition of explanations, suggesting that explanations generally improve model performance; MMLU Physics is the exception, with a slight initial decrease.
### Interpretation
The data suggests that providing explanations alongside question-answering models can improve their performance, as measured by AUROC. The degree of improvement varies significantly depending on the dataset. MMLU Law benefits the most from explanations, indicating that this dataset may be particularly sensitive to the reasoning provided. The other datasets show diminishing returns with additional explanations, suggesting that there is a point at which adding more explanations does not significantly improve performance. The initial dip in MMLU Physics could be due to the explanations being initially unhelpful or even misleading, before the model learns to effectively utilize them. This chart highlights the importance of considering the specific characteristics of a dataset when evaluating the effectiveness of explanation-based methods. The consistent upward trend for MMLU Law suggests that this dataset may benefit from more complex or detailed explanations.
</details>
(b) AUROC vs. Number of Explanations for the stable explanations confidence metric.
<details>
<summary>figures/ece_vs_explanations.png Details</summary>

### Visual Description
## Line Chart: ECE vs. Number Explanations
### Overview
This line chart illustrates the relationship between Expected Calibration Error (ECE) and the number of explanations provided, for several different question answering datasets. The chart displays how ECE changes as the number of explanations increases from 1 to 5.
### Components/Axes
* **Title:** "ECE vs. Number Explanations" - positioned at the top-center of the chart.
* **X-axis:** "Number Explanations" - ranging from 0 to 6, with tick marks at 1, 2, 3, 4, and 5.
* **Y-axis:** "ECE" - ranging from 0 to 0.4, with tick marks at 0.05 intervals.
* **Legend:** Located at the bottom-center of the chart, listing the datasets:
* CSQA (Blue)
* TruthQA (Gray)
* MedQA (Dark Gray)
* MMLU Law (Orange)
* MMLU Physics (Teal)
### Detailed Analysis
The chart contains five distinct lines, each representing a different dataset.
* **CSQA (Blue):** The line slopes downward, indicating a decrease in ECE as the number of explanations increases.
* At 1 explanation: ECE ≈ 0.28
* At 2 explanations: ECE ≈ 0.25
* At 3 explanations: ECE ≈ 0.22
* At 4 explanations: ECE ≈ 0.19
* At 5 explanations: ECE ≈ 0.16
* **TruthQA (Gray):** The line is relatively flat, with a slight downward trend.
* At 1 explanation: ECE ≈ 0.26
* At 2 explanations: ECE ≈ 0.24
* At 3 explanations: ECE ≈ 0.22
* At 4 explanations: ECE ≈ 0.20
* At 5 explanations: ECE ≈ 0.18
* **MedQA (Dark Gray):** The line shows a moderate downward trend.
* At 1 explanation: ECE ≈ 0.21
* At 2 explanations: ECE ≈ 0.18
* At 3 explanations: ECE ≈ 0.16
* At 4 explanations: ECE ≈ 0.14
* At 5 explanations: ECE ≈ 0.13
* **MMLU Law (Orange):** The line starts with a high ECE and decreases significantly with more explanations.
* At 1 explanation: ECE ≈ 0.31
* At 2 explanations: ECE ≈ 0.27
* At 3 explanations: ECE ≈ 0.25
* At 4 explanations: ECE ≈ 0.23
* At 5 explanations: ECE ≈ 0.21
* **MMLU Physics (Teal):** The line shows a moderate downward trend, similar to MedQA.
* At 1 explanation: ECE ≈ 0.23
* At 2 explanations: ECE ≈ 0.20
* At 3 explanations: ECE ≈ 0.18
* At 4 explanations: ECE ≈ 0.17
* At 5 explanations: ECE ≈ 0.16
### Key Observations
* MMLU Law consistently exhibits the highest ECE values across all numbers of explanations.
* CSQA shows the most significant reduction in ECE as the number of explanations increases.
* TruthQA demonstrates the least change in ECE with increasing explanations.
* All datasets show a general trend of decreasing ECE with more explanations, suggesting that providing more explanations improves calibration.
### Interpretation
The chart demonstrates a clear inverse relationship between the number of explanations provided and the Expected Calibration Error (ECE) for various question answering datasets. This suggests that increasing the number of explanations helps models become better calibrated, meaning their predicted confidence levels more accurately reflect their actual correctness. The varying degrees of ECE reduction across datasets indicate that some datasets benefit more from explanations than others. The high ECE of MMLU Law suggests that this dataset is particularly challenging to calibrate, and requires more explanations to achieve reliable confidence estimates. The relatively stable ECE of TruthQA suggests that this dataset is already well-calibrated, or that the benefit of additional explanations is minimal. This data is valuable for understanding the impact of explainability techniques on model calibration and for identifying datasets where explainability is most crucial.
</details>
(c) ECE vs. Number of Explanations for the stable explanations confidence metric.
Figure 11: Comparison of stable explanation confidence using different numbers of explanations per question ( $M=\{1,3,5\}$ ). Testing is done on GPT-3.5-Turbo for all five benchmark datasets. One can observe improving but saturating performance for each dataset.
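For reference, the calibration metric plotted in panel (c) can be computed in a few lines. The following is a minimal sketch of Expected Calibration Error using equal-width confidence bins; the bin count and binning scheme are our own assumptions, not details taken from the experiments above:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sample-weighted |accuracy - confidence| gap.

    Bins are half-open (lo, hi]; a confidence of exactly 0 falls in no bin.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # bin weight = fraction of samples in bin
    return ece
```

A perfectly calibrated predictor (e.g., confidence 0.8 on a set of questions it answers correctly 80% of the time) yields an ECE of zero under this scheme.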
B.7 Comparison to TTA
Contemporaneously with this manuscript’s submission, a related method was proposed [28]. The Think Twice before Assure (TTA) method asks for explanations conditioned on different answers, then performs a top-k confidence elicitation with these explanations in the prompt. Although similar in that confidence metrics are generated by conditioning on explanations, their combination of all explanations into a single prompt does not match the ensemble-of-test-time-classifiers view that our method takes. The authors have not yet released code or dataset splits, but we have implemented their method by following the written procedure and using the same prompts (see Figure 12). During our implementation on the shared CSQA dataset, we found the evaluation results for selective uncertainty tasks (AURC, AUROC) to be slightly below what the authors report, most likely due to differences in the specific questions used during testing. Nonetheless, we report the full results of our implementation in Table 5, and note that this metric does appear to have lower ECE in many cases.
| Dataset | AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Acc. $\uparrow$ |
| --- | --- | --- | --- | --- |
| CSQA | 0.885 | 0.688 | 0.104* | 0.736 |
| TruthQA | 0.698 | 0.706 | 0.093* | 0.672* |
| MedQA | 0.641 | 0.581 | 0.207 | 0.505 |
| MMLU Law | 0.574 | 0.657 | 0.148* | 0.456 |
| MMLU Physics | 0.717 | 0.697 | 0.100* | 0.557 |
Table 5: Evaluation for the TTA Confidence metric (Our implementation) on GPT-3.5. Results that outperform our stable explanations metric are marked with an asterisk.
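Our reading of the TTA procedure (one explanation per candidate answer, then a single joint confidence elicitation) can be sketched as follows. The `llm` callable and `top_k_prompt` argument are hypothetical stand-ins; the prompt strings paraphrase Figure 12 rather than reproduce the authors' exact wording:

```python
def tta_confidence(question, candidate_answers, llm, top_k_prompt):
    """Sketch of Think Twice before Assure (TTA): elicit one explanation per
    candidate answer, then run a single top-k confidence prompt that
    conditions on ALL explanations at once (unlike our ensemble view)."""
    explanations = []
    for ans in candidate_answers:
        prompt = (
            "The task is to read the given question and select the most "
            "appropriate answer by indicating the associated letter.\n"
            f"Question: {question}\nAnswer: {ans}\n"
            "Please generate an explanation to try to justify the answer judgement."
        )
        explanations.append(llm(prompt))
    # All explanations go into one confidence prompt, per Figure 12.
    conf_prompt = top_k_prompt + "\n" + "\n".join(
        f"Possible explanation {i + 1}: {e}" for i, e in enumerate(explanations)
    )
    return llm(conf_prompt)
```

The key structural difference from our method is visible here: TTA aggregates explanations inside a single prompt, whereas we treat each model+explanation pair as a separate test-time classifier.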
<details>
<summary>figures/tta_prompt.png Details</summary>

### Visual Description
## Text Document: TTA Prompt Structure
### Overview
The image presents a structured prompt template for the Think Twice before Assure (TTA) method. It outlines two distinct prompts: an "Explanation Prompt" and a "Confidence Prompt," detailing their purpose and expected input/output format. The background is a pale yellow.
### Components/Axes
The document is divided into two main sections, visually separated by rounded rectangles with dark grey borders.
* **TTA Explanation Prompt:** This section describes the task of selecting the best answer to a multiple-choice question and providing a justification.
* **TTA Confidence Prompt:** This section details a prompt for generating multiple possible explanations with associated confidence levels.
The following placeholders are present within the prompts:
* `[multiple choice question]`
* `[answer text]`
* `[explanation 1]` through `[explanation N]`
* `[Top-K Prompt]`
### Detailed Analysis or Content Details
**TTA Explanation Prompt:**
The text reads:
"TTA Explanation Prompt:
The task is to read the given question and select the most appropriate answer by indicating the associated letter.
Question: [multiple choice question]
Answer: [answer text]
Please generate an explanation to try to justify the answer judgement."
**TTA Confidence Prompt:**
The text reads:
"TTA Confidence Prompt:
[Top-K Prompt]
Possible explanation 1: [explanation 1]
Possible explanation 2: [explanation 2]
...
Possible explanation N: [explanation N]"
The ellipsis (...) indicates that the "Possible explanation" lines can be repeated for a variable number of explanations (N).
### Key Observations
The document is a template, relying heavily on placeholders. It's designed to guide the creation of prompts for a system that not only answers questions but also explains *why* it chose that answer, and potentially provides multiple explanations with varying degrees of confidence. The use of brackets `[]` consistently denotes placeholders for dynamic content.
### Interpretation
This document outlines a sophisticated approach to question answering. It moves beyond simply providing a correct answer and emphasizes the importance of explainability and confidence estimation. The "Explanation Prompt" aims to elicit a rationale for the chosen answer, while the "Confidence Prompt" explores multiple possible justifications and implicitly suggests a mechanism for ranking them. This structure is likely intended for building more robust and trustworthy AI systems, where understanding the reasoning behind a decision is as important as the decision itself. The "Top-K Prompt" suggests a method for generating a set of candidate explanations, which could then be evaluated based on various criteria (e.g., coherence, relevance, plausibility). The template is a blueprint for a system designed to mimic human-like reasoning and justification.
</details>
Figure 12: TTA Confidence Prompt
Appendix C Alternative Perspectives of Stable Explanations
C.1 Confidence through the Viewpoint of Transductive Inference
Transductive learning selects a classifier at inference-time based on a combination of training and test data [10, 42, 22]. Typically transductive learning involves fine-tuning some classifier parameter $w$ based on an explicit loss objective. However, we claim that using an LLM to generate a sequence of text before an answer (i.e. an explanation) is an alternate way of doing transductive reasoning. First, note that answering a question after an explanation, such as in chain-of-thought prompting [44], effectively changes the decision boundary of the LLM classifier at inference time. Second, consider that when an LLM generates an explanation, it produces concepts related to those in the question. These additional concepts can be thought of as forcing the LLM at inference time to pay more attention to the decision boundary in the area around the test datum. In-context learning literature, which examines LLM performance after manually inserting demonstrations similar to the test question, has already shown a direct connection between transformer context adjustment and classical fine-tuning behavior [9].
To formalize this perspective, let $D^{t}=\{(x_{1},y_{1}),\dots,(x_{t},y_{t})\}$ be a dataset of sequential data up to time $t$, with $x_{i}\in X\subset\mathbb{R}^{M}$ and labels $y_{i}\in Y\subset\{1,\dots,K\}$. We denote by $D^{t}_{-}=\{(x_{1},y_{1}),\dots,(x_{t-1},y_{t-1}),x_{t}\}$ the dataset without the last label $y_{t}$. We can write our transductive prediction for $x_{t}$ given the data $D^{t}_{-}$, which includes $x_{t}$, as:
$$
\hat{y}_{t}=\underset{w,y}{\operatorname{argmin}}\;\underbrace{\frac{1}{t}\ell(f_{w}(x_{t}),y)+\frac{1}{t}\sum_{i=1}^{t-1}\ell(f_{w}(x_{i}),y_{i})}_{\doteq L(w;(D^{t}_{-},y))}. \tag{17}
$$
If $\ell$ is interpreted as a negative log likelihood, then $L$ can be interpreted as the negative log posterior probability over hypotheses. If we instead optimize over explanations, where $f_{e}(x_{t})=\phi(x_{t},e)$, then the problem reduces to finding an explanation that strongly supports a single answer without biasing predictions on the original test data. The second term in Equation 17 is expensive to compute at inference time, but if some approximation of this training loss existed, it would make the optimization tractable. We hypothesize that if the explanation under consideration is plausible and faithful to the question (as determined using the same LLM), it should not reduce the accuracy of previous decisions too much. We can therefore avoid optimizing over all previous questions and instead optimize over whatever faithfulness measure $g_{\phi}(e)$ we define:
$$
\hat{y}_{t}=\underset{e,y}{\operatorname{argmin}}\;\ell(\phi(x_{t},e),y)+\lambda g_{\phi}(e). \tag{18}
$$
This looks exactly like the typical transductive setting but with a more easily computable ‘transductive prior’.
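As a toy illustration of the relaxed objective in Equation 18, the sketch below minimizes $\ell(\phi(x_{t},e),y)+\lambda g_{\phi}(e)$ jointly over a finite set of candidate explanations and answers. The dictionaries of losses and penalties would come from LLM scores in practice; here they are hypothetical inputs:

```python
def transductive_predict(neg_log_lik, faithfulness_penalty, lam=1.0):
    """Minimize l(phi(x_t, e), y) + lam * g_phi(e) jointly over (e, y).

    neg_log_lik[e][y]       -- loss of answer y under explanation e
    faithfulness_penalty[e] -- the 'transductive prior' term g_phi(e)
    """
    best = None
    for e, losses in neg_log_lik.items():
        for y, loss in losses.items():
            total = loss + lam * faithfulness_penalty[e]
            if best is None or total < best[0]:
                best = (total, e, y)
    _, e_hat, y_hat = best
    return e_hat, y_hat
```

For example, if explanation `"e2"` strongly supports answer `"B"` (low loss) and is highly faithful (low penalty), the joint minimization selects the pair `("e2", "B")`.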
C.2 Confidence through the Viewpoint of Solomonoff Induction
While transductive inference typically finds a single test-time classifier, our method looks for a distribution of likely classifiers. In this sense, our method can be seen as a special case of Solomonoff induction [25]. Solomonoff induction considers how well data-generating programs $H$ (i.e., binary strings run on a Turing machine) explain the test data $D$:
$$
\displaystyle P(H|D)=\frac{P(D|H)P(H)}{P(D|H)P(H)+\sum_{A\neq H}P(D|A)P(A)}, \tag{19}
$$
where $A$ are alternative programs. Solomonoff induction formalizes the principle of Occam’s razor by choosing a universal prior $P(H)$ that gives a higher probability to shorter-length programs. Then to predict new data $D^{\prime}$ given previous observations, one simply computes
$$
\displaystyle P(D^{\prime}|D)=\mathbb{E}_{H}[P(D^{\prime}|H,D)]=\sum_{H}P(D^{\prime}|H,D)P(H|D). \tag{20}
$$
While these Bayesian equations appear simple, Solomonoff induction is provably uncomputable. However, our method can be interpreted as restricting the hypothesis class from the set of all computable programs $H$ to the set of all LLM-interpretable programs $e$. Instead of a prior on program length, we can use the LLM’s prior likelihood of valid sequences in the language, $p_{\phi}(e)$. This restriction makes the calculation tractable, as we can easily approximate expectations over our hypothesis class by sampling explanations from the LLM.
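Restricted to a finite set of sampled explanations, Equations 19 and 20 reduce to a simple Bayesian model average. In the sketch below, `p_data_given_e` plays the role of $P(D|H)$, `prior_e` the role of the LLM prior $p_{\phi}(e)$, and `p_new_given_e` the role of $P(D^{\prime}|H,D)$; all inputs are illustrative toy values, not quantities from our experiments:

```python
def posterior_predictive(p_data_given_e, prior_e, p_new_given_e):
    """Bayesian model average over sampled explanations.

    Normalizes P(D|e) * p_phi(e) over the sampled set (Eq. 19 with the
    hypothesis class restricted to explanations), then mixes the
    per-explanation predictions under the posterior (Eq. 20).
    """
    joint = [pd * pe for pd, pe in zip(p_data_given_e, prior_e)]
    z = sum(joint)
    posterior = [j / z for j in joint]  # P(e | D)
    return sum(p * q for p, q in zip(p_new_given_e, posterior))
```

With a uniform prior and likelihoods 0.9 vs. 0.1 for two explanations, the posterior concentrates 90% of its mass on the first explanation, and the predictive probability is dominated by that explanation's prediction.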
C.3 Confidence through the Viewpoint of Stability
Another recent line of work analyzes LLMs through the lens of stochastic dynamical models [37]. From the perspective of stability analysis, one could interpret our method’s preference for explanations converging to a single answer as a search for fixed points of a specific LLM system. This LLM dynamical system consists of two alternating steps: first generating an explanation conditioned on one of the answers ($e\leftarrow\phi(q,a)$), then generating a new answer based on this explanation ($a^{\prime}\leftarrow\phi(q,e)$). Intuitively, this system mirrors how a human expert may think about a question by considering alternative conclusions one could draw given beliefs about the world. An answer with only a single plausible explanation that strongly supports that same answer (i.e., a decision distribution that collapses to a singleton) forms a stable cycle in this system.
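The alternating two-step system described above can be written as a short fixed-point iteration. Here `phi_explain` and `phi_answer` are hypothetical stand-ins for the conditional generations $e\leftarrow\phi(q,a)$ and $a^{\prime}\leftarrow\phi(q,e)$:

```python
def find_stable_cycle(question, a0, phi_explain, phi_answer, max_iters=10):
    """Iterate e <- phi(q, a); a' <- phi(q, e) until the answer is a fixed point."""
    a = a0
    for _ in range(max_iters):
        e = phi_explain(question, a)       # explanation conditioned on current answer
        a_next = phi_answer(question, e)   # answer implied by that explanation
        if a_next == a:                    # stable cycle: explanation re-derives its answer
            return a, e
        a = a_next
    return None, None                      # no fixed point found within budget
```

An answer that is re-derived from its own explanation is returned as a stable cycle; an answer whose explanation leads the model elsewhere is unstable, which is exactly the behavior our confidence metric penalizes.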