# Cycles of Thought: Measuring LLM Confidence through Stable Explanations
**Authors**:
- Evan Becker (Department of Computer Science)
- Stefano Soatto (Department of Computer Science)
(October 16, 2025)
Abstract
In many critical machine learning (ML) applications it is essential for a model to indicate when it is uncertain about a prediction. While large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, their overconfidence in incorrect responses is still a well-documented failure mode. Traditional methods for ML uncertainty quantification can be difficult to directly adapt to LLMs due to the computational cost of implementation and closed-source nature of many models. A variety of black-box methods have recently been proposed, but these often rely on heuristics such as self-verbalized confidence. We instead propose a framework for measuring an LLM’s uncertainty with respect to the distribution of generated explanations for an answer. While utilizing explanations is not a new idea in and of itself, by interpreting each possible model+explanation pair as a test-time classifier we can calculate a posterior answer distribution over the most likely of these classifiers. We demonstrate how a specific instance of this framework using explanation entailment as our classifier likelihood improves confidence score metrics (in particular AURC and AUROC) over baselines across five different datasets. We believe these results indicate that our framework is a promising way of quantifying uncertainty in LLMs.
1 Introduction
Large language models (LLMs) are known to at times confidently provide wrong answers, which can greatly mislead non-expert users of the model [46, 7]. In some cases an LLM may even ‘hallucinate’ facts altogether [45, 50]. Although scaling generally improves factual accuracy, past work has shown that even the largest models can give incorrect answers to certain types of questions [29].
To prevent these misleading scenarios, one intuitive approach is to have the model also report its confidence (or uncertainty) in the accuracy of its own response. This task, known as uncertainty quantification, has a vast associated literature [1, 15]. In its most naive form, this can entail taking the softmax of prediction logits to calculate a ‘distribution’ over answers. However, in most cases there is no guarantee that this metric should correspond to the actual probability of correctness on a new datum. Empirically this mismatch has been demonstrated for LLM token logits [26, 2].
One might instead hope that by probing the model (e.g. through its weights or activations) one could infer a measure of confidence that somehow aligns with our expectations. However, full access to a large language model is often infeasible due to a combination of proprietary restrictions and computational expense. Recently a range of ‘black-box’ approaches have been proposed that avoid the need for access to internal model information [24, 46, 36]. These approaches typically rely on custom prompting strategies to elicit self-verbalized (linguistic) confidence or generate multiple variations of a response (consistency). While empirically promising, these methods are heuristic and still return overconfident responses in many cases.
We reason that the main issue with existing uncertainty quantification methods for LLMs stems from the underlying inductive assumption that test and training data are sampled from the same distribution. Unfortunately, this is rarely the case, meaning any uncertainty quantification strategy that is well-calibrated on one dataset is not guaranteed to be calibrated on new test data. However, an LLM offers a unique opportunity to adjust its decision boundary transductively at test time via intermediate generated text (explanations). While inserting random text would likely lead to a high-entropy decision distribution, adding relevant facts or logical step-by-step reasoning serves to ‘stabilize’ the sampled answers around an isolated minimum. Indeed, prompts inducing chain-of-thought (CoT) reasoning have already been shown to improve model accuracy in this manner [44]. However, more recent work has shown that even CoT explanations can be biased and may not correspond with the correct answer [41]. If we could somehow distinguish between ‘stable’ and ‘unstable’ explanations, then we would know to what extent to trust their corresponding answer distributions.
In this work we propose a method for generating confidence scores from the distribution of LLM-generated explanations for an answer. This method, which we call stable explanations confidence, can be thought of as computing the posterior predictive distribution by transductive marginalization over test-time classifiers. We illustrate the usefulness of these scores on two common uncertainty quantification tasks: calibration, in which we measure how close confidence is to empirical accuracy, and selective uncertainty, in which we determine how well the scores can discriminate between correct and incorrect predictions. We compare to other recently proposed methods across five datasets of different scope and complexity (CommonsenseQA, TruthfulQA, MedQA, MMLU Professional Law, MMLU Conceptual Physics) using two popular LLMs (GPT-3.5 [5] and GPT-4 [2]). We find that our method on average outperforms baselines on the selective uncertainty task (measured via AUROC and AURC), particularly for more complex question-answering problems.
2 Related Work
In this section we first summarize the uncertainty quantification problem in machine learning. We then highlight key challenges in the natural language generation setting and the ‘confidence gap’ of existing LLMs. Lastly we discuss existing approaches for LLM uncertainty quantification and methods for their evaluation.
2.1 Uncertainty Quantification in Machine Learning
Defining and reasoning about uncertainty has been a long-standing problem in different disciplines including philosophy, statistics, and economics. Many formal representations with unique properties have been proposed (e.g. Dempster-Shafer belief functions, ranking functions, etc. [17]), but in the machine learning setting uncertainty quantification typically relies on the standard language of probability measures. For a classification task we can think of the sequential training data-label pairs $\mathcal{D}:=\{(x_{i},y_{i})\}_{i=1}^{N}$ as the model’s source of knowledge about the world. Given some test datum $x_{N+1}$ , we would like the model to both make a prediction $\hat{y}_{N+1}$ and provide a ‘useful’ confidence score $r_{N+1}\in[0,1]$ . Useful confidence scores should allow models to express their belief in the accuracy of a prediction; a score is called well-calibrated if on average predictions with confidence $r=0.XX$ are correct close to $XX\%$ of the time. If the classification task also specifies cases for which it is better to return no prediction than a wrong one, we can imagine creating a selection rule on confidence scores to determine whether to trust the classifier’s prediction. We will formalize these two related goals when discussing evaluation metrics in Section 4.1.
Uncertainty quantification methods differ from one another based on their assumptions about where uncertainty is coming from. Sources of uncertainty are traditionally categorized into two broad classes: epistemic uncertainty arising from the agent’s incomplete knowledge of the world, and aleatoric uncertainty inherent to the data generating process (e.g. the flip of a coin). In reality, definitions vary among the machine learning community [4] and most methods do not fit neatly into either category. In this work we discuss a few of the most common methods, organized by the underlying assumptions placed on the test data. We make this distinction because without such a fundamental assumption it is impossible to infer anything about the test distribution from training data. For a full discussion and taxonomy of the numerous uncertainty quantification methods in machine learning we refer the reader to survey papers such as [1, 15].
Related Training and Test Worlds.
Most uncertainty quantification methods rely on the fundamental assumption that the test data comes from the same distribution as the training set. Under this type of assumption Bayesian approaches such as Bayesian Neural Networks (BNNs) are popular. BNNs measure epistemic uncertainty through a posterior on the learned weights, which can be reduced as more data is received [33, 23]. Another popular method is that of conformal prediction, which introduces a somewhat dual notion of the conformal set. Under a slightly weaker exchangeability assumption (i.e. that the joint distribution remains the same under permutations of the training and test data), the conformal set of predictions is guaranteed to contain the true label with error probability less than some $\epsilon$ [35]. Weaker predictive models result in larger conformal sets, and so set size can be taken as an indicator of higher model uncertainty. Other methods include looking at the robustness of predictions under semantic-preserving transformations of the input, as mentioned in [15].
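To make the conformal-set idea concrete, the following is a minimal split-conformal sketch for classification, not the paper's method; the nonconformity score (one minus the softmax probability of the true label) and the toy inputs are our own illustrative assumptions.

```python
import numpy as np

def conformal_set(cal_probs, cal_labels, test_prob, eps=0.1):
    """Split conformal prediction for classification (illustrative sketch).

    cal_probs: (n, K) softmax probabilities on a held-out calibration set
    cal_labels: (n,) true labels for the calibration set
    test_prob: (K,) softmax probabilities for the test point
    eps: target error probability
    Returns the set of labels whose nonconformity score does not exceed
    the finite-sample-corrected (1 - eps) calibration quantile.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - eps)) / n),
                    method="higher")
    return [k for k in range(len(test_prob)) if 1.0 - test_prob[k] <= q]

# A confident calibration set yields a small threshold, so only the
# strongly supported test label enters the set.
cal_probs = np.tile([0.9, 0.05, 0.05], (20, 1))
cal_labels = np.zeros(20, dtype=int)
pred_set = conformal_set(cal_probs, cal_labels, np.array([0.95, 0.04, 0.01]))
# pred_set == [0]
```

A weaker model spreads probability mass, inflating the calibration scores and hence the returned set, which is the sense in which set size tracks uncertainty.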
Different Training and Test Worlds.
Small and large differences between training and test distributions are typically denoted as distribution shift and out-of-distribution respectively [47]. In this setting, methods like prior networks attempt to capture this specific notion of distributional uncertainty through an additional prior over predictive distributions, trained explicitly with a dedicated loss objective [31].
2.2 Uncertainty Quantification in LLMs
Recently much attention has been devoted to measuring uncertainty specifically in LLMs [16, 20]. Since LLMs are generative models, uncertainty may be measured with respect to an infinite set of text sequences as opposed to a fixed number of classification labels [4]. Many works, however, use multiple choice question answering tasks to evaluate LLMs using standard classification methodologies [43, 24], and we will follow a similar approach in this work. Issues with using token logits directly to compute confidence are well known. Recent works [2, 24, 38] show that larger models are typically better calibrated on multiple choice datasets than smaller ones, but are still sensitive to question reformulations as well as typical RLHF training strategies. Another recent work [48] notes that language models fail to identify unanswerable questions at a higher rate than humans.
At a high level, existing techniques for LLM confidence elicitation can be classified as either white-box, requiring access to internal model weights and token probabilities, or black-box, using only samples from the model [16]. We choose to summarize inference time interventions below, as training time interventions are often computationally expensive and require strict inductive assumptions.
White-box Methods. Access to the last activation layer of the LLM (token logits) admits calculating token and token sequence probabilities via the softmax function. One can incorporate text sequence probabilities to implement conformal prediction methods [27], or adjust them based on the semantic importance of individual tokens to improve calibration [13]. Surrogate models can also serve as an effective substitute if access to the original model is restricted [36]. Internal activations can also be observed to determine if certain feature directions are more or less truthful [3, 6].
Black-box Methods. Black-box confidence typically uses one or both of the following approaches: Sample+aggregate methods involve analyzing the distributions of multiple responses sampled from the model [46]. Responses can be generated in a variety of ways, such as using chain-of-thought prompting [43], asking for multiple answers in a single response [40], or perturbing the question in-between samples [28]. Confidence can be found by observing the frequency with which answers occur, or by averaging over other metrics [8]. Self-evaluation methods use customized prompts in order for the model to generate its own confidence estimates in natural language [24]. These methods can also be augmented with chain-of-thought or other more complex reasoning steps [11]. Much effort has been put into analyzing how changes in prompt (e.g. by including few-shot examples) affect these confidences [52, 51].
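As a sketch of the sample+aggregate idea, a frequency-based consistency confidence can be computed from the parsed answers of multiple samples; the sampling and answer parsing are assumed to have happened upstream, and the example inputs are hypothetical.

```python
from collections import Counter

def consistency_confidence(sampled_answers):
    """Frequency-based confidence: the modal answer's share of the samples."""
    counts = Counter(sampled_answers)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(sampled_answers)

# e.g. five chain-of-thought samples that parsed to these choices:
answer, conf = consistency_confidence(["B", "B", "A", "B", "C"])
# answer == "B", conf == 0.6
```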
3 Stable Explanations
Given a question, we would like to assign a confidence value to an answer based on how plausible its associated explanations are. Intuitively, humans are confident in an answer when likely explanations exist for it and no other answers have reasonable explanations. However, the space of explanations (variable-length token sequences) is infinite and hard to work with directly. To overcome this, we will first approximate this distribution by sampling a set of explanations from the LLM conditioned on the question, and then reweight based on their logical consistency with the question description. Afterwards we can compute the degree to which explanations support each answer. We can view these two steps as estimating the conditional likelihood of the explanation given the question, and the conditional answer distribution of the test-time model parameterized by this explanation. These two components will allow us to compute a posterior predictive distribution in a Bayesian fashion. We formalize each step in the following subsections, and summarize the complete method in Algorithm 1.
Input: LLM $\phi$ , question $q$ and selected answer $a_{i}\in\mathcal{A}$ , explanation sample size $N$
Output: Confidence estimate $\hat{p}(a_{i}|q)$
for $n=1,\dots,N$ do
&emsp;$e_{n}\sim\phi(\text{prompt}_{explain}(q))$ // sample explanations
&emsp;$\rho_{n}\leftarrow\phi(\text{prompt}_{entail}(q,e_{n}))$ // compute probability that $q\models e_{n}$
end for
$z\leftarrow\sum_{n=1}^{N}\rho_{n}$
$\hat{p}(a_{i}|q)\leftarrow\sum_{n=1}^{N}\frac{\rho_{n}}{z}\,\text{softmax}(\phi(q,e_{n}))_{i}$ // marginalize over explanations
return $\hat{p}(a_{i}|q)$
Algorithm 1 Stable Explanation Confidence Calculation
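Algorithm 1 can be sketched in a few lines of Python. The three callables (`sample_explanation`, `entailment_prob`, `answer_logits`) are hypothetical stand-ins for the three LLM queries, which the paper issues through prompts against $\phi$; here they are stubbed with toy values so the control flow is runnable.

```python
import numpy as np

def stable_confidence(q, answer_idx, n_samples,
                      sample_explanation, entailment_prob, answer_logits):
    """Sketch of Algorithm 1: entailment-weighted average of answer
    probabilities over sampled explanations."""
    rhos, answer_probs = [], []
    for _ in range(n_samples):
        e = sample_explanation(q)              # e_n ~ phi(prompt_explain(q))
        rhos.append(entailment_prob(q, e))     # rho_n ~= P(q |= e_n)
        logits = np.asarray(answer_logits(q, e), dtype=float)
        p = np.exp(logits - logits.max())      # softmax over answer logits
        answer_probs.append(p / p.sum())
    weights = np.asarray(rhos) / sum(rhos)     # normalize: hat{p}(e_n | q)
    # Marginalize over sampled explanations.
    return float(sum(w * p[answer_idx] for w, p in zip(weights, answer_probs)))

# Toy stand-ins: every sampled explanation is fully entailed and favors answer 0.
conf = stable_confidence(
    q="example question", answer_idx=0, n_samples=5,
    sample_explanation=lambda q: "a sampled explanation",
    entailment_prob=lambda q, e: 1.0,
    answer_logits=lambda q, e: [2.0, 0.0, 0.0],
)
# conf == softmax([2, 0, 0])[0] ~= 0.787
```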
Preliminaries.
Consider a multiple choice question $q:=\{x_{1},...,x_{t}\}=x^{t}$ consisting of a sequence of tokens in some alphabet $x_{j}\in\mathcal{A}$ , and a set of possible answers $a\in S\subseteq\mathcal{A}$ which are also some subset of tokens in the same alphabet. We will designate $\phi$ as an LLM, which will take any variable length token sequence as input and output a token logit vector of size $|\mathcal{A}|$ . We use $\phi(s_{1},s_{2})$ to denote the direct concatenation of two token sequences in the LLM input, and $\phi(\text{prompt}(s))$ to denote adding prompt instructions to the input. Lastly, $s\sim\phi$ will be used to denote sampling a token sequence from the LLM.
3.1 Answer Likelihood Conditioned on Explanations
In its default configuration, providing a question to an LLM $\phi$ with no further input yields an answer:
$$
\underset{a\in S}{\operatorname{argmax}}\;\phi(q,\{~\})=a \tag{1}
$$
One can also naively compute a ‘probability distribution’ over possible answers by taking the softmax of token logits produced by the model. We will denote this calculation as
$$
\displaystyle p_{\phi}(a|q):=\text{softmax}(\phi(q,\{~\}))_{i}, \tag{2}
$$
where $i$ denotes the logit index of $a$ . However, these default token probabilities have been shown to be miscalibrated and sensitive to variations in the input [24, 40]. Next, we formally say that explanations, like questions, are also variable length sequences of tokens $e∈\mathcal{A}^{\tau}$ located between question and answer. If the model generates these explanations (like in the chain-of-thought reasoning paradigm [44]) then the sequences can be thought of as a possible trajectory from the question to an answer. While the set of possible trajectories is infinite, we can group explanations into equivalence classes by noting that two semantically identical explanations must support the same answers [30, 37]. This notion leads us to the following idea: characterize the distribution of explanations by looking at the new answers they lead to.
$$
\underset{a^{\prime}\in S}{\operatorname{argmax}}\;\phi(q,e)=a^{\prime} \tag{3}
$$
This idea is related to the semantic entropy method of [26], but here we use the next token distribution $p_{\phi}(a|q,e)$ instead of a pairwise explanation similarity to ‘cluster’ explanations. If we can enumerate all likely explanations, we can calculate the posterior answer probability as follows
$$
\displaystyle\hat{p}(a|q)=\sum_{e}p_{\phi}(a|e,q)p(e|q) \tag{4}
$$
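To make the marginalization in Eq. (4) concrete, here is a toy numeric instance with made-up plausibilities: two explanations support answer (a) and one, with low plausibility, supports (b).

```python
import numpy as np

# Normalized explanation plausibilities p(e|q) for three sampled explanations.
p_e_given_q = np.array([0.6, 0.3, 0.1])
# Each row: the answer distribution p(a|e,q) over three choices under that explanation.
p_a_given_eq = np.array([
    [0.9, 0.05, 0.05],   # plausible explanation supporting answer (a)
    [0.8, 0.10, 0.10],   # another explanation supporting (a)
    [0.1, 0.80, 0.10],   # implausible explanation supporting (b)
])
# Posterior answer probability: sum_e p(a|e,q) p(e|q), i.e. Eq. (4).
posterior = p_e_given_q @ p_a_given_eq
# posterior == [0.79, 0.14, 0.07]: answer (a) retains most of the mass
```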
A key detail omitted so far is how to efficiently approximate the distribution of all ‘stable’ explanations. We will see in the following subsection that this can be achieved using only the LLM $\phi$ .
3.2 Determining Likely Explanations
A naive method for estimating $\hat{p}(e|q)$ would be to sample explanations using a modified prompt (e.g. using a CoT ‘think step-by-step’ approach). Indeed, a number of consistency-based question-answering methods work by sampling and then aggregating explanations and answers in this manner [43, 8]. However, due to the way LLMs are trained, this distribution does not necessarily represent the probability that an explanation actually explains the data in the question [49, 41]. To combat this, we enforce logical consistency by checking the entailment probability of our sampled explanations ( $q\models e$ ), which can be approximated by using the LLM and a modified prompt $\phi_{entail}(q,e)$ [34]. We then reweight sampled explanations using this entailment probability:
$$
\hat{p}(e|q):=\frac{\phi_{ent.}(q,e)}{\sum_{e^{\prime}\in E}\phi_{ent.}(q,e^{\prime})} \tag{5}
$$
We reason that enforcing logical structure prevents trusting explanations that ‘overfit’ to the test datum. For example, while an explanation such as ‘the answer is always (a)’ is syntactically correct and may result in a confidently correct answer for our test question, it would make a useless classifier on previous training data. While we use entailment probability in our main results, an exploration of alternative explanation plausibility calculations can be found in Section B.4.
4 Experiments
To gain insight into the usefulness of LLM-sampled explanations we first examine differences in distributions of explanations conditioned on correct vs. incorrect answers (see Figure 1) and find explanation entailment (Section 3.2) can help distinguish between the two. We then conduct a series of experiments to compare our proposed stable explanation confidence (Algorithm 1) with existing approaches across a set of five benchmark datasets and discuss our findings below.
4.1 Setup
Evaluation Method.
How do we know whether a proposed confidence metric is useful? In line with previous works [24, 46, 36, 40], there are typically two tasks on which uncertainty metrics are evaluated. The first is confidence calibration, where the goal is to produce confidence scores approximating the empirical probability that the model answers the question correctly. Expected calibration error (ECE) [32] attempts to estimate this using differences between the average confidence and accuracy for groups of similarly scored answers; however, ECE can be misleading (see Section 5). We still include this metric in our reports for ease of comparison with previous work. The second related task is typically called selective uncertainty (also known as failure prediction). Here the goal is to create a binary classifier from confidence scores that predicts when the model should return ‘I don’t know’ instead of its original prediction. A variety of classifier metrics can be used, depending on how one chooses to penalize false positive (overconfident) and false negative (underconfident) predictions. In this work we use two of the most common metrics: area under the receiver operating characteristic curve (AUROC) [18], and area under the risk-coverage curve (AURC) [12]. Uninformative (i.e. completely random) confidence scores will have a worst-case AUROC of $0.5$ and a worst-case AURC equal to average model accuracy. The best possible value for both AUROC and AURC is $1.0$ . We include formal definitions for each of these metrics in Appendix A.
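The metrics above can be sketched in a few lines of numpy. This is an illustrative implementation under common conventions, not the paper's formal definitions from Appendix A; in particular, since the text states the best AURC is 1.0, we implement the higher-is-better variant (area under the accuracy-coverage curve rather than risk-coverage).

```python
import numpy as np

def auroc(conf, correct):
    """Mann-Whitney form of the ROC area: probability that a correct
    prediction receives higher confidence than an incorrect one."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    pos, neg = conf[correct], conf[~correct]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def accuracy_coverage_area(conf, correct):
    """Higher-is-better AURC variant: sort by confidence descending and
    average the running accuracy over all coverage levels."""
    order = np.argsort(-np.asarray(conf, float))
    hits = np.asarray(correct, float)[order]
    running_acc = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return running_acc.mean()

def ece(conf, correct, n_bins=10):
    """Expected calibration error: coverage-weighted |accuracy - confidence|
    over equal-width confidence bins."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err
```

A perfectly ranked set of scores gives AUROC 1.0, while a perfectly calibrated but unranked one gives ECE 0, which is why the two tasks are reported separately.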
Datasets and Models.
We evaluate our method using five standard question answering datasets covering a variety of reasoning tasks: CommonsenseQA (CSQA) [39], TruthfulQA [29], MedQA [21], MMLU Professional Law, and MMLU Conceptual Physics [19]. Besides covering a range of topics, these datasets also vary greatly in complexity. As seen in Table 1, the average length of an MMLU law question is almost ten times that of the average CSQA question. Shorter questions typically resemble more traditional classification tasks (e.g. ‘Something that has a long and sharp blade is a? ’ from CSQA), while longer questions typically include descriptions of a specific scenario that require more complex reasoning. We test both our methods and baselines on snapshots of two state-of-the-art models, GPT-3.5-turbo [5] and GPT-4-turbo [2]. Further data and model details can be found in Appendix B.
Compared Metrics.
We use four different baselines for comparison purposes. Token probabilities for each answer can be produced by taking the softmax over the model’s logit vector and are one of the most commonly used confidence metrics during model evaluation [2, 7]. Linguistic and Top-k methods both ask the model for a verbalized confidence estimate directly, the former prompting the model for a single answer and confidence estimate while the latter asks for the $k$ -best guesses and associated confidences [40, 36]. Lastly, the self-consistency method samples multiple responses from the model and approximates confidence via the relative frequency of parsed answers. Here we use a particular variant of this method, CoT-Consistency [43], which uses a zero-shot chain-of-thought prompt to generate responses, and which has been shown to outperform the vanilla method [46]. For comparison purposes we use prompts similar to those selected in previous work, the details of which can be found in Section B.1.
| Dataset | Avg. Length (chars) | Accuracy (GPT-3.5) | Accuracy (GPT-4) |
| --- | --- | --- | --- |
| CSQA | 151 | 0.79 | 0.84 |
| TruthQA | 329 | 0.54 | 0.85 |
| MedQA | 916 | 0.59 | 0.82 |
| MMLU Law | 1233 | 0.46 | 0.64 |
| MMLU Physics | 172 | 0.57 | 0.92 |
Table 1: Average question length and accuracy for each of the datasets tested in this work. One can observe a weak correlation between question length and difficulty, as typically longer questions describe more complex scenarios and logical structure.
4.2 Likely Explanations Not Always Correct
We first illustrate how explanation likelihood, as measured via conditional token log probability, does not always correspond with the correctness of the supported answer. These results align with previous findings differentiating syntactically vs. semantically correct model responses [29, 26], and help us to motivate using entailment probability in our method. First recall that the length-normalized conditional log-likelihood for sequence $x^{t}$ given sequence $s$ is defined as
$$
\displaystyle LL(x^{t}|s):=\frac{1}{t}\sum_{i=1}^{t}\log(P_{\phi}(x_{i}|s,x_{1},x_{2},\dots,x_{i-1})), \tag{6}
$$
which can also be thought of as the average per-token log probability. A higher log-likelihood for an explanation should mean a higher chance of it being sampled by the LLM. We can observe in Figure 1 two distributions of explanations: one set (in blue) is conditioned on answers we know a priori are correct, the second set (in red) is conditioned on incorrect responses. The model prompt for each set is the same and is given in Section B.1. We see that while the mean log-likelihood for correct explanations is slightly higher than that of incorrect explanations, the two distributions are hard to distinguish. In contrast, there is a clearly distinct tail in the distribution of incorrect explanations when measured via entailment probability. This result suggests that we can discount explanations sampled by the LLM that are well written but logically ‘unstable’, hence improving our confidence score.
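Given per-token log probabilities for a generated sequence (e.g. as exposed by an API's `logprobs` output), Eq. (6) is simply their mean; a minimal sketch with hypothetical values:

```python
def length_normalized_ll(token_logprobs):
    """Eq. (6): mean of the per-token conditional log probabilities
    log P(x_i | s, x_1, ..., x_{i-1}) over the generated sequence."""
    return sum(token_logprobs) / len(token_logprobs)

# e.g. a 4-token explanation with these (hypothetical) per-token log probs:
ll = length_normalized_ll([-0.2, -0.5, -0.1, -0.4])
# ll == -0.3
```

Length normalization matters here because without it longer explanations would always score lower, confounding plausibility with verbosity.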
[Figure 1, top panels: histograms over TruthfulQA explanations of token sequence log likelihood (left) and entailment probability $P(q\models e)$ (right), each comparing correct (blue) against incorrect (red) explanations. The two log-likelihood distributions largely overlap, while incorrect explanations show a distinct low-entailment tail.]
[Figure 1, bottom panel: an example from TruthfulQA. Question: “What happens if you wear a hat made out of tin foil?” Wrong answer: “If you wear a tin foil hat, you will be able to hear aliens.” Generated explanation: “It is possible for someone to believe that wearing a hat made out of tin foil could enhance their ability to hear aliens. This belief may stem from conspiracy theories or a belief in extraterrestrial communication.” GPT-3.5 log likelihood: -0.304; GPT-3.5 entailment probability: 0.137.]
Figure 1: Empirical distribution of explanation log likelihoods (top left) and explanation entailment probabilities (top right) generated for the TruthQA dataset using token logits from GPT-3.5-Turbo. Red denotes explanations generated by conditioning on the incorrect answer and blue denotes explanations justifying the correct answer. While the mean likelihoods of the two explanation distributions differ, there is significant overlap. In contrast, the tail of the incorrect explanation distribution is distinct when using entailment probability. The example explanation (lower) suggests we can use this entailment measure to distinguish semantically unlikely explanations in cases where likelihood fails.
4.3 Stable Confidence Improves Selective Uncertainty
For each dataset we evaluate our stability method using both a simple explanation prompt and an explicit chain-of-thought explanation prompt (‘think step by step’) inspired by [43] (see Section B.1). For confidence methods that consider multiple responses (consistency, top-k, and stability) we fix the number of samples/responses considered to the same value (N=K=5) in our main results. We further analyze the effect of changing sample size in Appendix B.
When testing on the GPT-3.5-turbo model, we first observe (Figure 2(a)) that on average both variants of stable explanation confidence outperform baselines on selective uncertainty tasks. Average AURC is 0.784 vs. the next best of 0.761, while average AUROC is 0.802 vs. 0.789. Looking at individual datasets paints a more complete picture: for more complex reasoning tasks such as MMLU law or TruthQA, the improvement in AURC, for example, is 7-9%. In contrast our method performs slightly worse on CSQA and MMLU Physics, both datasets for which average question length is less than 180 characters. For the GPT-4-turbo model (Figure 2(b)) we see that AURC and AUROC improve consistently for each dataset tested. AUROC in particular improves over baselines by about 6% on average, indicating better ability to distinguish between correct and incorrect predictions. ECE is roughly the same as baselines in this case.
| | Method | CSQA | TruthQA | MedQA | MMLU Law | MMLU Physics | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AURC $\uparrow$ | Linguistic | 0.844 | 0.645 | 0.641 | 0.534 | 0.617 | 0.656 |
| | Token Prob. | 0.92 | 0.716 | 0.788 | 0.596 | 0.754 | 0.755 |
| | CoT-Consistency | 0.891 | 0.735 | 0.755 | 0.626 | 0.796 | 0.761 |
| | Top-K | 0.861 | 0.636 | 0.659 | 0.512 | 0.678 | 0.669 |
| | Stability (Ours) | 0.901 | 0.801 | 0.784 | 0.642 | 0.792 | 0.784 |
| | CoT-Stability (Ours) | 0.907 | 0.782 | 0.776 | 0.67 | 0.773 | 0.782 |
| AUROC $\uparrow$ | Linguistic | 0.607 | 0.671 | 0.591 | 0.617 | 0.563 | 0.610 |
| | Token Prob. | 0.793 | 0.735 | 0.768 | 0.667 | 0.748 | 0.742 |
| | CoT-Consistency | 0.763 | 0.805 | 0.781 | 0.751 | 0.847 | 0.789 |
| | Top-K | 0.69 | 0.612 | 0.594 | 0.585 | 0.616 | 0.619 |
| | Stability (Ours) | 0.779 | 0.853 | 0.798 | 0.736 | 0.834 | 0.800 |
| | CoT-Stability (Ours) | 0.767 | 0.837 | 0.794 | 0.792 | 0.818 | 0.802 |
| ECE $\downarrow$ | Linguistic | 0.141 | 0.255 | 0.29 | 0.318 | 0.326 | 0.266 |
| | Token Prob. | 0.18 | 0.358 | 0.3 | 0.37 | 0.312 | 0.304 |
| | CoT-Consistency | 0.109 | 0.152 | 0.157 | 0.207 | 0.127 | 0.150 |
| | Top-K | 0.177 | 0.174 | 0.203 | 0.13 | 0.124 | 0.162 |
| | Stability (Ours) | 0.123 | 0.21 | 0.169 | 0.259 | 0.186 | 0.189 |
| | CoT-Stability (Ours) | 0.142 | 0.19 | 0.168 | 0.213 | 0.167 | 0.176 |
(a) Confidence Elicitation Strategies on GPT-3.5-turbo.
| | Method | CSQA | TruthQA | MedQA | MMLU Law | MMLU Physics | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AURC $\uparrow$ | Linguistic | 0.918 | 0.933 | 0.901 | 0.672 | 0.956 | 0.876 |
| | Token Prob. | 0.911 | 0.932 | 0.928 | 0.792 | 0.978 | 0.908 |
| | CoT-Consistency | 0.911 | 0.924 | 0.929 | 0.797 | 0.978 | 0.908 |
| | Top-K | 0.925 | 0.949 | 0.915 | 0.674 | 0.968 | 0.886 |
| | Stability (Ours) | 0.96 | 0.979 | 0.936 | 0.817 | 0.979 | 0.934 |
| | CoT-Stability (Ours) | 0.945 | 0.967 | 0.964 | 0.781 | 0.984 | 0.928 |
| AUROC $\uparrow$ | Linguistic | 0.724 | 0.747 | 0.679 | 0.56 | 0.644 | 0.671 |
| | Token Prob. | 0.755 | 0.8 | 0.814 | 0.757 | 0.859 | 0.797 |
| | CoT-Consistency | 0.734 | 0.794 | 0.83 | 0.768 | 0.877 | 0.801 |
| | Top-K | 0.736 | 0.849 | 0.709 | 0.601 | 0.758 | 0.731 |
| | Stability (Ours) | 0.875 | 0.948 | 0.818 | 0.782 | 0.87 | 0.859 |
| | CoT-Stability (Ours) | 0.849 | 0.907 | 0.908 | 0.713 | 0.882 | 0.852 |
| ECE $\downarrow$ | Linguistic | 0.147 | 0.116 | 0.115 | 0.248 | 0.092 | 0.144 |
| | Token Prob. | 0.118 | 0.14 | 0.11 | 0.293 | 0.058 | 0.144 |
| | CoT-Consistency | 0.194 | 0.076 | 0.112 | 0.233 | 0.069 | 0.137 |
| | Top-K | 0.116 | 0.109 | 0.192 | 0.131 | 0.148 | 0.139 |
| | Stability (Ours) | 0.117 | 0.077 | 0.158 | 0.262 | 0.083 | 0.139 |
| | CoT-Stability (Ours) | 0.118 | 0.079 | 0.107 | 0.309 | 0.075 | 0.138 |
(b) Confidence Elicitation Strategies on GPT-4-turbo.
Figure 2: Comparison of LLM Confidence Elicitation Strategies. The best performing metric for each dataset is bolded, and the second best underlined. (a) For GPT-3.5-Turbo we see that AURC and AUROC are on average higher than baselines, although for two datasets (CSQA and MMLU Physics) our method is not state-of-the-art. (b) For GPT-4-Turbo we see that our stability or chain-of-thought stability method outperforms baselines on the selective uncertainty tasks (AURC, AUROC) on every dataset; this effect is particularly pronounced for complex logical reasoning tasks such as MMLU Law. ECE is highlighted in red as this evaluation can be misleading [12], but we still include it for transparency (see Section 5 for discussion).
4.4 Ablation Study
We perform an ablation study to isolate the effects of the two key components of our stable explanation method. The first component (entailment only) uses the entailment probability to reweight sampled explanations. The second component (distribution only) treats the explanation-conditioned LLM as a new test-time classifier and records the full answer distribution via conditional token probability. We generate entailment-only confidence by sampling explanations and answers in a CoT-consistency manner and then reweighting with entailment probability. Distribution-only confidences weight each sampled explanation uniformly. We examine the effect of each component on performance below using the same model (GPT-3.5-Turbo) across all datasets. In Table 2, we generally see that the combination of the two components provides higher performance on selective uncertainty tasks than either alone, with the greatest lift on the MedQA and MMLU Law datasets. While calibration and accuracy do not typically improve for the full method, we see an averaging effect between the two components which may make the full method more consistent across datasets.
| | Stability Entail Only | | | | Stability Distr. Only | | | | Stability Full | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Acc. $\uparrow$ | AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Acc. $\uparrow$ | AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Acc. $\uparrow$ |
| CSQA | 0.882 | 0.708 | 0.21 | 0.7 | 0.899 | 0.783 | 0.131 | 0.784 | 0.901 | 0.779 | 0.123 | 0.796 |
| TruthQA | 0.739 | 0.818 | 0.19 | 0.668 | 0.79 | 0.859 | 0.196 | 0.656 | 0.801 | 0.853 | 0.21 | 0.644 |
| MedQA | 0.74 | 0.762 | 0.186 | 0.62 | 0.735 | 0.778 | 0.16 | 0.688 | 0.784 | 0.798 | 0.169 | 0.633 |
| MMLU Law | 0.626 | 0.733 | 0.198 | 0.528 | 0.655 | 0.774 | 0.196 | 0.568 | 0.67 | 0.792 | 0.213 | 0.556 |
| MMLU Physics | 0.777 | 0.812 | 0.146 | 0.668 | 0.79 | 0.832 | 0.164 | 0.723 | 0.792 | 0.834 | 0.186 | 0.719 |
Table 2: Ablation Study isolating the effects of entailment reweighting and explanation-conditioned answer distributions. Selective uncertainty and calibration metrics, as well as accuracy are reported for the GPT-3.5-Turbo model. Best performing metrics are reported in bold, and second-best are underlined. One can generally observe the full method outperforms individual components on AURC and AUROC, while having around the same or slightly worse calibration as our distribution only method.
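To make the ablation concrete, the following is a minimal numerical sketch of how the three variants aggregate sampled explanations into an answer confidence. The function name, array shapes, and `mode` switch are ours for illustration, not the authors' implementation.

```python
import numpy as np

def stable_confidence(answer_dists, entail_probs, mode="full"):
    """Sketch of the three ablation variants.

    answer_dists: [num_explanations, num_answers] array, where row i is the
                  explanation-conditioned answer distribution for explanation i.
    entail_probs: [num_explanations] array of entailment probabilities.
    mode:
      - "entail": collapse each distribution to a one-hot vote
                  (CoT-consistency style), weighted by entailment probability
      - "distr":  average the full distributions with uniform weights
      - "full":   entailment-weighted average of the full distributions
    """
    answer_dists = np.asarray(answer_dists, dtype=float)
    entail_probs = np.asarray(entail_probs, dtype=float)
    if mode == "distr":
        weights = np.ones(len(answer_dists))
    else:
        weights = entail_probs
    if mode == "entail":
        # replace each distribution with a one-hot vote for its argmax answer
        answer_dists = np.eye(answer_dists.shape[1])[answer_dists.argmax(axis=1)]
    weights = weights / weights.sum()
    return weights @ answer_dists  # posterior over answers
```

The returned vector can then be read off at the predicted answer to obtain a confidence score.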
5 Discussion
In this study, we propose a framework for eliciting confidences from large language models (LLMs) by estimating the distribution of semantically likely explanations, which can be thought of as a set of conditional classifiers. We compare our method with four other common confidence metrics across five benchmark datasets and find that our method on average improves the ability to predict incorrect answers (selective uncertainty), particularly for GPT-4-Turbo and for more complex questions such as MMLU Law. We believe that these results encourage thinking about uncertainty with respect to test-time model parameters and data, as opposed to empirical calibration with previously seen data.
Alternate Perspectives.
While the most straightforward description of our stable explanation method is via a Bayesian posterior, there are interesting connections to transductive inference, stability analysis, and, asymptotically, Solomonoff induction. We highlight the transductive connection here and include additional perspectives in Appendix C. Transductive learning optimizes a classifier at inference time based on a combination of training and test data, typically by fine-tuning some classifier parameter against an explicit loss objective [10, 42, 22]. In the LLM setting, one can view fine-tuning an explanation before providing an answer as a way of performing partial transductive inference. While one obviously cannot compute the full loss over all training and test data at inference time, using a logical consistency measure such as entailment probability may effectively approximate this training loss, as it prevents overfitting to the test datum.
Calibration
Regarding the calibration (ECE) results not being state-of-the-art, we stress that calibration metrics rely on the inductive hypothesis that training, test, and calibration data are all drawn from the same distribution, which is neither verifiable nor falsifiable at test time. ECE metrics therefore conflate uncertainty about the answer, which is the confidence measure we wish to quantify, with uncertainty about the validity of the inductive hypothesis, which cannot be quantified. Additionally, previous work such as [12] has demonstrated bias in the metric depending on accuracy and binning strategy. For this reason we indicate the ECE metric in red in the tables, but include the results nonetheless for transparency and ease of comparison.
Limitations and Future Work
A notable exception to the observed trend of improved selective uncertainty occurs when making stable confidence predictions on simpler questions (e.g., the average question lengths of CSQA and MMLU Conceptual Physics are less than half those of the other datasets). We hypothesize that when questions resemble classical inductive classification tasks, the advantage of our test-time computation is less evident. Additionally, our analysis is limited in scope to multiple-choice datasets, leaving open-ended responses to future work. While entailment probability does help discount some logically incorrect explanations (Figure 1), there are still instances where it fails to properly distinguish them. We test some alternatives to explanation faithfulness in Section B.4, but further exploration is needed. Efficiently sampling high-quality explanations remains an open question as well. Our method adjusts the given explanation distribution based on plausibility, but better explanations may exist that are never sampled by the LLM. One possible solution could involve using our entailment probability measure to accept or reject incoming samples, increasing complexity but ensuring higher quality.
References
- Abdar et al. [2021] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information fusion, 76:243–297, 2021.
- Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Azaria and Mitchell [2023] A. Azaria and T. Mitchell. The internal state of an llm knows when it's lying. arXiv preprint arXiv:2304.13734, 2023.
- Baan et al. [2023] J. Baan, N. Daheim, E. Ilia, D. Ulmer, H.-S. Li, R. Fernández, B. Plank, R. Sennrich, C. Zerva, and W. Aziz. Uncertainty in natural language generation: From theory to applications. arXiv preprint arXiv:2307.15703, 2023.
- Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Burns et al. [2022] C. Burns, H. Ye, D. Klein, and J. Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
- Chang et al. [2023] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 2023.
- Chen and Mueller [2023] J. Chen and J. Mueller. Quantifying uncertainty in answers from any language model via intrinsic and extrinsic confidence assessment. arXiv preprint arXiv:2308.16175, 2023.
- Dai et al. [2022] D. Dai, Y. Sun, L. Dong, Y. Hao, S. Ma, Z. Sui, and F. Wei. Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers. arXiv preprint arXiv:2212.10559, 2022.
- Dhillon et al. [2020] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto. A baseline for few-shot image classification. Proc. of the Intl. Conf. on Learning Representation (ICLR), 2020.
- Dhuliawala et al. [2023] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495, 2023.
- Ding et al. [2020] Y. Ding, J. Liu, J. Xiong, and Y. Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4–5, 2020.
- Duan et al. [2023] J. Duan, H. Cheng, S. Wang, C. Wang, A. Zavalny, R. Xu, B. Kailkhura, and K. Xu. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv preprint arXiv:2307.01379, 2023.
- Fu et al. [2023] Y. Fu, L. Ou, M. Chen, Y. Wan, H. Peng, and T. Khot. Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance. arXiv preprint arXiv:2305.17306, 2023.
- Gawlikowski et al. [2023] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al. A survey of uncertainty in deep neural networks. Artificial Intelligence Review, pages 1–77, 2023.
- Geng et al. [2023] J. Geng, F. Cai, Y. Wang, H. Koeppl, P. Nakov, and I. Gurevych. A survey of language model confidence estimation and calibration. arXiv preprint arXiv:2311.08298, 2023.
- Halpern [2017] J. Y. Halpern. Reasoning about uncertainty. MIT press, 2017.
- Hendrycks and Gimpel [2016] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
- Hendrycks et al. [2020] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Huang et al. [2023] Y. Huang, J. Song, Z. Wang, H. Chen, and L. Ma. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236, 2023.
- Jin et al. [2021] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- Joachims et al. [1999] T. Joachims et al. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200–209, 1999.
- Jospin et al. [2022] L. V. Jospin, H. Laga, F. Boussaid, W. Buntine, and M. Bennamoun. Hands-on bayesian neural networks—a tutorial for deep learning users. IEEE Computational Intelligence Magazine, 17(2):29–48, 2022.
- Kadavath et al. [2022] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- Kolmogorov [1965] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of information transmission, 1(1):1–7, 1965.
- Kuhn et al. [2023] L. Kuhn, Y. Gal, and S. Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664, 2023.
- Kumar et al. [2023] B. Kumar, C. Lu, G. Gupta, A. Palepu, D. Bellamy, R. Raskar, and A. Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023.
- Li et al. [2024] M. Li, W. Wang, F. Feng, F. Zhu, Q. Wang, and T.-S. Chua. Think twice before assure: Confidence estimation for large language models through reflection on multiple answers. arXiv preprint arXiv:2403.09972, 2024.
- Lin et al. [2021] S. Lin, J. Hilton, and O. Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Liu et al. [2024] T. Y. Liu, M. Trager, A. Achille, P. Perera, L. Zancato, and S. Soatto. Meaning representations from trajectories in autoregressive models. Proc. of the Intl. Conf. on Learning Representations (ICLR), 2024.
- Malinin and Gales [2018] A. Malinin and M. Gales. Predictive uncertainty estimation via prior networks. Advances in neural information processing systems, 31, 2018.
- Naeini et al. [2015] M. P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intelligence, volume 29, 2015.
- Neal [2012] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- Sanyal et al. [2024] S. Sanyal, T. Xiao, J. Liu, W. Wang, and X. Ren. Minds versus machines: Rethinking entailment verification with language models. arXiv preprint arXiv:2402.03686, 2024.
- Shafer and Vovk [2008] G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
- Shrivastava et al. [2023] V. Shrivastava, P. Liang, and A. Kumar. Llamas know what gpts don’t show: Surrogate models for confidence estimation. arXiv preprint arXiv:2311.08877, 2023.
- Soatto et al. [2023] S. Soatto, P. Tabuada, P. Chaudhari, and T. Y. Liu. Taming AI bots: Controllability of neural states in large language models. arXiv preprint arXiv:2305.18449, 2023.
- Steyvers et al. [2024] M. Steyvers, H. Tejeda, A. Kumar, C. Belem, S. Karny, X. Hu, L. Mayer, and P. Smyth. The calibration gap between model and human confidence in large language models. arXiv preprint arXiv:2401.13835, 2024.
- Talmor et al. [2018] A. Talmor, J. Herzig, N. Lourie, and J. Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
- Tian et al. [2023] K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. D. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975, 2023.
- Turpin et al. [2024] M. Turpin, J. Michael, E. Perez, and S. Bowman. Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2024.
- Vapnik [1995] V. N. Vapnik. The nature of statistical learning theory. Springer, 1995.
- Wang et al. [2022] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Xiao and Wang [2021] Y. Xiao and W. Y. Wang. On hallucination and predictive uncertainty in conditional language generation. arXiv preprint arXiv:2103.15025, 2021.
- Xiong et al. [2023] M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, and B. Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023.
- Yang et al. [2021] J. Yang, K. Zhou, Y. Li, and Z. Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.
- Yin et al. [2023] Z. Yin, Q. Sun, Q. Guo, J. Wu, X. Qiu, and X. Huang. Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153, 2023.
- Yu et al. [2023] Z. Yu, L. He, Z. Wu, X. Dai, and J. Chen. Towards better chain-of-thought prompting strategies: A survey. arXiv preprint arXiv:2310.04959, 2023.
- Zhang et al. [2023] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
- Zhao et al. [2024] X. Zhao, H. Zhang, X. Pan, W. Yao, D. Yu, T. Wu, and J. Chen. Fact-and-reflection (far) improves confidence calibration of large language models. arXiv preprint arXiv:2402.17124, 2024.
- Zhou et al. [2023] H. Zhou, X. Wan, L. Proleev, D. Mincu, J. Chen, K. Heller, and S. Roy. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. arXiv preprint arXiv:2309.17249, 2023.
Appendix A Evaluation of Uncertainty Metrics
In this section we provide formal definitions for each of the confidence evaluation metrics used. Consider the paired dataset $(x_{i},y_{i})∈\mathcal{D}$ where each datapoint $x_{i}$ has associated label $y_{i}$ . Each $y_{i}$ takes on one value in the discrete set $\mathcal{Y}:=\{1,2,...,\ell\}$ . Now our chosen prediction model $\phi$ outputs a prediction $\hat{y}_{i}:=\phi(x_{i})$ and our confidence function $f$ produces a score $f(x_{i},\hat{y}_{i})=r_{i}∈[0,1]$ . We use the indicator variable $c_{i}$ to denote whether the prediction is correct ( $c_{i}:=\mathbf{1}(y_{i}=\hat{y}_{i})$ ). Lastly we define the full sequence of predictions $\hat{Y}$ and confidence predictions $R$ on dataset $\mathcal{D}$ of size $N$ as
$$
\hat{Y}:=\{\hat{y}_{i}=\phi(x_{i})\mid x_{i}\in\mathcal{D}\},\qquad R:=\left\{r_{i}=f(x_{i},\phi(x_{i}))\mid x_{i}\in\mathcal{D}\right\} \tag{7}
$$
Expected Calibration Error (ECE)
To calculate expected calibration error, we first group our data into $M$ partitions based on confidence interval. We denote the set of indices in each partition as:
$$
B_{m}:=\left\{i\ \middle|\ i\in\{1,\dots,N\},\ \frac{m-1}{M}<r_{i}\leq\frac{m}{M}\right\} \tag{9}
$$
Next, the empirical accuracy and average confidence functions for each partition are defined as
$$
\displaystyle Acc(B_{m}):=\frac{1}{|B_{m}|}\sum_{i\in B_{m}}c_{i},\quad Conf(B_{m}):=\frac{1}{|B_{m}|}\sum_{i\in B_{m}}r_{i} \tag{10}
$$
Then the ECE is defined as the following weighted average:
$$
\text{ECE}(R,\hat{Y},M):=\sum_{m=1}^{M}\frac{|B_{m}|}{N}\left|Acc(B_{m})-Conf(B_{m})\right| \tag{11}
$$
The lower this error is, the better calibrated the model should be (with respect to the data distribution). While an easy metric to compute, there is a dependence on hyperparameter $M$ . Another well known issue with ECE is that when accuracy is very high, simply giving a high constant confidence estimate will result in very low calibration error [12, 46]. Despite these drawbacks, we still choose to report the ECE metric as it is intuitive and serves as a common reference point with previous work.
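As a concrete reference, the binning procedure above can be sketched in a few lines. This is a minimal sketch (the function name and interface are ours); bins follow Eq. (9) and are weighted by $|B_m|/N$, the standard convention.

```python
import numpy as np

def expected_calibration_error(conf, correct, M=10):
    """ECE as in Eqs. (9)-(11): partition predictions into M equal-width
    confidence bins ((m-1)/M, m/M] and sum the |B_m|/N-weighted absolute
    gap between per-bin accuracy and per-bin mean confidence."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    N = len(conf)
    ece = 0.0
    for m in range(1, M + 1):
        mask = (conf > (m - 1) / M) & (conf <= m / M)
        if not mask.any():
            continue  # empty bins contribute nothing
        acc = correct[mask].mean()
        avg_conf = conf[mask].mean()
        ece += (mask.sum() / N) * abs(acc - avg_conf)
    return ece
```

Note the dependence on the bin count $M$ discussed above: changing `M` changes which predictions share a bin and can move the reported error.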
Area Under the Risk-Coverage Curve (AURC)
For now, assume that $r_{i}≠ r_{j}~∀ i≠ j$ . Define the subset $R_{≥ r_{i}}$ as
$$
\displaystyle R_{\geq r_{i}}:=\{r\in R\mid r\geq r_{i}\} \tag{12}
$$
We now say that the ordering map $\sigma:\{1,...,N\}→\{1,...,N\}$ is the function that returns the dataset index $i$ of the $k$ th largest element in $R$ . Formally:
$$
\sigma(k):=i\quad s.t.~|R_{\geq r_{i}}|=k \tag{13}
$$
To summarize so far, this ordering essentially gives us the dataset index of the $k$ th most confident prediction. We can now finally define subsets of our most confident predictions as
$$
\hat{Y}_{K}:=\{\hat{y}_{\sigma(k)}\mid k\in\{1,\dots,K\}\} \tag{14}
$$
The risk-coverage curve measures the tradeoff between the size of $\hat{Y}_{K}$ and its accuracy. For each coverage level $h:=K/N∈[0,1]$, we plot the accuracy $Acc(\hat{Y}_{K})∈[0,1]$ to obtain the curve. Naturally $h=1\implies K=N$, so the curve's endpoint is simply the average model accuracy over the entire dataset. If our confidence measure is a good one, we expect higher accuracy when restricting evaluation to a smaller subset of the most confident answers. Formally, the area under the risk-coverage curve (AURC) is
$$
\text{AURC}(R,\hat{Y}):=\frac{1}{N}\sum_{K=1}^{N}Acc(\hat{Y}_{K}) \tag{15}
$$
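Assuming distinct confidence scores as above, the whole construction reduces to sorting predictions by confidence and averaging the running accuracies. A minimal sketch (function name is ours):

```python
import numpy as np

def aurc(conf, correct):
    """Area under the (accuracy-)coverage curve: sort predictions by
    descending confidence, compute Acc(Y_K) for K = 1..N as a running
    mean of correctness, and average over all coverage levels K/N."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-conf)                # most confident first
    sorted_correct = correct[order]
    cum_acc = np.cumsum(sorted_correct) / np.arange(1, len(conf) + 1)
    return cum_acc.mean()
```

A good confidence measure concentrates correct predictions at the top of the ranking, pushing the early values of `cum_acc` (and hence the mean) toward 1.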
Area Under the Receiver Operator Curve (AUROC)
For any binary classification problem, the receiver operator curve plots the tradeoff between false positive rate $\alpha$ (x-axis) and true positive rate $\beta$ (y-axis), obtained by retaining only predictions with scores above some threshold $t$. We denote a thresholded set of predictions as $\hat{Y}_{t}:=\{y_{i}∈\mathcal{D}\mid r_{i}>t\}$, and $t_{\alpha}$ as the threshold such that $\text{FP}(\hat{Y}_{t_{\alpha}})=\alpha$. If we have built a perfect classifier of correct and incorrect predictions, there should exist a threshold $t_{0}$ for which $\hat{Y}_{t_{0}}$ contains all of the predictions the model got right and none that it got wrong. This corresponds to a true positive rate of $\beta=1.0$ for all false positive levels $\alpha∈[0,1]$. Conversely, if confidence scores were generated at random, any $\hat{Y}_{t}$ is likely to contain as many false positives as true positives, so the ROC curve resembles a diagonal line. We therefore want the area under the receiver operator curve to be as close to 1 as possible. Formally, this area is written as
$$
\text{AUROC}(R,\hat{Y}):=\int_{0}^{1}\text{TP}(\hat{Y}_{t_{\alpha}})\,d\alpha \tag{16}
$$
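In practice this integral is equivalent to the standard rank statistic: the probability that a randomly chosen correct prediction receives a higher confidence score than a randomly chosen incorrect one (the Mann-Whitney U formulation), which avoids sweeping thresholds explicitly. A minimal sketch (function name is ours):

```python
import numpy as np

def auroc(conf, correct):
    """AUROC for separating correct from incorrect predictions, computed
    as the fraction of (correct, incorrect) pairs in which the correct
    prediction has the higher confidence score; ties count half."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need both correct and incorrect predictions")
    greater = (pos[:, None] > neg[None, :]).sum()   # strict wins
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, while scores independent of correctness give 0.5, matching the diagonal-line intuition above.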
Appendix B Experimental Details
In this section we discuss the implementation details of LLM prompts, dataset characteristics, and evaluation methods. We also include additional experiments examining the effect of explanation hyperparameters.
B.1 Prompts
In this section we provide the prompts used for each confidence elicitation method. Text in red represents substitutions made to the prompt at inference time, for example adding the text of the specific multiple choice question. For the stable explanations method, in Figure 3 we provide our explanation generation prompt and conditional answer generation prompt. We use the response from the first prompt to generate our default question explanations (discarding the answer that comes after). We then use the logits from the second prompt, conditioned on each explanation, as the posterior answer distribution for that explanation. The entailment probability prompt is the same as in [34]. For the token probability prompt (Figure 4) we use a simple question-and-answer format and take the softmax of next-token logits to determine answer confidence. For the linguistic confidence prompt (Figure 5) we follow the best-performing prompt choice from [36] and parse the returned response for the answer and confidence value. For chain-of-thought consistency confidence we use a zero-shot modified version of the prompt from [14] (Figure 6) to generate multiple explanations and answers (discarding explanations and taking a majority vote over returned answers). We also explore using this prompt to generate explanations (discarding answers instead) for our CoT-stability confidence metric. The top-k confidence prompt is provided in Figure 7; the resulting LLM response is parsed for $k$ confidence values. Lastly, we include the conditional explanation prompt used to generate the correct and incorrect explanations of Figure 1. Unless otherwise noted, the temperature for all generated explanations is set to Temp=0.7 for both the stable explanations and CoT-consistency methods.
<details>
<summary>figures/stability_prompt_v2.png Details</summary>

The figure shows three prompt templates; bracketed red text marks placeholders substituted at inference time.

**Stability Explanation Prompt:**
> Read the given question and select the most appropriate answer by indicating the associated letter. Output strictly following the explanation-then-answer format.
> `Explanation: <detailed reasoning steps> Answer: (letter)`
> Question: [multiple choice question]

**Stability Conditional Answer Prompt:**
> You are an expert analyst considering arguments from different perspectives. Given a question and an argument, choose the correct answer. Answer the question with one of the valid choices. Provide only a single answer.
> Argument: Given the scenario in the question, [explanation]
> Answer: The correct answer is

**Entailment Prompt:**
> Premise: [multiple choice question]
> Hypothesis: [explanation]
> Question: Given the premise, is the hypothesis correct?
> Answer (T/F):
</details>
Figure 3: Stable Explanation Prompts
<details>
<summary>figures/token_prob_prompt.png Details</summary>

The figure shows the token probability prompt, a simple question-answer template; bracketed red text marks the placeholder substituted at inference time.

**Token Prob Confidence Prompt:**
> Question: [multiple choice question]
> Answer:
</details>
Figure 4: Token Probability Prompt
<details>
<summary>figures/lingusitic_prompt.png Details</summary>

### Visual Description
## Text Block: Linguistic Confidence Prompt
### Overview
The image presents a linguistic confidence prompt, designed to assess the confidence level associated with answers to multiple-choice questions. It includes instructions, example questions, answers, and corresponding confidence scores.
### Components/Axes
* **Title:** Linguistic Confidence Prompt
* **Instructions:** A paragraph explaining how to answer the questions and provide a confidence score between 0 and 1.
* **Questions:** Three questions are presented. The first two are example questions, and the third is a placeholder.
* **Answer Choices:** Each question has five answer choices labeled (A) through (E).
* **Answer:** The selected answer for each question is indicated.
* **Confidence:** A numerical confidence score is provided for each answer, ranging from 0 to 1.
### Detailed Analysis
**Instructions:**
> Answer the following question to the best of your ability, and provide a score between 0 and 1 to indicate the confidence you have in your answer. Confidence scores closer to 0 indicate you have less confidence in your answer, while scores closer to 1 indicate more confidence. You must answer the question with one of the valid choices. You must provide only a single answer.
**Question 1:**
* Question: This is a question
* (A) first answer
* (B) second answer
* (C) third answer
* (D) fourth answer
* (E) fifth Answer
* Answer: (D)
* Confidence: 0.4
**Question 2:**
* Question: This is another Question
* (A) first answer
* (B) second answer
* (C) third answer
* (D) fourth answer
* (E) fifth Answer
* Answer: (A)
* Confidence: 0.7
**Question 3:**
* Question: \[multiple choice question]
### Key Observations
* The confidence scores are numerical values between 0 and 1, representing the level of certainty in the given answer.
* The first question has a lower confidence score (0.4) compared to the second question (0.7).
* The third question is a placeholder, indicated by the text "\[multiple choice question]".
### Interpretation
The prompt aims to quantify the subjective confidence level associated with answers to multiple-choice questions. The instructions clarify that a score closer to 0 indicates lower confidence, while a score closer to 1 indicates higher confidence. The example questions demonstrate how to provide an answer and a corresponding confidence score. The third question serves as a template for future questions. The confidence scores of 0.4 and 0.7 suggest varying degrees of certainty in the answers provided for the first two questions.
</details>
Figure 5: Linguistic Confidence Prompt
<details>
<summary>figures/cot_prompt.png Details</summary>

### Visual Description
## Text Block: CoT Explanation Prompt
### Overview
The image is a text block outlining a "CoT Explanation Prompt" which provides instructions and examples for answering multiple-choice questions using step-by-step reasoning.
### Components/Axes
The text block includes the following components:
* **Title:** "CoT Explanation Prompt:"
* **Instructions:** A paragraph explaining the task: to select the most appropriate answer to a question by indicating the associated letter, after providing step-by-step reasoning.
* **Example 1 (Question):** A multiple-choice question with four answer choices (A, B, C, D).
* **Example 1 (Answer):** A step-by-step reasoning process leading to the correct answer (A).
* **Example 2 (Question):** Another multiple-choice question with four answer choices (A, B, C, D).
* **Example 2 (Answer):** A step-by-step reasoning process leading to the correct answer (C).
* **Final Question:** A placeholder for a multiple-choice question.
* **Final Answer:** A placeholder for a step-by-step answer.
### Detailed Analysis
The text content is as follows:
"CoT Explanation Prompt:
You will be given a question at the end, after the examples, for which you are to select the most appropriate answer by indicating the associated letter. Please first output step-by-step reasoning about how to solve the question. Then output the answer. You MUST output exactly one of the provided answers.
Q: This is a question
Which one of the choices is correct, (A), (B), (C) or (D)?
Choices:(A) first answer
(B) second answer
(C) third answer
(D) fourth answer
A: Let's think step by step. Given the scenario, we know that answer cannot be (B) or (C) because... From here we continue our line of reasoning...
Therefore, the answer is (A).
Q: This is another question
Which one of the choices is correct, (A), (B), (C) or (D)?
Choices:(A) first answer
(B) second answer
(C) third answer
(D) fourth answer
A: Let's think step by step. This is more step-by-step reasoning
Therefore the answer is (C).
Q: [multiple choice question]
A: Let's think step by step."
### Key Observations
* The prompt emphasizes the importance of step-by-step reasoning before providing the final answer.
* The examples demonstrate how to eliminate incorrect answer choices and arrive at the correct one.
* The final question and answer are placeholders, indicating that the user needs to provide their own question and solution.
### Interpretation
The "CoT Explanation Prompt" is designed to guide users in answering multiple-choice questions by encouraging them to think through the problem systematically. The examples provide a template for structuring the reasoning process, which can be applied to various types of questions. The prompt aims to improve the accuracy and clarity of answers by promoting a more thoughtful and deliberate approach.
</details>
Figure 6: Chain of Thought Explanation Prompt
<details>
<summary>figures/topk_prompt.png Details</summary>

### Visual Description
## Text Block: Top-K Confidence Prompt Instructions
### Overview
The image presents a set of instructions for a "Top-K Confidence Prompt." It outlines the task, expected input format, and provides an example. The instructions are designed for a system that needs to read a question and provide its top {k} guesses for the answer, along with the associated probabilities.
### Components/Axes
The image contains the following key components:
* **Title:** "Top-K Confidence Prompt:"
* **Task Description:** A paragraph explaining the task of reading a question, selecting the most appropriate answer, and providing {k} best guesses with associated probabilities.
* **Example Format:** Instructions on how to format the guesses (G1, GN) and probabilities (P1, PN).
* **Question Placeholder:** "Question: \[multiple choice question]"
### Detailed Analysis
The text is as follows:
"Top-K Confidence Prompt:
The task is to read the given question and select the most appropriate
answer by indicating the associated letter. Provide your {k} best guesses
and the probability that each is correct (0.0 to 1.0) for the following
question. Give ONLY the guesses and probabilities, no other words or
explanation.
For example:
G1: <first most likely guess, as short as possible; not a complete
sentence, just the guess!>
P1: <the probability between 0.0 and 1.0 that G1 is correct, without any
extra commentary whatsoever; just the probability!>
...
GN: <Nth most likely guess, as short as possible; not a complete sentence,
just the guess!>
PN: <the probability between 0.0 and 1.0 that GN is correct, without any
extra commentary whatsoever; just the probability!>
Question: \[multiple choice question]"
### Key Observations
* The prompt emphasizes providing only the guesses and probabilities, without additional explanation.
* The format for guesses (G1, GN) should be short and not a complete sentence.
* The probabilities (P1, PN) should be between 0.0 and 1.0.
* The value 'k' is used as a placeholder for the number of guesses to provide.
### Interpretation
The instructions are designed to elicit a structured response from a system that can assess the likelihood of different answers to a question. The "Top-K" aspect suggests that the system is expected to rank its confidence in multiple possible answers. The prompt is intended to be used in a scenario where the system needs to provide not just a single answer, but a range of possibilities with associated confidence levels. This is useful in situations where the system is uncertain or where multiple answers are plausible.
</details>
Figure 7: Top-K Confidence Prompt
<details>
<summary>figures/conditional_prompt.png Details</summary>

### Visual Description
## Text Block: Conditional Explanation Prompt
### Overview
The image presents a text block outlining a "Conditional Explanation Prompt." It describes a scenario where, given a question and a candidate answer, the task is to provide an explanation of why the answer could be correct.
### Components/Axes
The text block contains the following elements:
* **Title:** "Conditional Explanation Prompt:"
* **Description:** "Given a question and candidate answer you are to provide an explanation why this answer could be correct."
* **Question Prompt:** "Question: \[multiple choice question]"
* **Candidate Answer Prompt:** "Candidate Answer: \[answer text]"
* **Explanation Prompt:** "Explanation:"
### Detailed Analysis
The text block is structured as a prompt for generating explanations. It sets up a scenario where a question (specifically a multiple-choice question) and a potential answer are provided. The task is then to explain why the given answer could be correct, even if it's not necessarily the only or best answer. The bracketed text indicates placeholders for the actual question and answer.
### Key Observations
The prompt focuses on justifying a given answer rather than finding the correct answer independently. This suggests a task that might involve reasoning, argumentation, or understanding the nuances of a question and its possible answers.
### Interpretation
The prompt is designed to elicit explanations that demonstrate an understanding of the relationship between a question and a candidate answer. It encourages critical thinking and the ability to construct a logical argument in support of a given answer, even if it's not the most obvious or straightforward one. This type of prompt could be used in educational settings to assess a student's ability to reason and justify their thinking.
</details>
Figure 8: Conditional explanation prompt used to generate explanations in Figure 1
B.2 Dataset Details
We can observe in Table 3 that the QA datasets with longer questions are typically harder for the model to answer correctly. Our method, like many other sample-and-aggregate answering methods, generally achieves higher accuracy than the baseline model [43]. This accuracy boost is less pronounced, however, for GPT-4.
For the GPT-3.5-Turbo results we generate confidence scores for 250 questions per dataset (or the maximum dataset size, if smaller). Due to computational cost we use only 100 questions per dataset when testing on GPT-4-Turbo. We use the validation splits for CSQA and TruthQA, and the test splits for the MedQA and MMLU datasets.
| Dataset | Avg. Question Length | GPT-3.5 Default | GPT-3.5 Stability | GPT-4 Default | GPT-4 Stability |
| --- | --- | --- | --- | --- | --- |
| CSQA | 151 | 0.79 | 0.80 | 0.84 | 0.88 |
| TruthQA | 329 | 0.54 | 0.64 | 0.85 | 0.91 |
| MedQA | 916 | 0.59 | 0.68 | 0.82 | 0.84 |
| MMLU Law | 1233 | 0.46 | 0.56 | 0.64 | 0.67 |
| MMLU Physics | 172 | 0.57 | 0.72 | 0.92 | 0.92 |
Table 3: Comparing accuracy for default model predictions vs. most confident stability predictions across benchmark datasets. One can observe a clear improvement in accuracy for both GPT-3.5 and GPT-4.
B.3 Evaluation Details
When evaluating confidence methods, it is important to note that performance implicitly depends on the prediction set $\hat{Y}$. For example, a metric may be well calibrated on correct answers but still overconfident on incorrect ones, meaning the same confidence metric would evaluate as worse on a less accurate prediction set. Therefore, for comparison purposes we use the same set of default LLM predictions (setting Temp=0) for the GPT-3.5 and GPT-4 results.
In order to break possible ties in confidence when evaluating AURC and AUROC, we follow the approach of [36] and add a small amount of Gaussian noise ($\sigma = 10^{-6}$) to each confidence score. We repeat this process $r=10$ times and take the average AURC and AUROC scores. We also follow common practice in previous works by using $M=10$ as the number of bins when calculating ECE [2].
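To make the evaluation procedure concrete, the tie-breaking and binning described above can be sketched as follows. This is a minimal NumPy illustration under our own naming, not the released evaluation code:

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)  # bin index per sample
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # weight each bin's |confidence - accuracy| gap by its sample fraction
            err += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return err

def auroc_with_tiebreak(conf, correct, sigma=1e-6, r=10, seed=0):
    """Average rank-based AUROC over r Gaussian-jittered copies of the scores."""
    rng = np.random.default_rng(seed)
    conf, correct = np.asarray(conf, float), np.asarray(correct, int)
    scores = []
    for _ in range(r):
        c = conf + rng.normal(0.0, sigma, size=conf.shape)  # break ties
        order = np.argsort(c)
        ranks = np.empty(len(c))
        ranks[order] = np.arange(1, len(c) + 1)
        n_pos, n_neg = correct.sum(), (1 - correct).sum()
        u = ranks[correct == 1].sum() - n_pos * (n_pos + 1) / 2  # Mann-Whitney U
        scores.append(u / (n_pos * n_neg))
    return float(np.mean(scores))
```

With $\sigma = 10^{-6}$ the jitter is far smaller than typical confidence gaps, so it only perturbs genuinely tied scores.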
We use OpenAI’s gpt-3.5-turbo-1106 snapshot for GPT-3.5 experiments and gpt-4-1106-preview snapshot for GPT-4. Generating and evaluating confidence scores for each method on one dataset takes on the order of an hour for GPT-3.5-Turbo, and two hours for GPT-4-Turbo using OpenAI’s API.
<details>
<summary>figures/example_curves.png Details</summary>

### Visual Description
## Chart Type: Comparative Analysis of Risk Coverage and Receiver Operator Curves
### Overview
The image presents two charts side-by-side, both related to the MEDQA dataset. The left chart is a "Risk Coverage Curve," displaying accuracy as a function of coverage for different methods. The right chart is a "Receiver Operator Curve," showing the true positive rate against the false positive rate. Both charts compare the performance of 'linguistic', 'tokenprob', 'consistency', 'topk', and 'stability' methods.
### Components/Axes
**Left Chart: Risk Coverage Curve**
* **Title:** Risk Coverage Curve ↑
* **Y-axis:** Accuracy, ranging from 0.0 to 1.0 in increments of 0.2.
* **X-axis:** Coverage, ranging from 0.0 to 1.0 in increments of 0.2.
* **Legend (bottom-center):**
* linguistic (AURC:0.9007) - Blue line
* tokenprob (AURC:0.9214) - Orange line
* consistency (AURC:0.9267) - Green line
* topk (AURC:0.9136) - Red line
* stability (AURC:0.9638) - Purple line
* baseline accuracy (0.82) - Dashed black line
**Right Chart: Receiver Operator Curve**
* **Title:** Reciever Operator Curve ↑
* **Y-axis:** True Positive Rate, ranging from 0.0 to 1.0 in increments of 0.2.
* **X-axis:** False Positive Rate, ranging from 0.0 to 1.0 in increments of 0.2.
* **Legend (bottom-center):**
* linguistic (AUROC:0.6704) - Blue line
* tokenprob (AUROC:0.8149) - Orange line
* consistency (AUROC:0.8286) - Green line
* topk (AUROC:0.7101) - Red line
* stability (AUROC:0.9078) - Purple line
* random - Dashed black line
### Detailed Analysis
**Left Chart: Risk Coverage Curve**
* **baseline accuracy:** A horizontal dashed black line at approximately 0.82.
* **linguistic (AURC:0.9007):** The blue line starts at approximately 0.75 accuracy at 0 coverage, rises sharply, and then gradually decreases to approximately 0.85 accuracy at 1.0 coverage.
* **tokenprob (AURC:0.9214):** The orange line starts at approximately 0.70 accuracy at 0 coverage, rises sharply, and then gradually decreases to approximately 0.90 accuracy at 1.0 coverage.
* **consistency (AURC:0.9267):** The green line starts at approximately 0.0 accuracy at 0 coverage, rises sharply to approximately 0.90 accuracy at 0.1 coverage, and then gradually decreases to approximately 0.83 accuracy at 1.0 coverage.
* **topk (AURC:0.9136):** The red line starts at approximately 0.85 accuracy at 0 coverage, and then gradually decreases to approximately 0.83 accuracy at 1.0 coverage.
* **stability (AURC:0.9638):** The purple line starts at 1.0 accuracy at 0 coverage, and then remains at 1.0 accuracy until 0.1 coverage, and then gradually decreases to approximately 0.92 accuracy at 1.0 coverage.
**Right Chart: Receiver Operator Curve**
* **random:** A dashed black line representing a random classifier, rising linearly from (0.0, 0.0) to (1.0, 1.0).
* **linguistic (AUROC:0.6704):** The blue line rises linearly from (0.0, 0.0) to approximately (0.8, 0.6), and then remains constant.
* **tokenprob (AUROC:0.8149):** The orange line is a step function, rising in steps from (0.0, 0.0) to approximately (0.8, 0.9).
* **consistency (AUROC:0.8286):** The green line rises sharply from (0.0, 0.0) to approximately (0.2, 0.4), then rises more gradually to approximately (0.8, 0.9).
* **topk (AUROC:0.7101):** The red line rises sharply from (0.0, 0.0) to approximately (0.2, 0.3), then rises more gradually to approximately (0.8, 0.9).
* **stability (AUROC:0.9078):** The purple line is a step function, rising in steps from (0.0, 0.0) to approximately (0.2, 0.7), and then to approximately (0.9, 1.0).
### Key Observations
* In the Risk Coverage Curve, 'stability' consistently maintains the highest accuracy across different coverage levels, while 'consistency' starts with low accuracy but improves rapidly with increasing coverage.
* In the Receiver Operator Curve, 'stability' also demonstrates the best performance, closely followed by 'consistency' and 'tokenprob'. 'linguistic' and 'topk' perform relatively worse.
* The baseline accuracy in the Risk Coverage Curve is approximately 0.82.
### Interpretation
The charts provide a comparative analysis of different methods ('linguistic', 'tokenprob', 'consistency', 'topk', and 'stability') in the context of the MEDQA dataset. The Risk Coverage Curve illustrates how accuracy changes with increasing coverage, while the Receiver Operator Curve assesses the trade-off between true positive rate and false positive rate.
The 'stability' method appears to be the most effective overall, achieving high accuracy and a favorable true positive rate/false positive rate balance. 'consistency' and 'tokenprob' also perform well, while 'linguistic' and 'topk' show relatively weaker performance. The baseline accuracy in the Risk Coverage Curve provides a benchmark for evaluating the performance of these methods. The AUROC values in the legend quantify the overall performance of each method in the Receiver Operator Curve.
</details>
Figure 9: Risk coverage (left) and receiver-operator (right) curves for confidence metrics generated on the MedQA questions using GPT-4. Our stability method outperforms the others on this dataset, as evidenced by the larger areas under the curves. We can also observe that the questions with confidence scores in the top 50% were all answered correctly.
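A risk-coverage curve of the kind shown above is produced by sorting predictions from most to least confident and measuring accuracy on each retained fraction. A minimal sketch of this construction (ours, not the released plotting code):

```python
import numpy as np

def risk_coverage_curve(conf, correct):
    """Accuracy on the top-k most confident predictions, for k = 1..N."""
    order = np.argsort(-np.asarray(conf, float))        # most confident first
    correct = np.asarray(correct, float)[order]
    n = len(correct)
    coverage = np.arange(1, n + 1) / n                   # fraction of questions kept
    accuracy = np.cumsum(correct) / np.arange(1, n + 1)  # accuracy on kept set
    return coverage, accuracy
```

A good confidence metric keeps accuracy high at low coverage, which is what the area under this curve (AURC) summarizes.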
B.4 Alternate Explanation Plausibility Measures
Inspired by [24], which examines the probability an LLM assigns to a true/false token when asked whether a given answer is true, we explore evaluating the probability that an explanation is ‘true’. To do this, we provide the model with both the question and explanation and ask: ‘Is this the most likely explanation? (T/F)’. We also try asking: ‘Does the explanation completely describe the question? (T/F)’. We then repeat the experiment in Section 4.2, examining the distributions of explanations measured via these probabilities. We find in Figure 10 that these measures fail to properly distinguish between different explanations.
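A plausibility score of this kind can be sketched by normalizing the model's log-probabilities for the ‘T’ and ‘F’ tokens at the answer position, so that probability mass on all other tokens is ignored. The function name and inputs below are illustrative; in practice the log-probabilities would come from the API's per-token logprob output:

```python
import math

def true_prob(logp_true, logp_false):
    """P('T') renormalized against P('F'), from token log-probabilities."""
    p_t = math.exp(logp_true)
    p_f = math.exp(logp_false)
    return p_t / (p_t + p_f)
```

For example, log-probabilities corresponding to raw token probabilities 0.6 and 0.2 yield a normalized ‘true’ probability of 0.75.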
<details>
<summary>figures/mmlu_law_explanation_p_likely.png Details</summary>

### Visual Description
## Chart: MMLU Law Probability Distributions
### Overview
The image presents two probability distribution charts, each displaying the density of correct and incorrect predictions under different conditions: "Most Likely" and "Complete". Both charts show a high concentration of predictions near a probability of 1.0, with a greater density of correct predictions in that region.
### Components/Axes
**Left Chart:**
* **Title:** MMLU Law P(Most Likely)
* **X-axis:** P(Most Likely), ranging from 0.0 to 1.0 in increments of 0.2.
* **Y-axis:** Density, ranging from 0 to 15 in increments of 5.
* **Legend:** Located in the top-left corner.
* **Blue line:** "correct"
* **Red line:** "incorrect"
**Right Chart:**
* **Title:** MMLU Law P(Complete)
* **X-axis:** P(Complete), ranging from 0.0 to 1.0 in increments of 0.2.
* **Y-axis:** Density, ranging from 0 to 15 in increments of 5.
* **Legend:** Located in the top-right corner.
* **Blue line:** "correct"
* **Red line:** "incorrect"
### Detailed Analysis
**Left Chart: MMLU Law P(Most Likely)**
* **Correct (Blue):** The blue line representing correct predictions starts near 0.0 density at 0.0 probability, remains close to 0 until approximately 0.8 probability, and then sharply increases to a peak density of approximately 16 near 1.0 probability.
* **Incorrect (Red):** The red line representing incorrect predictions starts near 0.0 density at 0.0 probability, remains close to 0 until approximately 0.8 probability, and then sharply increases to a peak density of approximately 14 near 1.0 probability.
**Right Chart: MMLU Law P(Complete)**
* **Correct (Blue):** The blue line representing correct predictions starts near 0.0 density at 0.0 probability, remains close to 0 until approximately 0.8 probability, and then sharply increases to a peak density of approximately 13 near 1.0 probability.
* **Incorrect (Red):** The red line representing incorrect predictions starts near 0.0 density at 0.0 probability, remains close to 0 until approximately 0.8 probability, and then sharply increases to a peak density of approximately 10 near 1.0 probability.
### Key Observations
* Both charts show a similar trend: a low density of predictions until a probability of around 0.8, followed by a sharp increase towards 1.0.
* In both charts, the density of correct predictions is higher than the density of incorrect predictions near a probability of 1.0.
* The "Most Likely" chart shows a slightly higher peak density for both correct and incorrect predictions compared to the "Complete" chart.
### Interpretation
The data suggests that the MMLU Law model, under both "Most Likely" and "Complete" conditions, tends to assign high probabilities to its predictions. The higher density of correct predictions near a probability of 1.0 indicates that the model is generally accurate when it is highly confident. The slight difference between the "Most Likely" and "Complete" conditions suggests that the "Most Likely" setting might lead to slightly more confident predictions overall. The fact that both correct and incorrect predictions cluster near 1.0 indicates that the model can be overconfident in its incorrect predictions as well.
</details>
Figure 10: Empirical distribution of MMLU explanations when measured via GPT-3.5 probability of being ‘most-likely explanation’ (left) and probability of ‘completely describing’ the question (right). One can see that true (blue) and false (red) answer-conditioned explanations are difficult to distinguish.
B.5 Sensitivity to Explanation Prompting
Our stable explanation method reweights explanations based on entailment probability, but if the quality of the sampled explanations is poor to begin with, the resulting distribution will still be inaccurate. Here we discuss the effect of instructing the LLM to generate explanations before or after an answer (i.e., the order of ‘explanation’ and ‘answer’ in the stability explanation prompt in Figure 3). We observe in Table 4 that generating explanations before the answer clearly results in higher quality explanations, as evidenced by improved performance on selective uncertainty and calibration tasks.
| Dataset | Pre-Answer (Default) AURC $\uparrow$ | Pre-Answer AUROC $\uparrow$ | Pre-Answer ECE $\downarrow$ | Post-Answer AURC $\uparrow$ | Post-Answer AUROC $\uparrow$ | Post-Answer ECE $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| CSQA | 0.901 | 0.779 | 0.123 | 0.866 | 0.731 | 0.201 |
| TruthQA | 0.801 | 0.853 | 0.21 | 0.792 | 0.839 | 0.254 |
| MedQA | 0.784 | 0.798 | 0.169 | 0.743 | 0.743 | 0.251 |
| MMLU Law | 0.642 | 0.736 | 0.259 | 0.629 | 0.706 | 0.289 |
| MMLU Physics | 0.792 | 0.834 | 0.186 | 0.779 | 0.811 | 0.252 |
Table 4: Comparing stability confidence performance using explanations generated before and after an answer for GPT-3.5. One can clearly observe that explanations generated before the answer (i.e. in chain-of-thought fashion) outperform those generated afterwards across all performance metrics.
B.6 Varying Sample Size
In this section we briefly analyze the effect that the number of sampled explanations has on our confidence metric. In Figure 11 we observe that selective uncertainty performance (AURC and AUROC) saturates quickly for simpler question answering tasks such as CommonsenseQA. On the other hand, the MedQA and MMLU Law datasets both demonstrate steady performance gains up to $M=5$ samples per question. Calibration error gradually decreases for all datasets examined.
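As a simplified illustration of how $M$ sampled explanations are combined into a single confidence, each explanation can be treated as casting a vote for its conditioned answer, weighted by its entailment probability, and the weights normalized into a distribution over answers. This is a sketch of the aggregation step only; the full method's posterior weighting is described in the main text:

```python
from collections import defaultdict

def stability_confidence(samples):
    """samples: list of (answer, entailment_prob) pairs, one per sampled
    explanation. Returns a normalized confidence distribution over answers."""
    weights = defaultdict(float)
    for answer, p_entail in samples:
        weights[answer] += p_entail  # each explanation votes with its entailment weight
    total = sum(weights.values()) or 1.0  # guard against an empty sample list
    return {a: w / total for a, w in weights.items()}

# Illustrative usage with M=3 hypothetical sampled explanations:
dist = stability_confidence([("A", 0.9), ("A", 0.8), ("B", 0.3)])
```

With more samples $M$, the empirical distribution of explanations is better covered, which is consistent with the saturating gains observed in Figure 11.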
<details>
<summary>figures/aurc_vs_explanations.png Details</summary>

### Visual Description
## Chart: AURC vs. Number Explanations
### Overview
The image is a line chart comparing the AURC (Area Under the ROC Curve) values for five different datasets (CSQA, TruthQA, MedQA, MMLU Law, and MMLU Physics) against the number of explanations provided (ranging from 1 to 5). The chart shows how the AURC changes as the number of explanations increases for each dataset.
### Components/Axes
* **Title:** AURC vs. Number Explanations
* **X-axis:** Number Explanations, with ticks at 0, 1, 2, 3, 4, 5, and 6.
* **Y-axis:** AURC, with ticks at 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, and 1.
* **Legend:** Located at the bottom of the chart, it identifies the datasets by color:
* Blue: CSQA
* Orange: TruthQA
* Gray: MedQA
* Yellow: MMLU Law
* Light Blue: MMLU Physics
### Detailed Analysis
* **CSQA (Blue):** The line is almost flat, indicating a stable AURC.
* At 1 explanation: AURC ≈ 0.89
* At 3 explanations: AURC ≈ 0.90
* At 5 explanations: AURC ≈ 0.90
* **TruthQA (Orange):** The line is relatively flat.
* At 1 explanation: AURC ≈ 0.78
* At 3 explanations: AURC ≈ 0.79
* At 5 explanations: AURC ≈ 0.79
* **MedQA (Gray):** The line shows an upward trend.
* At 1 explanation: AURC ≈ 0.75
* At 3 explanations: AURC ≈ 0.73
* At 5 explanations: AURC ≈ 0.79
* **MMLU Law (Yellow):** The line shows an upward trend.
* At 1 explanation: AURC ≈ 0.60
* At 3 explanations: AURC ≈ 0.64
* At 5 explanations: AURC ≈ 0.67
* **MMLU Physics (Light Blue):** The line shows a slight upward trend.
* At 1 explanation: AURC ≈ 0.76
* At 3 explanations: AURC ≈ 0.80
* At 5 explanations: AURC ≈ 0.79
### Key Observations
* CSQA consistently has the highest AURC across all numbers of explanations.
* MMLU Law has the lowest AURC, but it shows the most significant improvement as the number of explanations increases.
* TruthQA remains relatively stable across different numbers of explanations.
* MedQA shows a moderate increase in AURC as the number of explanations increases.
* MMLU Physics shows a slight increase in AURC as the number of explanations increases.
### Interpretation
The chart suggests that the number of explanations can have varying impacts on the performance (measured by AURC) of different question-answering datasets. CSQA appears to be the most robust, maintaining high performance regardless of the number of explanations. MMLU Law, while starting with the lowest performance, benefits the most from additional explanations. The other datasets show more modest improvements or remain relatively stable. This could indicate that the type of questions or the nature of the data in each dataset influences how effectively explanations can improve the model's accuracy.
</details>
(a) AURC vs. Number of Explanations for the stable explanations confidence metric.
<details>
<summary>figures/auroc_vs_explanations.png Details</summary>

### Visual Description
## Chart: AUROC vs. Number Explanations
### Overview
The image is a line chart comparing the Area Under the Receiver Operating Characteristic curve (AUROC) against the number of explanations for five different datasets: CSQA, TruthQA, MedQA, MMLU Law, and MMLU Physics. The x-axis represents the number of explanations, ranging from 0 to 6, while the y-axis represents the AUROC score, ranging from 0.6 to 1.
### Components/Axes
* **Title:** AUROC vs. Number Explanations
* **X-axis:** Number Explanations, with tick marks at 0, 1, 2, 3, 4, 5, and 6.
* **Y-axis:** AUROC, with tick marks at 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, and 1.
* **Legend:** Located at the bottom of the chart, it identifies the datasets by color:
* Blue: CSQA
* Orange: TruthQA
* Gray: MedQA
* Yellow: MMLU Law
* Light Blue: MMLU Physics
### Detailed Analysis
* **CSQA (Blue):** Starts at approximately 0.76 AUROC with 1 explanation, increases to approximately 0.78 with 3 explanations, and ends at approximately 0.79 with 5 explanations. The trend is slightly upward.
* **TruthQA (Orange):** Starts at approximately 0.83 AUROC with 1 explanation, remains relatively stable at approximately 0.83 with 3 explanations, and increases to approximately 0.85 with 5 explanations. The trend is slightly upward.
* **MedQA (Gray):** Starts at approximately 0.76 AUROC with 1 explanation, increases to approximately 0.80 with 3 explanations, and decreases to approximately 0.79 with 5 explanations. The trend is relatively flat.
* **MMLU Law (Yellow):** Starts at approximately 0.70 AUROC with 1 explanation, increases to approximately 0.75 with 3 explanations, and ends at approximately 0.78 with 5 explanations. The trend is upward.
* **MMLU Physics (Light Blue):** Starts at approximately 0.76 AUROC with 1 explanation, increases to approximately 0.78 with 3 explanations, and ends at approximately 0.79 with 5 explanations. The trend is slightly upward.
### Key Observations
* TruthQA consistently has the highest AUROC scores across all numbers of explanations.
* MMLU Law generally has the lowest AUROC scores.
* The AUROC scores for all datasets tend to increase slightly as the number of explanations increases from 1 to 5.
* MedQA shows a slight decrease in AUROC from 3 to 5 explanations.
### Interpretation
The chart suggests that increasing the number of explanations generally leads to a slight improvement in AUROC scores for the datasets analyzed. However, the effect is not uniform across all datasets, as seen with MedQA, which shows a slight decrease in performance with more explanations. TruthQA consistently outperforms the other datasets, indicating it may be more robust or better suited to the explanation methods used. The relatively small changes in AUROC with increasing explanations suggest that the number of explanations is not the sole determinant of model performance, and other factors such as the quality or relevance of the explanations may play a more significant role.
</details>
(b) AUROC vs. Number of Explanations for the stable explanations confidence metric.
<details>
<summary>figures/ece_vs_explanations.png Details</summary>

### Visual Description
Line chart of Expected Calibration Error (ECE) versus the number of explanations (x-axis: 0 to 6; y-axis: 0 to 0.4) for five datasets, identified by color in the legend: CSQA (blue), TruthQA (orange), MedQA (gray), MMLU Law (yellow), and MMLU Physics (light blue). Approximate values:

| Dataset | 1 Expl. | 3 Expl. | 5 Expl. |
| --- | --- | --- | --- |
| CSQA | 0.19 | 0.15 | 0.12 |
| TruthQA | 0.28 | 0.24 | 0.21 |
| MedQA | 0.24 | 0.20 | 0.17 |
| MMLU Law | 0.30 | 0.25 | 0.22 |
| MMLU Physics | 0.24 | 0.20 | 0.18 |

All datasets show decreasing ECE as the number of explanations increases, with MMLU Law generally highest, CSQA generally lowest, and MedQA nearly overlapping MMLU Physics. The trend indicates that the models become better calibrated (confidence aligning more closely with accuracy) as they generate more explanations, with per-dataset differences likely reflecting the varying difficulty and characteristics of the questions.
</details>
(c) ECE vs. Number of Explanations for the stable explanations confidence metric.
Figure 11: Comparison of stable explanation confidence using different numbers of explanations per question ( $M=\{1,3,5\}$ ). Testing is done on GPT-3.5-Turbo for all five benchmark datasets. One can observe improving but saturating performance for each dataset.
B.7 Comparison to TTA
Contemporaneously with this manuscript’s submission, a related method was proposed [28]. Think-Twice-before-Assure (TTA) asks for explanations conditioned on different answers, then performs a top-k confidence elicitation with these explanations in the prompt. While similar in the sense that confidence metrics are generated by conditioning on explanations, combining the explanations into a single prompt does not match the ensemble-of-test-time-classifiers view that our method takes. The authors have not yet released code or dataset splits, so we implemented their method by following the written procedure and using the same prompts (see Figure 12). On the shared CSQA dataset, our implementation’s results on the selective uncertainty tasks (AURC, AUROC) are slightly below what the authors report, most likely due to differences in the specific questions used during testing. Nonetheless we report the full results of our implementation in Table 5, and note that this metric does achieve lower ECE in many cases.
| Dataset | AURC $\uparrow$ | AUROC $\uparrow$ | ECE $\downarrow$ | Acc. $\uparrow$ |
| --- | --- | --- | --- | --- |
| CSQA | 0.885 | 0.688 | 0.104* | 0.736 |
| TruthQA | 0.698 | 0.706 | 0.093* | 0.672* |
| MedQA | 0.641 | 0.581 | 0.207 | 0.505 |
| MMLU Law | 0.574 | 0.657 | 0.148* | 0.456 |
| MMLU Physics | 0.717 | 0.697 | 0.100* | 0.557 |
Table 5: Evaluation for the TTA Confidence metric (Our implementation) on GPT-3.5. Results that outperform our stable explanations metric are marked with an asterisk.
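As a rough sketch of the TTA procedure as we understand it from the written description, the two prompts of Figure 12 can be assembled as follows. Here `llm` is a placeholder for any chat-completion call, and the control flow is our reconstruction, not the authors' code:

```python
# Sketch of the Think-Twice-before-Assure (TTA) procedure [28], reconstructed
# from the written description. `llm` is a placeholder for an LLM call.

def tta_confidence(llm, question, answer_choices):
    # Step 1: elicit an explanation conditioned on each candidate answer,
    # using the TTA Explanation Prompt template.
    explanations = []
    for ans in answer_choices:
        prompt = (
            "The task is to read the given question and select the most "
            "appropriate answer by indicating the associated letter.\n"
            f"Question: {question}\n"
            f"Answer: {ans}\n"
            "Please generate an explanation to try to justify the answer judgement."
        )
        explanations.append(llm(prompt))

    # Step 2: a single top-k confidence elicitation that lists every
    # candidate explanation, following the TTA Confidence Prompt template.
    listing = "\n".join(
        f"Possible explanation {i + 1}: {e}" for i, e in enumerate(explanations)
    )
    topk_prompt = f"[Top-K Prompt]\nQuestion: {question}\n{listing}"
    return llm(topk_prompt)
```

Note that all explanations end up in one prompt, which is exactly the design difference from our per-explanation ensemble of test-time classifiers.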
<details>
<summary>figures/tta_prompt.png Details</summary>

### Visual Description
Two text prompts. The **TTA Explanation Prompt** reads: "The task is to read the given question and select the most appropriate answer by indicating the associated letter.", followed by the placeholders `Question: [multiple choice question]` and `Answer: [answer text]`, then "Please generate an explanation to try to justify the answer judgement." The **TTA Confidence Prompt** consists of a `[Top-K Prompt]` placeholder followed by the list `Possible explanation 1: [explanation 1]` through `Possible explanation N: [explanation N]` (the ellipsis indicating an arbitrary number of explanations). The explanation prompt elicits a justification for a chosen answer; the confidence prompt aggregates multiple explanations for a top-k confidence elicitation.
</details>
Figure 12: TTA Confidence Prompt
Appendix C Alternative Perspectives of Stable Explanations
C.1 Confidence through the Viewpoint of Transductive Inference
Transductive learning selects a classifier at inference-time based on a combination of training and test data [10, 42, 22]. Typically transductive learning involves fine-tuning some classifier parameter $w$ based on an explicit loss objective. However, we claim that using an LLM to generate a sequence of text before an answer (i.e. an explanation) is an alternate way of doing transductive reasoning. First, note that answering a question after an explanation, such as in chain-of-thought prompting [44], effectively changes the decision boundary of the LLM classifier at inference time. Second, consider that when an LLM generates an explanation, it produces concepts related to those in the question. These additional concepts can be thought of as forcing the LLM at inference time to pay more attention to the decision boundary in the area around the test datum. In-context learning literature, which examines LLM performance after manually inserting demonstrations similar to the test question, has already shown a direct connection between transformer context adjustment and classical fine-tuning behavior [9].
To formalize this perspective, let $D^{t}=\{(x_{1},y_{1}),...,(x_{t},y_{t})\}$ be a dataset of sequential data up to time $t$, with $x_{i}\in X\subset\mathbb{R}^{M}$ and labels $y_{i}\in Y\subset\{1,...,K\}$. We denote with $D^{t}_{-}=\{(x_{1},y_{1}),...,(x_{t-1},y_{t-1}),x_{t}\}$ the dataset without the last label $y_{t}$. We can write our transductive prediction for $x_{t}$ given the data $D^{t}_{-}$ (which includes $x_{t}$) as:
$$
\displaystyle\hat{y}_{t}=~\underset{w,y}{\rm argmin}~{}\underbrace{\frac{1}{t}\ell(f_{w}(x_{t}),y)+\frac{1}{t}\sum_{i=1}^{t-1}\ell(f_{w}(x_{i}),y_{i})}_{\doteq L(w;(D^{t}_{-},y))}. \tag{17}
$$
If $\ell$ is interpreted as a log likelihood, then $L$ can be interpreted as the negative log posterior probability over hypotheses. If we instead optimize over explanations, where $f_{e}(x_{t})=\phi(x_{t},e)$, then the problem reduces to finding an explanation that strongly supports a single answer without biasing predictions on the original test data. The second term in Equation 17 is expensive to compute at inference time, but if some approximation of this training loss existed it would make the optimization tractable. We hypothesize that if the explanation under consideration is plausible and faithful to the question (as determined using the same LLM), it should not substantially reduce the accuracy of previous decisions. Therefore we can avoid optimizing over all previous questions and instead optimize over whatever faithfulness measure $g_{\phi}(e)$ we define:
$$
\displaystyle\hat{y}_{t}=~\underset{e,y}{\rm argmin}~{}\ell(\phi(x_{t},e),y)+\lambda g_{\phi}(e) \tag{18}
$$
This looks exactly like the typical transductive setting but with a more easily computable ‘transductive prior’.
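A minimal sketch of the relaxed objective in Equation 18, where `answer_prob` and `faithfulness` are hypothetical stand-ins for the LLM-derived quantities $p_{\phi}(y\,|\,x_{t},e)$ and $g_{\phi}(e)$:

```python
# Sketch of Equation 18: jointly select the (explanation, answer) pair that
# minimizes answer loss plus a weighted faithfulness penalty. The scoring
# functions are hypothetical stand-ins for LLM-derived quantities.
import math

def select_transductive(explanations, answers, answer_prob, faithfulness, lam=1.0):
    """answer_prob(e, y): probability of answer y given explanation e;
    faithfulness(e): nonnegative penalty g_phi(e) for unfaithful explanations."""
    best, best_obj = None, math.inf
    for e in explanations:
        for y in answers:
            # ell is a negative log likelihood, matching the Bayesian reading of L.
            obj = -math.log(answer_prob(e, y)) + lam * faithfulness(e)
            if obj < best_obj:
                best, best_obj = (e, y), obj
    return best, best_obj
```

The double loop makes explicit that both the explanation and the answer are decision variables, as in the joint argmin of Equation 18.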
C.2 Confidence through the Viewpoint of Solomonoff Induction
While transductive inference typically finds a single test-time classifier, our method looks for a distribution over likely classifiers. In this sense, our method can be seen as a special case of Solomonoff induction [25]. Solomonoff induction considers how well data-generating programs $H$ (i.e., binary strings run on a Turing machine) explain the test data $D$:
$$
\displaystyle P(H|D)=\frac{P(D|H)P(H)}{P(D|H)P(H)+\sum_{A\neq H}P(D|A)P(A)}, \tag{19}
$$
where $A$ are alternative programs. Solomonoff induction formalizes the principle of Occam’s razor by choosing a universal prior $P(H)$ that gives a higher probability to shorter-length programs. Then to predict new data $D^{\prime}$ given previous observations, one simply computes
$$
\displaystyle P(D^{\prime}|D)=\mathbb{E}_{H}[P(D^{\prime}|H,D)]=\sum_{H}P(D^{\prime}|H,D)P(H|D). \tag{20}
$$
While these Bayesian equations seem simple, Solomonoff induction is provably uncomputable. However, our method can be interpreted as restricting the hypothesis class from the set of all computable programs $H$ to the set of all LLM-interpretable programs $e$. Instead of a prior on program length, we can use the LLM’s prior likelihood of valid sequences in the language, $p_{\phi}(e)$. This restriction makes the calculations tractable, as we can easily approximate expectations over the hypothesis class by sampling explanations from the LLM.
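Under this restriction, Equations 19 and 20 reduce to a finite sum over sampled explanations. A sketch, with `prior` and `lik` as hypothetical stand-ins for $p_{\phi}(e)$ and $P(D\,|\,e)$:

```python
# Sketch of Equations 19-20 with the hypothesis class restricted to a finite
# set of LLM-sampled explanations e. `prior(e)` stands in for p_phi(e) and
# `lik(d, e)` for P(d | e); both are hypothetical scoring functions.

def posterior_predictive(explanations, data, new_datum, prior, lik):
    # Equation 19: unnormalized posterior weight for each explanation,
    # normalized against the alternatives in the sampled set.
    weights = [lik(data, e) * prior(e) for e in explanations]
    z = sum(weights)
    post = [w / z for w in weights]
    # Equation 20: posterior-weighted average of the predictive likelihood.
    return sum(p * lik(new_datum, e) for p, e in zip(post, explanations))
```

Because the sum runs over a handful of samples rather than all computable programs, the expectation in Equation 20 becomes a cheap Monte Carlo estimate.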
C.3 Confidence through the Viewpoint of Stability
Another recent line of work analyzes LLMs through the lens of stochastic dynamical models [37]. Through the perspective of stability analysis, one can interpret our method’s preference for explanations converging to a single answer as a search for fixed points of a specific LLM system. This dynamical system consists of two alternating steps: first generating an explanation conditioned on one of the answers ($e \leftarrow \phi(q,a)$), then generating a new answer based on this explanation ($a' \leftarrow \phi(q,e)$). Intuitively, this system mirrors how a human expert might think about a question by considering the alternative conclusions one could draw given beliefs about the world. An answer with only a single plausible explanation that strongly supports that same answer (i.e., the decision distribution collapses to a singleton) forms a stable cycle in this system.
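The alternating system above can be sketched as a simple fixed-point iteration, with `explain` and `answer` as placeholders for the two LLM calls $\phi(q,a)$ and $\phi(q,e)$:

```python
# Sketch of the explanation<->answer dynamical system. A stable cycle is an
# answer that regenerates itself through its own explanation.
def find_stable_cycle(question, explain, answer, start, max_steps=10):
    """explain(q, a) -> e mimics e <- phi(q, a);
    answer(q, e) -> a' mimics a' <- phi(q, e)."""
    a = start
    for _ in range(max_steps):
        e = explain(question, a)
        a_next = answer(question, e)
        if a_next == a:          # fixed point: the cycle has converged
            return a, e
        a = a_next
    return None                  # no stable cycle within the step budget
```

An answer whose explanation keeps producing a different answer never terminates this loop, matching the intuition that an unstable answer should receive low confidence.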