# Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
**Authors**: Vaibhav Mavi, Shubh Jaroria, Weiqi Sun (Dyania Health)
Abstract
Reliability and failure detection of large language models (LLMs) is critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
1 Introduction
Large language model (LLM) agents are increasingly deployed in complex applications such as task-planning [survey-planning], dialog systems [survey-dialog], collaborative problem-solving [survey-collab] and multi-hop question answering [survey-mhqa] where detecting errors and failures is a critical challenge. A common strategy for detecting failures is to extend the system with a self-evaluation component, where either the agent itself or an auxiliary evaluator assigns a confidence score to the response [survey].
Failure detection through confidence estimation has been extensively studied in single-step prediction tasks [survey, elicit, blackbox, activations1], but its role in multi-step reasoning remains largely underexplored. Multi-step interactions pose unique challenges: reasoning chains can be arbitrarily long, errors may occur at any step, and later steps often depend on earlier ones. Consequently, direct application of existing methods often fails to identify errors in a multi-step task reliably. For example, self-certainty [self-certainty] directly applied to CoQA (Conversational Question Answering) [coqa] yields poor performance (AUC-ROC 0.523, FPR@0.9 recall 0.95).
However, a trivial extension of detecting errors after each step improves the performance substantially (AUC-ROC 0.849, FPR@0.9 recall 0.374). This observation raises a key question: Should confidence estimation methods in multi-step tasks evaluate responses: i) after each reasoning step, enabling fine-grained error-detection, or ii) holistically, considering the final answer in full context?
We systematically investigate this question across two representative settings: tool-enhanced reasoning and LLM-user dialog. Our experiments reveal that step-level evaluation often provides superior error detection, though holistic evaluation can still be advantageous in certain contexts.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Flowchart: Percentage Calculation of Students Enrolled in Hip-Hop Dance
### Overview
The image is a flowchart illustrating a step-by-step calculation to determine the percentage of students enrolled in hip-hop dance in a class of 20 students. The problem involves percentages for contemporary (20%), jazz (25% of the remaining), and hip-hop (the rest). The flowchart includes mathematical operations, intermediate results, and a final answer.
### Components/Axes
- **Question Box**: Contains the problem statement.
- **Calculation Steps**: Six green boxes with mathematical operations and explanations.
- **Final Answer Box**: Contains the result "75%".
- **Icons**:
- **Magnifying Glass Icons**: Positioned on the right side, each with a numerical value (e.g., 0.94, 0.97, 0.89, 0.91, 0.18, 0.84). These likely represent confidence scores or step-specific metrics.
- **Calculator and Pencil Icons**: Positioned on the left side, connected to the calculation steps via arrows.
- **Person Icon**: Located at the top-right corner, possibly indicating a user or student.
### Detailed Analysis
1. **Question Box**:
- Text: "In a dance class of 20 students, 20% enrolled in contemporary dance, 25% of the remaining enrolled in jazz dance, and the rest enrolled in hip-hop dance. What percentage of the entire students enrolled in hip-hop dance?"
2. **Calculation Steps**:
- **Step 1**: "There are 20 x 20/100 = <<20*20/100>> students who enrolled in contemporary dance."
- Result: 4 students.
- **Step 2**: "There are 20 x 20/100 = 4 students who enrolled in contemporary dance. So, 20 - 4 = <<20-4>> students are enrolled in either jazz or hip-hop dance."
- Result: 16 students.
- **Step 3**: "So, 20 - 4 = 16 students are enrolled in either jazz or hip-hop dance. There are 16 x 25/100 = <<16*25/100>> students who enrolled in jazz dance."
- Result: 4 students.
- **Step 4**: "There are 16 x 25/100 = 4 students enrolled in jazz. Hence, 16 - 4 = <<16-4>> students enrolled in hip-hop."
- Result: 12 students.
- **Step 5**: "Hence, 16 - 4 = 12 students enrolled in hip-hop. This is 12/16 x 100% = <<12/16*100>>% of the entire students."
- Result: 75%.
- **Step 6**: "This is 12/16 x 100% = 75% of the entire students. The final answer is 75%."
3. **Final Answer Box**:
- Text: "The final answer is 75%."
### Key Observations
- The flowchart contains a **mathematical error**: The percentage of students enrolled in hip-hop is calculated as 12/16 (75%) instead of 12/20 (60%).
- The magnifying glass icons with numerical values (e.g., 0.94, 0.97) are positioned on the right side, but their purpose is unclear (e.g., confidence scores, step weights).
- The flowchart uses a linear, top-to-bottom flow with arrows connecting steps.
### Interpretation
- The flowchart attempts to calculate the percentage of students in hip-hop dance but incorrectly uses the remaining students (16) as the denominator instead of the total (20). This leads to an erroneous final answer of 75% instead of the correct 60%.
- The magnifying glass icons may indicate step-specific confidence or accuracy metrics, but their exact meaning is not explained in the image.
- The flowchart's structure is logical but flawed in its final calculation, highlighting the importance of verifying denominators in percentage problems.
## Notes
- **Language**: All text is in English.
- **Error**: The final answer (75%) is incorrect; the correct percentage is 60% (12/20).
- **Spatial Grounding**:
- Magnifying glass icons are aligned vertically on the right.
- Calculator and pencil icons are on the left, connected to steps via arrows.
- Person icon is at the top-right corner.
</details>
(a) step-level confidence scoring setup with GSM8K
<details>
<summary>x2.png Details</summary>

### Visual Description
## Screenshot: Chat Interface with Reddit Bot
### Overview
The image depicts a chat interface between a user and a Reddit bot. The conversation revolves around the history of the Rudolph and Santa Claus puppets, referencing a 2006 New York Times article. The bot provides structured answers to the user's questions, with responses color-coded (blue for user, green for bot). The Reddit logo (a blue robot face) is visible in the bottom-left corner.
### Components/Axes
- **Conversation Bubbles**:
- **User (Blue)**: Questions about the puppets' existence, quantity, names, storage location, and rediscovery year.
- **Bot (Green)**: Answers confirming the puppets' status, names, location, and year.
- **Reddit Logo**: A blue robot face in a black circle, positioned at the bottom-left.
- **Story Text**: A block of text at the top summarizing the 2006 NYT article and the puppets' history.
### Detailed Analysis
1. **Story Text**:
- Mentions the 2006 New York Times report about fans seeking Rudolph and Santa Claus puppets at the Center for Puppetry Arts in Atlanta.
- States the pair were thought to be the last surviving production puppets.
- Notes the nephew of the original rescuer found the puppets in a family attic in 2005.
2. **User Questions**:
- "Are all of the puppets still in existence?"
- "How many remain?"
- "Which?"
- "Where had they been stored since their use in the show?"
- "When were they rediscovered?"
3. **Bot Answers**:
- "No" (to existence).
- "Two" (quantity).
- "Rudolph and Santa Claus" (names).
- "Atlanta" (storage location).
- "2005" (rediscovery year).
### Key Observations
- The bot confirms only **two puppets** (Rudolph and Santa Claus) remain, contradicting the initial assumption that they were the last surviving.
- The storage location is explicitly stated as **Atlanta**, aligning with the 2006 article.
- The rediscovery year (**2005**) precedes the 2006 article, suggesting the puppets were found before the report.
- The user's question "Which?" is answered with the specific names, indicating the bot's precision.
### Interpretation
The conversation highlights a discrepancy: the 2006 article implied Rudolph and Santa Claus were the last surviving puppets, but the bot clarifies only **two** remain, suggesting others may have been lost or destroyed. The storage location (Atlanta) and rediscovery year (2005) align with the article's timeline, but the bot's answers imply the puppets were found **before** the 2006 report. This raises questions about the accuracy of the original article or the bot's data source. The Reddit bot's structured responses suggest it is programmed to extract and relay factual details from the story, prioritizing clarity over narrative context.
## Notes on Uncertainty
- No numerical values require approximation; all data points (e.g., "two," "2005") are explicitly stated.
- The bot's answers are definitive, but the absence of additional context (e.g., total number of puppets) leaves room for interpretation about their historical significance.
</details>
(b) step-level confidence scoring setup with CoQA
Figure 1: Step-level confidence scoring setups on (a) GSM8K and (b) CoQA.
2 Related Work
Prior work on confidence estimation for LLMs can be broadly divided into black-box and white-box approaches. Black-box methods assume no access to the underlying model and often rely on prompting-based strategies such as self-reflection, self-consistency, or generating multiple candidate answers [self-reflection, self-consistency, mutliple-answers, preferknow, mostly-know, steerconf-verbal]. Another line of work employs external evaluators to assess responses post hoc, using features such as similarity measures or structured scoring models [graph, fine-grained, blackbox, similarity-features].
White-box methods, in contrast, target open-source models where full access to parameters and activations is possible. These techniques include fine-tuning models to improve self-evaluation abilities [past-experience, doubt-rl], leveraging log-probabilities for calibrated confidence estimation [dst, multicalibration, self-certainty], and training regression models over hidden states to predict correctness [activations1, activations2, activations3].
survey [survey] provide a comprehensive survey of these approaches, organizing calibration techniques across settings and highlighting key limitations. However, the vast majority of existing work studies single-step tasks, where the model outputs a single response to a single query. Extending these methods to multi-step reasoning remains largely unaddressed.
3 Problem Definition
We define failure detection as the task of estimating the probability $p\in[0,1]$ that an agent's ($A$) response $R$ to a given input $I$ is incorrect: $p=\mathcal{F}(R\mid I)$.
Multi-step interactions require a more general formulation: the input is defined as $I=(C,Q)$ where $C$ denotes the initial context and $Q$ is the sequence of queries. The LLM agent produces a sequence of responses $R$ with interaction length $n=|Q|=|R|$ . We consider the following two adaptations:
Response-level scoring: Treat the queries and responses in all the steps as a single sequence and assign one confidence score to the whole solution. This holistic approach captures global coherence.
$$
p=\mathcal{F}_{whole}(R_{[1:n]}\mid C,Q_{[1:n]}) \tag{1}
$$
Step-level scoring: The response $R_{i}$ at a given step $i$ is dependent on the prior queries and responses and scoring it requires all of the previous context ( $C,Q_{[1:i]},R_{[1:i-1]}$ ). Accordingly, we assign a separate score to each response $R_{i}$ , conditioned on the previous queries and responses.
$$
p_{i}=\mathcal{F}_{step}(R_{i}\mid C,Q_{[1:i]},R_{[1:i-1]}) \tag{2}
$$
Since each $p_{i}$ estimates the probability that step $i$ is incorrect, a response-level score can be obtained as $p=\max(\{p_{i}\}_{i=1}^{n})$: if any individual score exceeds the threshold, the entire response is flagged as potentially incorrect.
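This aggregation can be sketched in a few lines. A minimal illustration, not the paper's implementation; the function name and default threshold are assumptions:

```python
def aggregate_step_scores(step_scores, threshold=0.5):
    """Combine per-step scores (Eq. 2) into one response-level score.

    Convention assumed here: each score estimates the probability that
    the step is *incorrect*, so the worst (largest) step dominates and
    the response is flagged when any step exceeds the threshold. With
    confidence-of-correctness scores, min() plays the analogous role.
    """
    p = max(step_scores)
    return p, p > threshold

# Example: five steps, one of which looks unreliable
p, flagged = aggregate_step_scores([0.06, 0.03, 0.11, 0.82, 0.16])
# p == 0.82 and the whole response is flagged
```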
4 Data
4.1 Agent Inputs
To test error detection, we focus on tasks where correctness can be objectively defined at each step. Accordingly, we select the following datasets.
GSM8K (Grade School Math - 8K) [gsm8k] is a collection of grade-school math word problems that require multi-step reasoning and computation. At each step, the agent generates an intermediate formula, queries an expression evaluator, and incorporates the tool's response into subsequent steps (Figure 1(a)). Problems in the GSM8K test set require an average of 5.1 steps.
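The calculator loop can be illustrated with the `<<expr>>` placeholder convention visible in Figure 1(a). This is a sketch assuming a simple arithmetic tool, not the paper's actual harness; a restricted `eval` stands in for the real expression evaluator:

```python
import re

def resolve_tool_calls(step_text):
    """Evaluate <<expr>> placeholders in a GSM8K reasoning step,
    mimicking the tool loop: the agent emits the formula, the tool
    returns the value, and the resolved text feeds the next step."""
    def run(match):
        expr = match.group(1)
        # Only allow plain arithmetic before evaluating.
        if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
            raise ValueError(f"unexpected token in {expr!r}")
        val = eval(expr)
        return str(int(val)) if float(val).is_integer() else str(val)
    return re.sub(r"<<([^>]+)>>", run, step_text)

# "There are 20 x 20/100 = <<20*20/100>> students" -> "... = 4 students"
```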
CoQA (Conversational Question Answering) [coqa] contains over 127,000 question-answer pairs spanning 8,000 conversations that are context-grounded, with later questions often depending on previous queries and answers (Figure 1(b)). Conversations have an average of 13.5 steps.
Responses: For both tasks, we fine-tune Llama-3.2-11B-Instruct [llama32] for two epochs. Because several confidence scoring methods require training, we hold out subsets of the training and test splits for confidence estimation, while using the remainder to train the LLM agent. For GSM8K specifically, we use two sets of labels: Answer labels, which assess whether the final answer matches the ground truth, and Reasoning labels, which assess whether each intermediate reasoning step is correct. Further details on response labels and accuracy are included in Appendix B.
5 Experiments
5.1 Confidence Estimation Methods
We evaluate several confidence scoring methods under both formulations: response-level (Equation 1) and step-level (Equation 2). For methods requiring training, we use the instruction-tuned Llama-3.2-11B as the base model, replacing its generation head with a regression head for classification objectives. Details on each algorithm are provided in Appendix C.
5.2 Evaluation Metrics
We frame error detection as a binary classification task and report the following metrics:
AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Measures how well the model separates correct from incorrect responses across thresholds. Higher is better.
FPR@0.9 Recall: Since the goal is to reliably flag potentially incorrect responses while minimizing false alarms, we measure the false positive rate (FPR) of the model at a threshold where it identifies the incorrect responses with at least $0.9$ recall. Some approaches fail to reach the target recall without trivially classifying all responses as incorrect. In these cases, we report FPR@0.9 recall as $1$ and additionally report the maximum achievable recall.
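Both metrics can be computed directly from (score, label) pairs, where label 1 marks an incorrect response and higher scores indicate higher estimated error probability. A self-contained sketch; function names are ours:

```python
def auc_roc(scores, labels):
    """AUC-ROC via the Mann-Whitney statistic: the probability that a
    randomly chosen incorrect response (label 1) receives a higher score
    than a randomly chosen correct one, with ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fpr_at_recall(scores, labels, target=0.9):
    """Lowest FPR over thresholds whose recall on incorrect responses
    reaches the target. When the target is unreachable without flagging
    everything (FPR 1), report FPR 1.0; the second return value is the
    maximum recall achievable at FPR < 1 (the 'mr' in the tables)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    points = []
    for t in sorted(set(scores)):
        recall = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        points.append((fpr, recall))
    feasible = [f for f, r in points if r >= target and f < 1.0]
    max_recall = max((r for f, r in points if f < 1.0), default=0.0)
    return (min(feasible) if feasible else 1.0), max_recall
```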
5.3 Results
Failure detection in multi-step interactions: For both tasks, the best-performing methods achieve an AUC-ROC above $0.9$ and a recall of $0.9$ with an FPR below one-third. Across techniques, the regression model performs best on both tasks. Interestingly, preference-based reward models perform poorly, suggesting that PRMs are better suited to ranking responses by quality than to tasks with objective correctness labels [reward2].
Performance across granularity and task: For CoQA, step-level scoring significantly outperforms response-level scoring across all methods. For GSM8K, the difference is smaller and trends are less consistent. Notably, the step-level performance of self-certainty is significantly worse, likely due to tool interactions that alter the agent's responses at each step, thereby distorting the logits. This degradation is not observed for the activations-based regressor, since it relies only on hidden states from the final token.
Most techniques perform better on CoQA than GSM8K, suggesting that reasoning-intensive math problems might be more challenging for evaluators than context-grounded QA. Interestingly, GPT-4.1-mini shows significantly improved performance on GSM8K, reflecting its superior reasoning ability.
Table 1: Evaluation results of different techniques on GSM8K and CoQA. An FPR@0.9 Recall of 1.0 (mr: x) means that the recall does not exceed x without flagging everything as low confidence.
| Method | Level | GSM8K AUC-ROC | GSM8K FPR@0.9 Recall | CoQA AUC-ROC | CoQA FPR@0.9 Recall |
| --- | --- | --- | --- | --- | --- |
| Self-verbalized | response | 0.556 | 1.0 (mr: 0.13) | 0.502 | 1.0 (mr: 0.77) |
| Self-verbalized | step | 0.546 (-0.2%) | 1.0 (mr: 0.10) | 0.624 (+24%) | 0.587 |
| Llama-3.2-11B | response | 0.586 | 1.0 (mr: 0.52) | 0.522 | 1.0 (mr: 0.73) |
| Llama-3.2-11B | step | 0.676 (+15%) | 1.0 (mr: 0.75) | 0.613 (+12%) | 0.81 |
| GPT-4.1-mini | response | 0.880 | 1.0 (mr: 0.81) | 0.548 | 0.88 |
| GPT-4.1-mini | step | 0.670 (-24%) | 1.0 (mr: 0.48) | 0.665 (+21%) | 0.476 |
| Regression | response | 0.843 | 0.441 | 0.689 | 0.732 |
| Regression | step | 0.907 (+7%) | 0.314 | 0.952 (+38%) | 0.169 |
| PRM | response | 0.450 | 0.928 | 0.381 | 1.0 (mr: 0.57) |
| PRM | step | - | - | 0.493 (+30%) | 0.887 |
| Self-certainty | response | 0.649 | 0.812 | 0.523 | 0.95 |
| Self-certainty | step | 0.395 (-40%) | 0.945 | 0.849 (+62%) | 0.374 |
| Activations | response | 0.608 | 1.0 (mr: 0.77) | 0.792 | 0.643 |
| Activations | query | 0.750 (+23%) | 0.647 | 0.919 (+16%) | 0.169 |
Relation to final answer accuracy: For GSM8K, the agent reached the correct answer despite flawed intermediate reasoning in 60/879 test cases (Figure 2). Table 3 shows that step-level performance of all methods against final answers is slightly lower, while the response-level performance improves. This is expected since step-level scoring penalizes intermediate mistakes more strongly, while response-level scoring focuses on the overall outcome.
Identifying cases where the agent reaches the correct answer through flawed reasoning is crucial for trustworthy deployment. Table 2 shows that for most methods, step-level scoring is more effective at detecting such cases.
Table 2: Recall for cases with incorrect reasoning steps but a correct final answer. Higher is better.
| Level | Self-verbalized | Llama-3.2-11B | GPT-4.1-mini | Regression | Self-certainty | Activations |
| --- | --- | --- | --- | --- | --- | --- |
| response | 0.05 | 0.133 | 0.50 | 0.367 | 0.133 | 0.167 |
| step | 0 | 0.40 | 0.30 | 0.60 | 0.217 | 0.267 |
Case study on real-world data: We also test the effectiveness of this approach on a private dataset with real clinical notes and questions. Consistent with the analysis on public datasets, a regression model generating step-level scores achieves the best performance, with an AUC-ROC of 0.940 and FPR@0.9 recall of 0.152. We include further details in Appendix D.
6 Conclusion
We extended confidence estimation to multi-step tasks in dialogue and tool-assisted reasoning, where maintaining consistency across steps is especially challenging. Through experiments on two multi-step tasks, we find that step-level scoring, though harder to implement in some cases, generally improves error detection and reveals when correct answers emerge from faulty reasoning. Our study highlights the limits of current methods and provides a basis for developing confidence estimators better suited to multi-step reasoning.
Appendix A Training details
A.1 Confidence Scorers: Teacher Forcing
For methods requiring supervised training, we adopt teacher forcing [teacher-forcing]. During training, the model receives the gold history (i.e., the ground-truth responses) when evaluating the next response. The learning objective is:
$$
p_{i}=\mathcal{F}(R_{i}\mid C,Q_{[1:i]},\hat{R}_{[1:i-1]})=\mathbb{I}\{R_{i}\neq\hat{R}_{i}\} \tag{3}
$$
where $\hat{R}$ is the list of ground-truth responses and $\mathbb{I}\{\cdot\}$ is the indicator function. At inference time, no ground truth is available, and the evaluator must operate solely on the agent's own predictions.
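Concretely, teacher-forced training pairs the agent's own step response with the gold history and an indicator label, as in Equation 3. A sketch with an illustrative schema; the field names are ours, not the paper's actual data format:

```python
def teacher_forced_examples(context, queries, responses, gold_responses):
    """Build one training example per step for the confidence scorer.

    The scored response R_i is the agent's own output, but the history
    contains the *gold* responses (teacher forcing); the label is 1 when
    R_i disagrees with the gold response, i.e. I{R_i != gold_i}."""
    examples = []
    for i, (r, gold) in enumerate(zip(responses, gold_responses)):
        examples.append({
            "context": context,
            "queries": queries[: i + 1],    # Q_{[1:i]}
            "history": gold_responses[:i],  # gold responses up to i-1
            "response": r,                  # R_i being scored
            "label": int(r != gold),        # 1 = incorrect step
        })
    return examples
```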
Appendix B Data preparation
B.1 Agent: Training and Inference
For both tasks, we fine-tune Llama-3.2-11B-Instruct [llama32] for two epochs. Because several confidence scoring methods require training, we hold out subsets of the train and test splits for confidence estimation, while using the remainder to train the LLM agent.
Performance varies across datasets and granularity:
- On CoQA, the agent achieves **81.2%** step-level accuracy but only **16.1%** response-level accuracy. The large gap is expected, since even a single incorrect step can propagate errors downstream.
- On GSM8K, the agent achieves **65.6%** answer accuracy and **47.6%** step-level accuracy. Here, answer accuracy is higher because the agent may arrive at correct final answers even when some intermediate steps are flawed (see Figure 2).
B.2 Labeling responses
We use GPT-5 to evaluate the agent's responses against ground-truth answers and intermediate steps, producing labels at both the step level and the response level. To verify label quality, we manually reviewed $100$ samples from each dataset and found labeling accuracy above $96\%$ in both settings.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Screenshot: Chat Interface with Age Problem Solution
### Overview
The image is a screenshot of a chat interface where a user asks a math problem about the sum of ages of two individuals, Mico and Marco, in 10 years. The chatbot provides a step-by-step solution, breaking down the problem into incremental calculations.
### Components/Axes
- **User Question**: Presented in a blue rectangular box with bold text.
- **Chatbot Responses**: Displayed in green rectangular boxes with a blue robot icon on the left.
- **Text Content**:
- **Question**: "Mico and Marco wanted to get to know each other. They realize that the sum of their ages is 20. What will be the sum of their ages in 10 years?"
- **Step 1**: "In 10 years, Mico will be 10 + 5 = <<10+5>> years old."
- **Step 2**: "In 10 years, Mico will be 10 + 5 = 15 years old."
- **Step 3**: "In 10 years, Marco will be 10 + 15 = <<10+15>> years old."
- **Step 4**: "In 10 years, Marco will be 10 + 15 = 25 years old."
- **Step 5**: "In 10 years, the sum of their ages will be 15 + 25 = <<15+25>> years old."
- **Final Answer**: "In 10 years, the sum of their ages will be 15 + 25 = 40 years old. The final answer is 40."
### Detailed Analysis
- **Mathematical Expressions**:
- All calculations use basic arithmetic (addition).
- Intermediate steps include placeholder syntax (e.g., `<<10+5>>`) and resolved values (e.g., `15`).
- **Structure**:
- The chatbot first calculates individual ages in 10 years, then sums them.
- The final answer is explicitly stated as "40."
### Key Observations
- The chatbot's responses are methodical: it assigns current ages to Mico and Marco, projects each forward 10 years, and sums the results.
- The problem states only that the current ages sum to 20; the chatbot's steps assume specific current ages (Mico 5, Marco 15). These happen to sum to 20, so the premise is preserved, but the individual values are not given in the problem.
### Interpretation
The chatbot applies the arithmetic correctly but assumes individual current ages that the problem never provides. Because the assumed ages sum to 20 and each increases by 10, the projected sum 15 + 25 = 40 coincides with the correct answer (20 + 2 × 10 = 40). The final answer is therefore correct even though the intermediate reasoning rests on unsupported assumptions, making this a clear case of a right answer reached through flawed reasoning.
</details>
Figure 2: A case from GSM8K where the agent reaches the correct answer through incorrect reasoning steps. The agent assumes the current ages of Mico and Marco to be $5$ and $15$, although the question does not state them. The agent nonetheless arrives at the correct answer, since the question only concerns the sum of their ages.
Appendix C Evaluated Confidence Estimation Methods
C.1 Black-box methods
C.1.1 Self-verbalized confidence
The LLM agent is prompted to verbalize its confidence in its own response. For step-level scoring, the agent outputs the confidence score at the end of each step.
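A minimal sketch of what such a step-level elicitation could look like. The prompt wording and the parsing fallback are our assumptions, not the paper's exact prompt:

```python
import re

# Illustrative template: the agent is asked to grade its latest step.
STEP_EVAL_PROMPT = """Conversation so far:
{history}

Latest response:
{response}

On a scale from 0.0 to 1.0, how confident are you that the latest
response is correct? Reply with a single number."""

def parse_confidence(reply, default=0.5):
    """Pull the first number out of the model's reply, clamped to
    [0, 1]; fall back to a neutral score when the model does not
    comply with the requested format."""
    m = re.search(r"\d*\.?\d+", reply)
    if not m:
        return default
    return min(1.0, max(0.0, float(m.group())))
```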
C.1.2 Auxiliary evaluators
External models assess the agentâs responses.
Pre-trained LLMs: Instruction-tuned LLMs are prompted to evaluate the agent's responses. We consider two evaluators: (a) Llama-3.2-11B (aligned with the agent's base model), and (b) OpenAI's GPT-4.1-mini [gpt-4.1] (independent of the agent).
Regression model: We fine-tune the instruction-tuned Llama-3.2-11B model with a sequence classification head to regress confidence scores in the range $[0,1]$.
Preference-based reward model (PRM): We train a reward model on preference data, treating completions with correct answers as "chosen" and incorrect agent outputs as "rejected." For GSM8K, multiple valid reasoning paths for the same problem make generating step-level preference data infeasible, since each incorrect step in the interaction would require a corrected version. Hence, we evaluate PRMs only at the response level and leave step-level training and evaluation as future work.
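Response-level preference pairs can be assembled as below. The schema is illustrative; the paper does not specify its data format:

```python
def build_preference_pairs(problems):
    """Assemble response-level preference data for reward-model
    training: for each problem, pair every completion whose final
    answer is correct ('chosen') with every incorrect agent output
    ('rejected')."""
    pairs = []
    for prob in problems:
        correct = [c for c in prob["completions"] if c["answer_correct"]]
        wrong = [c for c in prob["completions"] if not c["answer_correct"]]
        for ch in correct:
            for rj in wrong:
                pairs.append({"prompt": prob["question"],
                              "chosen": ch["text"],
                              "rejected": rj["text"]})
    return pairs
```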
Table 3: Answer label performance on GSM8K. An FPR@0.9 Recall of 1.0 (mr: x) means that the recall does not exceed x without flagging everything as low confidence.
| Method | Level | AUC-ROC | – | FPR@0.9 Recall |
| --- | --- | --- | --- | --- |
| Self-eval | response | 0.560 | 0.317 | 1.0 (mr: 0.15) |
| Self-eval | step | 0.559 (-0.2%) | 0.3125 | 1.0 (mr: 0.12) |
| Llama-3.2-11B | response | 0.590 | 0.291 | 1.0 (mr: 0.52) |
| Llama-3.2-11B | step | 0.669 (+13%) | 0.159 | 1.0 (mr: 0.76) |
| GPT-4.1-mini | response | 0.895 | 0.088 | 1.0 (mr: 0.88) |
| GPT-4.1-mini | step | 0.662 (-26%) | 0.280 | 1.0 (mr: 0.49) |
| Regression | response | 0.869 | 0.075 | 0.4385 |
| Regression | step | 0.872 (+1%) | 0.144 | 0.369 |
| PRM | response | 0.460 | 0.629 | 0.915 |
| PRM | step | - | - | - |
| Self-certainty | response | 0.658 | 0.219 | 0.773 |
| Self-certainty | step | 0.342 (-48%) | 0.320 | 0.958 |
| Activations | response | 0.605 | 0.339 | 1.0 (mr: 0.77) |
| Activations | query | 0.738 (+21%) | 0.279 | 0.655 |
C.2 White-box methods
Logits: Following self-certainty [self-certainty], we compute the KL divergence of the agent's output distribution from the uniform distribution as a measure of certainty. Since this approach consistently outperforms other logit-based methods, we use it as the representative logit-based white-box baseline. Self-certainty scores are normalized to fall within $[0,1]$.
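The per-token quantity is KL(softmax(logits) ∥ U), which a peaked (confident) next-token distribution makes large. A self-contained sketch; the averaging and the `1 - exp(-x)` squashing into $[0,1]$ are simple monotone choices of ours, not necessarily the paper's exact normalization:

```python
import math

def self_certainty(logits):
    """Self-certainty for one generated token: the KL divergence of the
    model's next-token distribution from uniform. Using
    KL(p || U) = log V - H(p), computed via a stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    v = len(logits)
    return sum(p * math.log(p * v) for p in probs if p > 0)

def normalized_certainty(token_logits):
    """Average per-token self-certainty over a response, squashed into
    [0, 1] so it can be compared against a fixed threshold."""
    kls = [self_certainty(l) for l in token_logits]
    return 1.0 - math.exp(-sum(kls) / len(kls))
```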
Activations: Prior work [activations1, activations2, activations3] suggests that the hidden states of the model's final layer contain information about the model's behavior and can be used to extract its confidence in its response. Following this, we train a 5-layer MLP classifier on the model's final hidden states to predict a correctness score.
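The probe idea can be sketched with a single-hidden-layer stand-in for the 5-layer MLP, trained by full-batch gradient descent on binary cross-entropy. All names and hyperparameters here are our assumptions; in practice the inputs would be the agent's final-token hidden states:

```python
import numpy as np

def train_probe(hidden_states, labels, dim_h=16, lr=0.5, epochs=1000, seed=0):
    """Train a small MLP probe on hidden states to predict an error
    probability (label 1 = incorrect response). Returns a scoring
    function mapping hidden states to scores in (0, 1)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(hidden_states, dtype=float)
    y = np.asarray(labels, dtype=float)
    d = X.shape[1]
    W1 = rng.normal(0, 1 / np.sqrt(d), (d, dim_h)); b1 = np.zeros(dim_h)
    W2 = rng.normal(0, 1 / np.sqrt(dim_h), dim_h); b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)              # hidden layer
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))  # predicted P(error)
        g = (p - y) / len(y)                  # dBCE/dlogit, averaged
        gW2 = h.T @ g; gb2 = g.sum()
        gh = np.outer(g, W2) * (1 - h ** 2)   # backprop through tanh
        gW1 = X.T @ gh; gb1 = gh.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    def score(x):
        h = np.tanh(np.asarray(x, dtype=float) @ W1 + b1)
        return 1 / (1 + np.exp(-(h @ W2 + b2)))
    return score
```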
Appendix D Evaluating on private dataset
<details>
<summary>x4.png Details</summary>

### Visual Description
## Screenshot: Clinical Conversation Interface
### Overview
The image depicts a structured dialogue between a user (likely a healthcare professional) and a chatbot (likely an AI medical assistant) discussing a patient's medical history. The conversation focuses on hypertension, congestive heart failure, and the New York Heart Association (NYHA) classification. A redacted clinical note is referenced at the start.
### Components/Axes
- **User Interface Elements**:
- **User Messages**: Blue speech bubbles with a human icon (top-right).
- **Bot Responses**: Green speech bubbles with a robot icon (bottom-left).
- **Clinical Note Header**: Blue box labeled "Clinical Note <redacted>".
- **Textual Content**:
- Questions and answers about hypertension, diagnosis dates, and NYHA scores.
### Detailed Analysis
1. **Clinical Note Header**:
- Label: "Clinical Note <redacted>".
- Position: Top-right corner.
- Note: The content of the clinical note is redacted, but its presence is explicitly mentioned.
2. **Conversation Flow**:
- **Question 1**: "Has the patient been diagnosed with hypertension?"
- **Answer**: "Yes" (Bot response).
- **Question 2**: "Is the hypertension resistant or uncontrolled?"
- **Answer**: "No relevant information is mentioned in the note" (Bot response).
- **Question 3**: "When was the diagnosis first established?"
- **Answer**: "May, 2022" (Bot response).
- **Question 4**: "Is there a diagnosis of congestive heart failure?"
- **Answer**: "Yes" (Bot response).
- **Question 5**: "What is the most recent NYHA score?"
- **Answer**: "II-III" (Bot response).
### Key Observations
- The conversation systematically gathers structured medical data, prioritizing hypertension and heart failure.
- The NYHA score (II-III) indicates moderate heart failure severity (Class II: slight limitation of physical activity; Class III: marked limitation of physical activity).
- The redacted clinical note suggests sensitive or confidential information is omitted.
- The bot's responses are concise, focusing on factual answers without elaboration.
### Interpretation
This interface appears to be part of a clinical decision-support system or electronic health record (EHR) tool. The structured questions ensure critical patient data is extracted efficiently, likely for diagnosis, treatment planning, or documentation. The redacted note implies compliance with privacy regulations (e.g., HIPAA). The NYHA score (II-III) highlights the patient's functional limitations, which could influence treatment strategies. The absence of details about hypertension control (resistant/uncontrolled) suggests either incomplete data or a focus on other priorities in the conversation.
</details>
Figure 3: An example from the private clinical data.
To evaluate the applicability of confidence estimation methods in real-world settings, we tested several approaches on a private dataset of conversational question-answering interactions over real patient clinical notes. A redacted example from this dataset is provided in Figure 3. We do not publicly release the data due to conflicts of interest as well as HIPAA compliance, so these results are not reproducible. Nevertheless, the dataset provides a valuable testbed in a domain where trustworthiness is critical. Consistent with the analysis on public datasets, a regression model generating step-level scores achieves the best performance, with an AUC-ROC of 0.940 and FPR@0.9 recall of 0.152. These results indicate that step-level confidence scoring with a regression model remains effective in complex, real-world interactions.