# Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
**Authors**: Vaibhav Mavi, Shubh Jaroria, and Weiqi Sun (Dyania Health)
## Abstract
Reliability and failure detection of large language models (LLMs) are critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, in which confidence scorers estimate the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to a 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
## 1 Introduction
Large language model (LLM) agents are increasingly deployed in complex applications such as task-planning [survey-planning], dialog systems [survey-dialog], collaborative problem-solving [survey-collab] and multi-hop question answering [survey-mhqa] where detecting errors and failures is a critical challenge. A common strategy for detecting failures is to extend the system with a self-evaluation component, where either the agent itself or an auxiliary evaluator assigns a confidence score to the response [survey].
Failure detection through confidence estimation has been extensively studied in single-step prediction tasks [survey, elicit, blackbox, activations1], but its role in multi-step reasoning remains largely underexplored. Multi-step interactions pose unique challenges: reasoning chains can be arbitrarily long, errors may occur at any step, and later steps often depend on earlier ones. Consequently, direct application of existing methods often fails to identify errors in a multi-step task reliably. For example, self-certainty [self-certainty] directly applied to CoQA (Conversational Question Answering) [coqa] yields poor performance (AUC-ROC 0.523, FPR@0.9 recall 0.95).
However, a trivial extension of detecting errors after each step improves the performance substantially (AUC-ROC 0.849, FPR@0.9 recall 0.374). This observation raises a key question: Should confidence estimation methods in multi-step tasks evaluate responses: i) after each reasoning step, enabling fine-grained error-detection, or ii) holistically, considering the final answer in full context?
We systematically investigate this question across two representative settings: tool-enhanced reasoning and LLM–user dialog. Our experiments reveal that step-level evaluation often provides superior error detection, though holistic evaluation can still be advantageous in certain contexts.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Step-by-Step Solution to a Percentage Word Problem
### Overview
The image is a flowchart-style diagram illustrating a step-by-step mathematical solution to a word problem about student enrollment in dance classes. It includes the problem statement, a sequence of calculations with intermediate results, and a final answer. On the right side, there is a column of numerical scores (likely confidence or accuracy metrics) associated with each step.
### Components/Axes
1. **Problem Statement (Top, Blue Box):**
* **Text:** "Question: In a dance class of 20 students, 20% enrolled in contemporary dance, 25% of the remaining enrolled in jazz dance, and the rest enrolled in hip-hop dance. What percentage of the entire students enrolled in hip-hop dance?"
* **Icon:** A small blue person silhouette is to the right of the text.
2. **Solution Steps (Main Column, Beige Background):**
* Six sequential steps are presented in light green rounded rectangles.
* Each step is connected by a downward arrow.
* Each step is accompanied by a robot head icon (left) and a calculator icon (far left).
* The text within the steps includes both the reasoning and the mathematical operations, with some calculations highlighted in purple within `<< >>` brackets.
3. **Confidence Scores (Right Column):**
* Six magnifying glass icons, each containing a small bar chart symbol.
* Next to each icon is a numerical score, aligned horizontally with the corresponding solution step.
* The scores are: **0.94, 0.97, 0.89, 0.91, 0.18, 0.84**.
### Detailed Analysis / Content Details
**Step-by-Step Transcription and Logic:**
* **Step 1:** "There are 20 x 20/100 = <<20*20/100>> students who enrolled in contemporary dance."
* *Calculation:* 20 * 0.20 = 4 students.
* *Confidence Score:* 0.94
* **Step 2:** "There are 20 x 20/100 = 4 students who enrolled in contemporary dance. So, 20 - 4 = <<20-4>> students are enrolled in either jazz or hip-hop dance."
* *Calculation:* 20 - 4 = 16 students.
* *Confidence Score:* 0.97
* **Step 3:** "So, 20 - 4 = 16 students are enrolled in either jazz or hip-hop dance. There are 16 x 25/100 = <<16*25/100>> students who enrolled in jazz dance."
* *Calculation:* 16 * 0.25 = 4 students.
* *Confidence Score:* 0.89
* **Step 4:** "There are 16 x 25/100 = 4 students enrolled in jazz. Hence, 16 - 4 = <<16-4>> students enrolled in hip-hop."
* *Calculation:* 16 - 4 = 12 students.
* *Confidence Score:* 0.91
* **Step 5:** "Hence, 16 - 4 = 12 students enrolled in hip-hop. This is 12/16 x 100% = <<12/16x100>>% of the entire students."
* *Critical Analysis:* This step contains a **logical error**. The fraction `12/16` represents the percentage of the *remaining* students (after contemporary) who chose hip-hop, not the percentage of the *entire* class. The phrase "of the entire students" is incorrect here.
* *Confidence Score:* **0.18** (This is the lowest score, correctly flagging this step as problematic).
* **Step 6:** "This is 12/16 x 100% = 75% of the entire students. The final answer is 75%."
* *Final Answer Given:* 75%.
* *Correct Calculation:* The percentage of the *entire* class (20 students) enrolled in hip-hop is (12 / 20) * 100% = **60%**.
* *Confidence Score:* 0.84
### Key Observations
1. **Procedural Error:** The solution correctly calculates the number of students in each category (4 contemporary, 4 jazz, 12 hip-hop) but makes a critical error in the final percentage calculation. It uses the denominator 16 (the subset of students not in contemporary) instead of 20 (the total class size).
2. **Confidence Score Anomaly:** The confidence score for Step 5 is **0.18**, which is drastically lower than all others (which range from 0.84 to 0.97). This strongly suggests an automated system has identified this specific step as having low reliability or containing an error.
3. **Visual Flow:** The diagram uses a clear top-down flow with icons to denote reasoning (robot) and calculation (calculator). The right-hand column of scores provides a parallel meta-analysis of each step's validity.
### Interpretation
This image appears to be a visualization from an AI or automated tutoring system that not only solves a problem but also self-evaluates its own reasoning process. The data suggests:
* **The system can perform multi-step arithmetic correctly** (Steps 1-4 have high confidence scores).
* **The system has a mechanism for error detection.** The plummeting confidence score at Step 5 indicates the system's internal logic or a separate validator has flagged a discrepancy between the step's conclusion ("of the entire students") and the correct mathematical relationship.
* **The final answer (75%) is incorrect based on the problem statement.** The correct answer is 60%. The diagram thus serves as a case study in how a reasoning error in the final interpretation step can corrupt an otherwise correct procedural solution.
* **The presence of the confidence scores is the most significant informational element.** It transforms the image from a simple wrong solution into a diagnostic tool showing *where* and *with what certainty* the reasoning process broke down. The low score on Step 5 is a direct, quantitative indicator of the precise point of failure.
</details>
(a) step-level confidence scoring setup with GSM8K
<details>
<summary>x2.png Details</summary>

### Visual Description
## Screenshot: Chat Interface with Story and Q&A Session
### Overview
The image displays a simulated chat interface on a pale yellow background. It consists of two main sections: a large, light-blue rounded rectangle at the top containing a narrative "Story," and a series of alternating user and system chat bubbles below it. To the left of the system's answers is a vertical sidebar containing magnifying glass icons and numerical confidence scores.
### Components/Axes
* **Main Container:** A large, rounded rectangle with a black border and pale yellow fill.
* **Story Block:** A light-blue rounded rectangle at the top of the main container.
* **Chat Bubbles:**
* **User Bubbles:** Light blue, right-aligned, with a generic blue person icon to their right.
* **System/Answer Bubbles:** Light green, left-aligned, with a dark blue robot icon to their left.
* **Sidebar:** A vertical column to the left of the system answers, containing:
* **Icons:** Five identical black magnifying glass icons, each with a small bar chart inside the lens.
* **Numerical Scores:** Five numbers, four in black and one in red, positioned to the right of each icon.
### Content Details
#### 1. Story Block Text (Top of Container)
**Text:**
```
Story
... In 2006, the New York Times reported that
fans drove for miles to see the Rudolph and
Santa Claus puppets at the Center for
Puppetry Arts in Atlanta. The pair were thought
to be the last of the surviving
production puppets. ...
In 2005, the nephew of the original rescuer
found the puppets in a family attic ...
```
#### 2. Chat Conversation Transcript
The conversation flows from top to bottom.
* **User Question 1:** "Are all of the puppets still in existence?"
* **System Answer 1:** "No"
* *Associated Sidebar Score (Left of Answer):* **0.93** (Black text)
* **User Question 2:** "How many remain?"
* **System Answer 2:** "Two"
* *Associated Sidebar Score:* **0.98** (Black text)
* **User Question 3:** "Which?"
* **System Answer 3:** "Rudolph and Santa Claus"
* *Associated Sidebar Score:* **0.83** (Black text)
* **User Question 4:** "Where had they been stored since their use in the show?"
* **System Answer 4:** "Atlanta"
* *Associated Sidebar Score:* **0.33** (Red text)
* **User Question 5:** "When were they rediscovered?"
* **System Answer 5:** "2005"
* *Associated Sidebar Score:* **0.88** (Black text)
#### 3. Sidebar Elements (Left Margin)
The sidebar is vertically aligned with the system's answer bubbles.
* **Position 1 (Top):** Magnifying glass icon + **0.93**
* **Position 2:** Magnifying glass icon + **0.98**
* **Position 3:** Magnifying glass icon + **0.83**
* **Position 4:** Magnifying glass icon + **0.33** (in red)
* **Position 5 (Bottom):** Magnifying glass icon + **0.88**
### Key Observations
1. **Confidence Score Variance:** The numerical scores, likely representing confidence or accuracy metrics for the system's answers, are mostly high (0.83 to 0.98). The score for the answer "Atlanta" is a significant outlier at **0.33** and is highlighted in red, indicating low confidence or a potential error.
2. **Information Source:** The story cites a 2006 New York Times report and a 2005 rediscovery event as the basis for the information being queried.
3. **Conversation Logic:** The Q&A session extracts specific factual details (existence, count, identity, location, date) from the provided narrative in a sequential, logical manner.
### Interpretation
This image appears to be a visualization or evaluation of a question-answering system's performance on a specific text passage. The "Story" block serves as the context or knowledge source. The subsequent chat demonstrates the system's ability to extract precise answers from that context.
The sidebar scores are critical. They suggest an external evaluation mechanism is assessing each answer. The high scores for "No," "Two," "Rudolph and Santa Claus," and "2005" indicate the system correctly extracted these facts from the story. The very low, red-highlighted score for "Atlanta" is the most notable anomaly. This implies one of two things:
1. The answer "Atlanta" is factually incorrect based on the story (though the story does mention Atlanta as the location of the Center for Puppetry Arts, not necessarily the storage location).
2. The evaluation metric penalized the answer for being incomplete or imprecise (e.g., the story says puppets were found in a "family attic," not specifically in "Atlanta," making the answer partially correct but lacking specificity).
The image thus demonstrates not just information extraction, but also a layer of meta-analysis on the reliability of that extraction. It highlights how an AI's response can be right, wrong, or somewhere in between, and how such performance can be quantified and visualized.
</details>
(b) step-level confidence scoring setup with CoQA
## 2 Related Work
Prior work on confidence estimation for LLMs can be broadly divided into black-box and white-box approaches. Black-box methods assume no access to the underlying model and often rely on prompting-based strategies such as self-reflection, self-consistency, or generating multiple candidate answers [self-reflection, self-consistency, mutliple-answers, preferknow, mostly-know, steerconf-verbal]. Another line of work employs external evaluators to assess responses post hoc, using features such as similarity measures or structured scoring models [graph, fine-grained, blackbox, similarity-features].
White-box methods, in contrast, target open-source models where full access to parameters and activations is possible. These techniques include fine-tuning models to improve self-evaluation abilities [past-experience, doubt-rl], leveraging log-probabilities for calibrated confidence estimation [dst, multicalibration, self-certainty], and training regression models over hidden states to predict correctness [activations1, activations2, activations3].
survey [survey] provide a comprehensive survey of these approaches, organizing calibration techniques across settings and highlighting key limitations. However, the vast majority of existing work studies single-step tasks, where the model outputs a single response to a single query. Extending these methods to multi-step reasoning remains largely unaddressed.
## 3 Problem Definition
We define failure detection as the task of estimating the probability $p \in [0,1]$ that an agent's ($A$) response $R$ to a given input $I$ is incorrect: $p = F(R \mid I)$.
Multi-step interactions require a more general formulation: the input is defined as $I=(C,Q)$, where $C$ denotes the initial context and $Q$ is the sequence of queries. The LLM agent produces a sequence of responses $R$ with interaction length $n=|Q|=|R|$. We consider the following two adaptations:
Response-level scoring: Treat the queries and responses across all steps as a single sequence and assign one confidence score to the whole solution. This holistic approach captures global coherence.
$$
p = F_{\text{whole}}(R_{1:n} \mid C, Q_{1:n}) \tag{1}
$$
Step-level scoring: The response $R_i$ at a given step $i$ depends on the prior queries and responses, so scoring it requires the full preceding context ($C, Q_{1:i}, R_{1:i-1}$). Accordingly, we assign a separate score to each response $R_i$, conditioned on the previous queries and responses.
$$
p_i = F_{\text{step}}(R_i \mid C, Q_{1:i}, R_{1:i-1}) \tag{2}
$$
If any individual score exceeds the threshold, the entire response can be flagged as potentially incorrect. This can be achieved by aggregating with $p = \max(\{p_i\}_{i=1}^{n})$.
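As a concrete sketch, the two formulations and the aggregation rule can be written as follows; `score_step` and `score_whole` are placeholders for any of the confidence scorers, not the paper's implementation:

```python
def response_level_score(score_whole, context, queries, responses):
    """Response-level scoring (Eq. 1): one error probability for the
    full interaction, judged in complete context."""
    return score_whole(context, queries, responses)

def step_level_scores(score_step, context, queries, responses):
    """Step-level scoring (Eq. 2): score each response R_i given the
    context C, queries Q_{1:i}, and earlier responses R_{1:i-1}."""
    return [
        score_step(context, queries[: i + 1], responses[:i], responses[i])
        for i in range(len(responses))
    ]

def aggregate(step_error_probs):
    """Flag the whole interaction based on its riskiest step."""
    return max(step_error_probs)
```

An interaction is then flagged whenever `aggregate(...)` exceeds the chosen threshold, i.e., whenever any single step looks sufficiently likely to be wrong.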
## 4 Data
### 4.1 Agent Inputs
To test error detection, we focus on tasks where correctness can be objectively defined at each step. Accordingly, we select the following datasets.
GSM8K (Grade School Math - 8K) [gsm8k] is a collection of grade-school math word problems that require multi-step reasoning and computation. At each step, the agent generates an intermediate formula, queries an expression evaluator, and incorporates the tool's response into subsequent steps (Figure 1(a)). Problems in the GSM8K test set require an average of 5.1 steps.
CoQA (Conversational Question Answering) [coqa] contains over 127,000 question–answer pairs spanning 8,000 conversations that are context-grounded, with later questions often depending on previous queries and answers (Figure 1(b)). Conversations have an average of 13.5 steps.
Responses: For both tasks, we fine-tune Llama-3.2-11B-Instruct [llama32] for two epochs. Because several confidence scoring methods require training, we hold out subsets of the training and test splits for confidence estimation, while using the remainder to train the LLM agent. For GSM8K specifically, we use two sets of labels: Answer labels, which assess whether the final answer matches the ground truth, and Reasoning labels, which assess whether each intermediate reasoning step is correct. Further details on response labels and accuracy are included in Appendix A.
## 5 Experiments
### 5.1 Confidence Estimation Methods
We evaluate several confidence scoring methods under both formulations: response-level (Equation 1) and step-level (Equation 2). For methods requiring training, we use the instruction-tuned Llama-3.2-11B as the base model, replacing its generation head with a regression head for classification objectives. Details on each algorithm are given in Appendix C.
### 5.2 Evaluation Metrics
We frame error detection as a binary classification task and report the following metrics:
AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Measures how well the model separates correct from incorrect responses across thresholds. Higher is better.
FPR@0.9 Recall: Since the goal is to reliably flag potentially incorrect responses while minimizing false alarms, we measure the false positive rate (FPR) of the model at a threshold where it identifies the incorrect responses with at least $0.9$ recall. Some approaches fail to reach the target recall without trivially classifying all responses as incorrect. In these cases, we report FPR@0.9 recall as $1$ and additionally report the maximum achievable recall.
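A minimal implementation of this metric, assuming scores are error probabilities (higher means more likely incorrect) and sweeping the observed scores as thresholds:

```python
import numpy as np

def fpr_at_recall(p_error, is_error, target_recall=0.9):
    """False-positive rate at the strictest threshold whose recall on
    incorrect responses reaches `target_recall`.  If the target is only
    reachable by flagging *everything*, report FPR = 1.0 together with
    the maximum recall achieved at non-trivial thresholds."""
    p = np.asarray(p_error, dtype=float)
    y = np.asarray(is_error, dtype=bool)
    max_recall = 0.0
    for t in np.sort(np.unique(p))[::-1]:   # strictest threshold first
        flagged = p >= t                    # responses flagged as errors
        if flagged.all():                   # trivial operating point
            break
        recall = flagged[y].mean()
        max_recall = max(max_recall, recall)
        if recall >= target_recall:
            return flagged[~y].mean(), recall
    return 1.0, max_recall
```

Because recall grows monotonically as the threshold is lowered, the first qualifying threshold in the sweep yields the smallest FPR that still meets the recall target.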
### 5.3 Results
Failure detection in multi-step interactions: For both tasks, the best-performing methods achieve an AUC-ROC above $0.9$ and a recall of $0.9$ with an FPR below one-third. Across techniques, the regression model performs best on both tasks. Interestingly, preference-based reward models perform poorly, suggesting that PRMs are better suited to ranking responses by quality than to tasks with objective correctness labels [reward2].
Performance across granularity and task: For CoQA, step-level scoring significantly outperforms response-level scoring across all methods. For GSM8K, the difference is smaller and trends are less consistent. Notably, the step-level performance of self-certainty is significantly worse, likely due to tool interactions that alter the agent's responses at each step, thereby distorting the logits. This degradation is not observed for the activations-based regressor, since it only relies on hidden states from the final token.
Most techniques perform better on CoQA than GSM8K, suggesting that reasoning-intensive math problems might be more challenging for evaluators than context-grounded QA. Interestingly, GPT-4.1-mini shows significantly improved performance on GSM8K, reflecting its superior reasoning ability.
Table 1: Evaluation results of different techniques on GSM8K and CoQA. An FPR@0.9 Recall of 1.0 (mr: x) means that the recall does not exceed x without flagging everything as low confidence.
| Method | Level | GSM8K AUC-ROC | GSM8K FPR@0.9 Recall | CoQA AUC-ROC | CoQA FPR@0.9 Recall |
| --- | --- | --- | --- | --- | --- |
| Self-verbalized | response | 0.556 | 1.0 (mr: 0.13) | 0.502 | 1.0 (mr: 0.77) |
| | step | 0.546 (-0.2%) | 1.0 (mr: 0.10) | 0.624 (+24%) | 0.587 |
| Llama-3.2-11B | response | 0.586 | 1.0 (mr: 0.52) | 0.522 | 1.0 (mr: 0.73) |
| | step | 0.676 (+15%) | 1.0 (mr: 0.75) | 0.613 (+12%) | 0.81 |
| GPT-4.1-mini | response | 0.880 | 1.0 (mr: 0.81) | 0.548 | 0.88 |
| | step | 0.670 (-24%) | 1.0 (mr: 0.48) | 0.665 (+21%) | 0.476 |
| Regression | response | 0.843 | 0.441 | 0.689 | 0.732 |
| | step | 0.907 (+7%) | 0.314 | 0.952 (+38%) | 0.169 |
| PRM | response | 0.450 | 0.928 | 0.381 | 1.0 (mr: 0.57) |
| | step | - | - | 0.493 (+30%) | 0.887 |
| Self-certainty | response | 0.649 | 0.812 | 0.523 | 0.95 |
| | step | 0.395 (-40%) | 0.945 | 0.849 (+62%) | 0.374 |
| Activations | response | 0.608 | 1 (mr: 0.77) | 0.792 | 0.643 |
| | query | 0.750 (+23%) | 0.647 | 0.919 (+16%) | 0.169 |
Relation to final answer accuracy: For GSM8K, the agent reached the correct answer despite flawed intermediate reasoning in 60/879 test cases (Figure 2). Table 3 shows that, against final-answer labels, the step-level performance of all methods is slightly lower, while response-level performance improves. This is expected, since step-level scoring penalizes intermediate mistakes more strongly, while response-level scoring focuses on the overall outcome.
Identifying cases where the agent reaches the correct answer through flawed reasoning is crucial for trustworthy deployment. Table 2 shows that for most methods, step-level scoring is more effective at detecting such cases.
Table 2: Recall for cases with incorrect reasoning steps but a correct final answer. Higher is better.
| Level | Self-verbalized | Llama-3.2-11B | GPT-4.1-mini | Regression | Self-certainty | Activations |
| --- | --- | --- | --- | --- | --- | --- |
| response | 0.05 | 0.133 | 0.50 | 0.367 | 0.133 | 0.167 |
| step | 0 | 0.40 | 0.30 | 0.60 | 0.217 | 0.267 |
Case study on real-world data: We also test the effectiveness of this approach on a private dataset of real clinical notes and questions. Consistent with the analysis on public datasets, a regression model generating step-level scores achieves the best performance, with an AUC-ROC of $0.940$ and FPR@0.9 recall of $0.152$. We include further details in Appendix D.
## 6 Conclusion
We extended confidence estimation to multi-step tasks in dialogue and tool-assisted reasoning, where maintaining consistency across steps is especially challenging. Through experiments on two multi-step tasks, we find that step-level scoring, though harder to implement in some cases, generally improves error detection and reveals when correct answers emerge from faulty reasoning. Our study highlights the limits of current methods and provides a basis for developing confidence estimators better suited to multi-step reasoning.
## Appendix A Training details
### A.1 Confidence Scorers: Teacher Forcing
For methods requiring supervised training, we adopt teacher forcing [teacher-forcing]. During training, the model receives the gold history (i.e., corrected responses) when evaluating the next response. The learning objective is:
$$
p_i = F_{\text{step}}(R_i \mid C, Q_{1:i}, \hat{R}_{1:i-1}) = I\{R_i \neq \hat{R}_i\} \tag{3}
$$
where $\hat{R}$ is the list of ground-truth responses and $I\{\cdot\}$ is the indicator function. At inference time, however, no ground truth is available, and the evaluator must operate solely on the model's predictions.
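A small sketch of how teacher-forced training examples in the sense of Equation (3) might be assembled; the dict layout is illustrative, not the paper's exact input format:

```python
def teacher_forced_examples(context, queries, responses, gold_responses):
    """Build (scorer_input, label) pairs: the scorer sees the *gold*
    history but judges the agent's actual response at each step.
    Label 1 marks an incorrect step."""
    examples = []
    for i, (resp, gold) in enumerate(zip(responses, gold_responses)):
        scorer_input = {
            "context": context,
            "queries": queries[: i + 1],
            "history": gold_responses[:i],  # gold, not predicted, history
            "response": resp,               # the response being scored
        }
        examples.append((scorer_input, int(resp != gold)))
    return examples
```

Conditioning on the gold history during training keeps each step's label well-defined even when an earlier predicted step was wrong.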
## Appendix B Data preparation
### B.1 Agent: Training and Inference
For both tasks, we fine-tune Llama-3.2-11B-Instruct [llama32] for two epochs. Because several confidence scoring methods require training, we hold out subsets of the train and test splits for confidence estimation, while using the remainder to train the LLM agent.
Performance varies across datasets and granularity:
- On CoQA, the agent achieves 81.2% step-level accuracy but only 16.1% response-level accuracy. The large gap is expected, since even a single incorrect step can propagate errors downstream.
- On GSM8K, the agent achieves 65.6% answer accuracy and 47.6% step-level accuracy. Here, answer accuracy is higher because the agent may arrive at correct final answers even if some intermediate steps are flawed (see Figure 2).
### B.2 Labeling responses
We use GPT-5 to evaluate the agent's responses against ground-truth answers and intermediate steps, producing labels at both the step level and the response level. To verify label quality, we manually reviewed $100$ samples from each dataset and found labeling accuracy above $96\%$ in both settings.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Chat Screenshot: Math Problem Solution
### Overview
The image is a screenshot of a chat interface displaying a word problem about ages and a step-by-step solution provided by a chatbot. The interface has a light beige background with a dark border. The conversation consists of one initial question from a user and four subsequent response bubbles from a chatbot, each containing a step in the solution process.
### Components/Axes
* **UI Elements:**
* **Question Box:** A light blue rounded rectangle at the top containing the user's query. A generic user icon (blue silhouette) is positioned to its right.
* **Chatbot Response Bubbles:** Four light green rounded rectangles below the question, each containing a line of the solution. A consistent chatbot icon (a blue robot head) is positioned to the left of each bubble.
* **Text Content:** All text is in English. The chatbot's responses contain mathematical expressions, some of which use a placeholder notation `<<...>>`.
### Detailed Analysis
The text content is transcribed and analyzed below, following the spatial flow from top to bottom.
**1. Question Box (Top, User)**
* **Text:** "Question: Mico and Marco wanted to get to know each other. They realize that the sum of their ages is 20. What will be the sum of their ages in 10 years?"
**2. Chatbot Response Bubbles (Sequential Steps)**
* **Bubble 1 (First response):**
* **Text:** "In 10 years, Mico will be 10 + 5 = <<10+5>> years old."
* **Analysis:** This step introduces an assumption not present in the original question: that Mico is currently 5 years old. The calculation `10 + 5` is presented, with the result `15` implied but not yet filled into the `<<>>` placeholder.
* **Bubble 2 (Second response):**
* **Text:** "In 10 years, Mico will be 10 + 5 = 15 years old. In 10 years, Marco will be 10 + 15 = <<10+15>> years old."
* **Analysis:** This step confirms the result from the previous step (Mico will be 15). It then introduces a second assumption: that Marco is currently 15 years old. This is derived from the initial sum (20) minus the assumed age of Mico (5). The calculation for Marco's future age is presented with a placeholder.
* **Bubble 3 (Third response):**
* **Text:** "In 10 years, Marco will be 10 + 15 = 25 years old. In 10 years, the sum of their ages will be 15 + 25 = <<15+25>> years old."
* **Analysis:** This step confirms Marco's future age (25). It then sets up the final calculation for the sum of their future ages (15 + 25), with the result placeholder.
* **Bubble 4 (Final response):**
* **Text:** "In 10 years, the sum of their ages will be 15 + 25 = 40 years old. The final answer is **40**."
* **Analysis:** This step provides the final result (40) and states the answer explicitly. The number "40" is bolded.
### Key Observations
1. **Logical Flow with Unstated Assumptions:** The solution follows a clear, linear arithmetic progression. However, it critically depends on the unstated and arbitrary assumption that Mico is currently 5 years old and Marco is 15. The original problem only states the sum of their ages (20), not their individual ages.
2. **Placeholder Notation:** The chatbot uses a `<<expression>>` notation in the first three steps, which is filled in with the correct result in the subsequent step. This suggests a pedagogical or interactive format where calculations are revealed progressively.
3. **Correct Arithmetic within Assumptions:** Given the assumed current ages (5 and 15), all subsequent arithmetic (10+5=15, 10+15=25, 15+25=40) is correct.
4. **Final Answer Emphasis:** The conclusion is clearly stated and the numerical answer is bolded for emphasis.
### Interpretation
The image demonstrates a chatbot's step-by-step reasoning process for solving a simple algebraic word problem. The core mathematical insight, that the sum of two people's ages increases by 20 over a 10-year period (10 years added to each person), is **not** explicitly stated or used. Instead, the chatbot employs a specific example by assigning arbitrary individual ages that satisfy the initial condition (5 + 15 = 20).
This approach, while leading to the correct final answer (40), reveals a potential limitation in the chatbot's problem-solving strategy: it defaults to concrete example substitution rather than applying the general algebraic principle. The `<<>>` placeholders indicate an interactive or tutorial design, likely intended to show the work before revealing the answer. The screenshot captures a moment of "showing your work" in a digital, conversational format. The primary information conveyed is not just the answer, but the (flawed yet arithmetically sound) procedural path taken to reach it.
</details>
Figure 2: A case from GSM8K where the agent reaches the correct answer through incorrect reasoning steps. The agent assumes the current ages of Mico and Marco to be $5$ and $15$, although the question does not state them. The agent nonetheless arrives at the correct answer, since the question only concerns the sum of their ages.
## Appendix C Evaluated Confidence Estimation Methods
### C.1 Black-box methods
#### C.1.1 Self-verbalized confidence
The LLM agent is prompted to verbalize its confidence in its own response. For step-level scoring, the agent outputs the confidence score at the end of each step.
#### C.1.2 Auxiliary evaluators
External models assess the agent's responses.
Pre-trained LLMs: Instruction-tuned LLMs are prompted to evaluate the agent's responses. We consider two evaluators: (a) Llama-3.2-11B (aligned with the agent's base model), and (b) OpenAI's GPT-4.1-mini [gpt-4.1] (independent of the agent).
Regression model: We fine-tune the instruction-tuned Llama-3.2-11B model with a sequence classification head to regress confidence scores in the range $[0,1]$.
Preference-based reward model (PRM): We train a reward model on preference data, treating completions with correct answers as "chosen" and incorrect agent outputs as "rejected." For GSM8K, the existence of multiple valid reasoning paths for the same problem makes generating step-level preference data infeasible, since each incorrect step in the interaction would require a corrected version. Hence, we evaluate PRMs only at the response level and leave step-level training and evaluation as future work.
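The response-level preference data described above can be assembled roughly as follows; the field names are hypothetical, not the paper's schema:

```python
def build_preference_pairs(problems):
    """problems: [{'prompt': str, 'completions': [(text, is_correct), ...]}]
    Pair each correct completion ('chosen') with each incorrect one
    ('rejected') for reward-model training."""
    pairs = []
    for prob in problems:
        correct = [t for t, ok in prob["completions"] if ok]
        wrong = [t for t, ok in prob["completions"] if not ok]
        for c in correct:
            for w in wrong:
                pairs.append(
                    {"prompt": prob["prompt"], "chosen": c, "rejected": w}
                )
    return pairs
```

Note that this pairing only needs correctness labels at the response level, which is exactly why the step-level variant is harder: each flawed step would need a matched corrected counterpart.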
Table 3: Answer label performance on GSM8K. An FPR@0.9 Recall of 1.0 (mr: x) means that the recall does not exceed x without flagging everything as low confidence.
| Method | Level | AUC-ROC | | FPR@0.9 Recall |
| --- | --- | --- | --- | --- |
| Self-eval | response | 0.560 | 0.317 | 1.0 (mr: 0.15) |
| | step | 0.559 (-0.2%) | 0.3125 | 1.0 (mr: 0.12) |
| Llama-3.2-11B | response | 0.590 | 0.291 | 1.0 (mr: 0.52) |
| | step | 0.669 (+13%) | 0.159 | 1.0 (mr: 0.76) |
| GPT-4.1-mini | response | 0.895 | 0.088 | 1.0 (mr: 0.88) |
| | step | 0.662 (-26%) | 0.280 | 1.0 (mr: 0.49) |
| Regression | response | 0.869 | 0.075 | 0.4385 |
| | step | 0.872 (+1%) | 0.144 | 0.369 |
| PRM | response | 0.460 | 0.629 | 0.915 |
| | step | - | - | - |
| Self-certainty | response | 0.658 | 0.219 | 0.773 |
| | step | 0.342 (-48%) | 0.320 | 0.958 |
| Activations | response | 0.605 | 0.339 | 1 (mr: 0.77) |
| | query | 0.738 (+21%) | 0.279 | 0.655 |
### C.2 White-box methods
Logits: Following Self-certainty [self-certainty], we compute the KL divergence of the agent's output distribution from the uniform distribution as a measure of certainty. Since this approach consistently outperforms other logit-based methods, we use it as the representative logit-based white-box baseline. Self-certainty scores are normalized to fall within $[0,1]$.
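A minimal numpy sketch of this baseline; the paper does not specify how scores are normalized to $[0,1]$, so dividing by the upper bound $\log V$ is an assumption here:

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def self_certainty(logits):
    """Mean per-token KL(p || uniform) over a (num_tokens, vocab) logit
    array.  KL(p || u) = log(V) - H(p): it is 0 for a uniform output
    distribution and approaches log(V) for a one-hot distribution."""
    logits = np.asarray(logits, dtype=float)
    logp = log_softmax(logits)
    p = np.exp(logp)
    vocab = logits.shape[-1]
    kl = (p * (logp + np.log(vocab))).sum(axis=-1)  # per-token KL from uniform
    return kl.mean()

def normalized_self_certainty(logits):
    """One way to squash into [0, 1]: divide by the bound log(V)."""
    vocab = np.asarray(logits).shape[-1]
    return self_certainty(logits) / np.log(vocab)
```

Higher values indicate sharper, more confident token distributions; uniform logits score 0.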
Activations: Prior work [activations1, activations2, activations3] suggests that the hidden states of the model's final layer carry information about the model's behavior and can be used to extract its confidence in its response. Following this, we train a 5-layer MLP classifier on the model's final hidden states to predict a correctness score.
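A sketch of such a probe (layer widths here are placeholder assumptions; only the 5-layer MLP-on-final-hidden-states structure comes from the text): the last-layer hidden state is mapped to a single correctness logit, which would be trained with binary cross-entropy against correctness labels.

```python
# Illustrative 5-layer MLP probe over final-layer hidden states;
# HIDDEN_SIZE and the intermediate widths are placeholder choices.
import torch
import torch.nn as nn

HIDDEN_SIZE = 4096  # stand-in for the agent LLM's hidden dimension

probe = nn.Sequential(
    nn.Linear(HIDDEN_SIZE, 1024), nn.ReLU(),
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    nn.Linear(16, 1),  # five linear layers -> one correctness logit
)

hidden_state = torch.randn(2, HIDDEN_SIZE)  # final-layer states, batch of 2
p_correct = torch.sigmoid(probe(hidden_state)).squeeze(-1)
```

Unlike the logit-based baseline, this probe requires white-box access to hidden states plus a labeled training set, but it needs no extra forward passes at inference time.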
## Appendix D Evaluating on private dataset
[Figure 3 content: a question-answer dialogue over a redacted clinical note. Q: "Has the patient been diagnosed with hypertension?" A: "Yes." Q: "Is the hypertension resistant or uncontrolled?" A: "No relevant information is mentioned in the note." Q: "When was the diagnosis first established?" A: "May, 2022." Q: "Is there a diagnosis of congestive heart failure?" A: "Yes." Q: "What is the most recent NYHA score?" A: "II-III."]
Figure 3: An example from the private clinical data.
To evaluate the applicability of confidence estimation methods in real-world settings, we tested a subset of the approaches on a private dataset of conversational question-answering interactions over real patient clinical notes. A redacted example from this dataset is shown in Figure 3. We do not publicly release the data due to conflicts of interest and HIPAA compliance, so the results are not reproducible; nevertheless, they provide a valuable demonstration in a domain where trustworthiness is critical. Consistent with our analysis on the public datasets, a regression model generating step-level scores achieves the best performance, with AUC-ROC $=0.940$ and FPR@0.9 recall $=0.152$ . These results indicate that step-level confidence scoring with a regression model remains effective in complex, real-world interactions.