# Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
**Authors**: Vaibhav Mavi (Dyania Health), Shubh Jaroria (Dyania Health), Weiqi Sun (Dyania Health)
Abstract
Reliability and failure detection of large language models (LLMs) is critical for their deployment in high-stakes, multi-step reasoning tasks. Prior work explores confidence estimation for self-evaluating LLM-scorer systems, with confidence scorers estimating the likelihood of errors in LLM responses. However, most methods focus on single-step outputs and overlook the challenges of multi-step reasoning. In this work, we extend self-evaluation techniques to multi-step tasks, testing two intuitive approaches: holistic scoring and step-by-step scoring. Using two multi-step benchmark datasets, we show that stepwise evaluation generally outperforms holistic scoring in detecting potential errors, with up to 15% relative increase in AUC-ROC. Our findings demonstrate that self-evaluating LLM systems provide meaningful confidence estimates in complex reasoning, improving their trustworthiness and providing a practical framework for failure detection.
1 Introduction
Large language model (LLM) agents are increasingly deployed in complex applications such as task-planning [survey-planning], dialog systems [survey-dialog], collaborative problem-solving [survey-collab] and multi-hop question answering [survey-mhqa] where detecting errors and failures is a critical challenge. A common strategy for detecting failures is to extend the system with a self-evaluation component, where either the agent itself or an auxiliary evaluator assigns a confidence score to the response [survey].
Failure detection through confidence estimation has been extensively studied in single-step prediction tasks [survey, elicit, blackbox, activations1], but its role in multi-step reasoning remains largely underexplored. Multi-step interactions pose unique challenges: reasoning chains can be arbitrarily long, errors may occur at any step, and later steps often depend on earlier ones. Consequently, direct application of existing methods often fails to identify errors in a multi-step task reliably. For example, self-certainty [self-certainty] directly applied to CoQA (Conversational Question Answering) [coqa] yields poor performance (AUC-ROC 0.523, FPR@0.9 recall 0.95).
However, a trivial extension of detecting errors after each step improves the performance substantially (AUC-ROC 0.849, FPR@0.9 recall 0.374). This observation raises a key question: Should confidence estimation methods in multi-step tasks evaluate responses: i) after each reasoning step, enabling fine-grained error-detection, or ii) holistically, considering the final answer in full context?
We systematically investigate this question across two representative settings: tool-enhanced reasoning and LLM-user dialog. Our experiments reveal that step-level evaluation often provides superior error detection, though holistic evaluation can still be advantageous in certain contexts.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Problem Solution Breakdown: Dance Class Enrollment
### Overview
The image presents a step-by-step solution to a word problem concerning dance class enrollment. The problem asks for the percentage of students enrolled in hip-hop dance, given the total number of students, the percentage in contemporary dance, and the percentage of the remainder in jazz dance. The solution is broken down into several steps, each accompanied by a calculator icon and a confidence score.
### Components/Axes
The image is structured into three main sections:
1. **Question:** Located at the top-left, presenting the word problem.
2. **Solution Steps:** A series of text blocks detailing the calculation process, arranged vertically in the center. Each step is paired with a calculator icon and a confidence score.
3. **Confidence Scores:** A vertical column on the right side, displaying confidence scores associated with each step.
### Detailed Analysis or Content Details
The solution proceeds as follows:
1. **Step 1:** "There are 20 x 20/100 = <<20*20/100>> students who enrolled in contemporary dance." Confidence: 0.94
2. **Step 2:** "There are 20 x 20/100 = 4 students who enrolled in contemporary dance." Confidence: 0.97
3. **Step 3:** "So, 20 - 4 = <<20-4>> students are enrolled in either jazz or hip-hop dance." Confidence: 0.89
4. **Step 4:** "So, 20 - 4 = 16 students are enrolled in either jazz or hip-hop dance." Confidence: 0.91
5. **Step 5:** "There are 16 x 25/100 = <<16*25/100>> students who enrolled in jazz dance." Confidence: 0.18
6. **Step 6:** "There are 16 x 25/100 = 4 students enrolled in jazz." Confidence: 0.84
7. **Step 7:** "Hence, 16 - 4 = <<16-4>> students enrolled in hip-hop."
8. **Step 8:** "Hence, 16 - 4 = 12 students enrolled in hip-hop."
9. **Step 9:** "This is 12/16 x 100% = <<12/16x100>>% of the entire students."
10. **Step 10:** "This is 12/16 x 100% = 75% of the entire students."
11. **Final Answer:** "The final answer is 75%."
The confidence scores range from 0.18 to 0.97.
### Key Observations
The confidence score of 0.18 in Step 5 is significantly lower than the other scores, suggesting a potential area of uncertainty or error in the calculation. The problem is solved using a series of arithmetic operations, and the solution is presented in a clear, step-by-step manner.
### Interpretation
The image demonstrates a problem-solving approach to a percentage-based word problem. The breakdown into individual steps, coupled with confidence scores, suggests an attempt to assess the reliability of each calculation. The low confidence score in Step 5 indicates a possible point of review. The overall solution correctly calculates that 75% of the students are enrolled in hip-hop dance. The use of calculator icons alongside each step visually reinforces the computational nature of the solution. The "<<" and ">>" symbols around intermediate calculations likely denote placeholders for the actual numerical results, perhaps for a dynamic calculation system. The confidence scores could be generated by an automated system evaluating the correctness of each step.
</details>
(a) step-level confidence scoring setup with GSM8K
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Puppet Story & Question Answering
### Overview
This diagram presents a narrative about Rudolph and Santa Claus puppets, followed by a series of question-answer pairs. Each question is associated with a confidence score and an answer provided by a chatbot-like entity. The diagram is visually structured as a flow, with the story at the top and questions descending below.
### Components/Axes
The diagram consists of the following components:
* **Story Block:** A rectangular block at the top containing a textual narrative.
* **Question Blocks:** A series of rectangular blocks, each containing a question posed to a chatbot.
* **Answer Blocks:** Rectangular blocks directly below each question block, containing the chatbot's answer.
* **Confidence Scores:** Numerical values (0.33 to 0.98) displayed alongside each question, representing the chatbot's confidence in its answer.
* **Chatbot Icon:** A robot-like icon associated with each answer.
* **Human Icon:** A person-like icon associated with each question.
* **Magnifying Glass Icon:** A magnifying glass icon associated with each question and confidence score.
### Detailed Analysis / Content Details
The Story Block contains the following text:
"... In 2006, the New York Times reported that fans drove for miles to see the Rudolph and Santa Claus puppets at the Center for Puppetry Arts in Atlanta. The pair were thought to be the last of the surviving production puppets... In 2005, the nephew of the original rescuer found the puppets in a family attic..."
The Question-Answer pairs are as follows:
1. **Question:** "Are all of the puppets still in existence?"
* **Confidence Score:** 0.93
* **Answer:** "No"
2. **Question:** "How many remain?"
* **Confidence Score:** 0.98
* **Answer:** "Two"
3. **Question:** "Which?"
* **Confidence Score:** 0.83
* **Answer:** "Rudolph and Santa Claus"
4. **Question:** "Where had they been stored since their use in the show?"
* **Confidence Score:** 0.33
* **Answer:** "Atlanta"
5. **Question:** "When were rediscovered?"
* **Confidence Score:** 0.88
* **Answer:** "2005"
### Key Observations
* The chatbot demonstrates high confidence (0.93 and 0.98) in answering the first two questions.
* The chatbot's confidence is lowest (0.33) when answering the question about where the puppets were stored.
* The answers provided are consistent with the information presented in the story block.
* The confidence scores appear to correlate with the directness of the answer within the provided text.
### Interpretation
This diagram illustrates a simple question-answering system applied to a short narrative. The confidence scores suggest the system is more reliable when the answer is explicitly stated in the text (e.g., "No" regarding the puppets' existence) and less reliable when the answer requires inference or external knowledge (e.g., the storage location). The system successfully extracts key information from the story, such as the names of the puppets and the year they were rediscovered. The diagram highlights the challenges of natural language processing, particularly in understanding context and providing accurate answers to nuanced questions. The confidence scores provide a valuable metric for assessing the reliability of the system's responses. The placement of the human icon next to the questions and the robot icon next to the answers visually reinforces the interaction between a user and a chatbot.
</details>
(b) step-level confidence scoring setup with CoQA
2 Related Work
Prior work on confidence estimation for LLMs can be broadly divided into black-box and white-box approaches. Black-box methods assume no access to the underlying model and often rely on prompting-based strategies such as self-reflection, self-consistency, or generating multiple candidate answers [self-reflection, self-consistency, mutliple-answers, preferknow, mostly-know, steerconf-verbal]. Another line of work employs external evaluators to assess responses post hoc, using features such as similarity measures or structured scoring models [graph, fine-grained, blackbox, similarity-features].
White-box methods, in contrast, target open-source models where full access to parameters and activations is possible. These techniques include fine-tuning models to improve self-evaluation abilities [past-experience, doubt-rl], leveraging log-probabilities for calibrated confidence estimation [dst, multicalibration, self-certainty], and training regression models over hidden states to predict correctness [activations1, activations2, activations3].
The survey in [survey] provides a comprehensive overview of these approaches, organizing calibration techniques across settings and highlighting key limitations. However, the vast majority of existing work studies single-step tasks, where the model outputs a single response to a single query. Extending these methods to multi-step reasoning remains largely unaddressed.
3 Problem Definition
We define failure detection as the task of estimating the probability $p\in[0,1]$ that the response $R$ of an agent $A$ to a given input $I$ is incorrect: $p=\mathcal{F}(R\mid I)$.
Multi-step interactions require a more general formulation: the input is defined as $I=(C,Q)$ where $C$ denotes the initial context and $Q$ is the sequence of queries. The LLM agent produces a sequence of responses $R$ with interaction length $n=|Q|=|R|$ . We consider the following two adaptations:
Response-level scoring: Treat the queries and responses in all the steps as a single sequence and assign one confidence score to the whole solution. This holistic approach captures global coherence.
$$
p=\mathcal{F}_{whole}(R_{[1:n]}\mid C,Q_{[1:n]}) \tag{1}
$$
Step-level scoring: The response $R_{i}$ at a given step $i$ is dependent on the prior queries and responses and scoring it requires all of the previous context ( $C,Q_{[1:i]},R_{[1:i-1]}$ ). Accordingly, we assign a separate score to each response $R_{i}$ , conditioned on the previous queries and responses.
$$
p_{i}=\mathcal{F}_{step}(R_{i}\mid C,Q_{[1:i]},R_{[1:i-1]}) \tag{2}
$$
If any individual score exceeds the threshold, the entire response can be flagged as potentially incorrect. Since each $p_{i}$ estimates the probability of an error, this can be achieved by using $p=\max(\{p_{i}\}_{i=1}^{n})$.
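As a minimal sketch of the step-level formulation, the snippet below builds the conditioning context of Equation 2 for each step and applies the flagging rule described in the text. The function names `step_contexts` and `flag_response`, the threshold of 0.5, and the example scores are illustrative assumptions; the step scores themselves would come from whichever scorer is being evaluated.

```python
from typing import List, Sequence, Tuple

StepInput = Tuple[str, List[str], List[str], str]


def step_contexts(
    context: str, queries: Sequence[str], responses: Sequence[str]
) -> List[StepInput]:
    """For each step i, return (C, Q[1:i], R[1:i-1], R_i), the inputs of Equation 2."""
    if len(queries) != len(responses):
        raise ValueError("queries and responses must have the same length n")
    return [
        (context, list(queries[: i + 1]), list(responses[:i]), responses[i])
        for i in range(len(responses))
    ]


def flag_response(step_scores: Sequence[float], threshold: float) -> bool:
    """Flag the whole interaction if any step-level error score exceeds the threshold."""
    return any(p > threshold for p in step_scores)


# Hypothetical three-step interaction with made-up error scores.
contexts = step_contexts("story", ["q1", "q2", "q3"], ["a1", "a2", "a3"])
scores = [0.07, 0.67, 0.12]
flagged = flag_response(scores, threshold=0.5)
```

Response-level scoring, by contrast, would pass the full sequences of queries and responses to the scorer once and threshold the single resulting score.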
4 Data
4.1 Agent Inputs
To test error detection, we focus on tasks where correctness can be objectively defined at each step. Accordingly, we select the following datasets.
GSM8K (Grade School Math - 8K) [gsm8k] is a collection of grade-school math word problems that require multi-step reasoning and computation. At each step, the agent generates an intermediate formula, queries an expression evaluator, and incorporates the tool's response into subsequent steps (Figure 1(a)). Problems in the GSM8K test set require an average of 5.1 steps.
CoQA (Conversational Question Answering) [coqa] contains over 127,000 question-answer pairs spanning 8,000 conversations that are context-grounded, with later questions often depending on previous queries and answers (Figure 1(b)). Conversations have an average of 13.5 steps.
Responses: For both tasks, we fine-tune Llama-3.2-11B-Instruct [llama32] for two epochs. Because several confidence scoring methods require training, we hold out subsets of the training and test splits for confidence estimation, while using the remainder to train the LLM agent. For GSM8K specifically, we use two sets of labels: answer labels, which assess whether the final answer matches the ground truth, and reasoning labels, which assess whether each intermediate reasoning step is correct. Further details on response labels and accuracy are included in Appendix B.
5 Experiments
5.1 Confidence Estimation Methods
We evaluate several confidence scoring methods under both formulations: response-level (Equation 1) and step-level (Equation 2). For methods requiring training, we use the instruction-tuned Llama-3.2-11B as the base model, replacing its generation head with a regression head for classification objectives. Details on each method are given in Appendix C.
5.2 Evaluation Metrics
We frame error detection as a binary classification task and report the following metrics:
AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Measures how well the model separates correct from incorrect responses across thresholds. Higher is better.
FPR@0.9 Recall: Since the goal is to reliably flag potentially incorrect responses while minimizing false alarms, we measure the false positive rate (FPR) of the model at a threshold where it identifies the incorrect responses with at least $0.9$ recall. Some approaches fail to reach the target recall without trivially classifying all responses as incorrect. In these cases, we report FPR@0.9 recall as $1$ and additionally report the maximum achievable recall.
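The two metrics can be sketched in plain Python as follows. This is an illustrative reading of the definitions above rather than the evaluation code used for the reported numbers: labels are assumed to be 1 for incorrect responses, scores are assumed to be error scores where higher means more likely incorrect, and a response is flagged when its score is at or above the threshold. The function names and the toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that an incorrect response (label 1) gets a higher error
    score than a correct one (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fpr_at_recall(
    labels: Sequence[int], scores: Sequence[float], target_recall: float = 0.9
) -> Tuple[float, Optional[float]]:
    """Return (FPR, None) at the best non-trivial threshold reaching the target
    recall, or (1.0, max_recall) if only flagging everything reaches it."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one correct and one incorrect response")
    lowest = min(scores)
    best_fpr: Optional[float] = None
    max_recall = 0.0
    for t in sorted(set(scores)):
        if t == lowest:
            continue  # this threshold would flag every response
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        recall = tp / n_pos
        max_recall = max(max_recall, recall)
        if recall >= target_recall:
            fpr = fp / n_neg
            best_fpr = fpr if best_fpr is None else min(best_fpr, fpr)
    if best_fpr is None:
        return 1.0, max_recall
    return best_fpr, None


# Toy data: 1 = incorrect response, 0 = correct response.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)
toy_fpr, toy_max_recall = fpr_at_recall(toy_labels, toy_scores)

# A scorer with near-binary outputs cannot reach 0.9 recall non-trivially.
coarse_fpr, coarse_max_recall = fpr_at_recall([1, 1, 0, 0], [1.0, 0.0, 0.0, 0.0])
```

The second call illustrates the "(mr: x)" entries in the results tables: a scorer whose outputs take few distinct values may have no threshold that reaches the target recall short of flagging every response.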
5.3 Results
Failure detection in multi-step interactions: For both tasks, the best-performing methods achieve an AUC-ROC above $0.9$ and reach a recall of $0.9$ with an FPR below one-third. Across techniques, the regression model performs best on both tasks. Interestingly, preference-based reward models perform poorly, suggesting that PRMs are better suited to ranking responses by quality than to tasks that have objective correctness labels [reward2].
Performance across granularity and task: For CoQA, step-level scoring significantly outperforms response-level scoring across all methods. For GSM8K, the difference is smaller and trends are less consistent. Notably, the step-level performance of self-certainty is significantly worse, likely due to tool interactions that alter the agent's responses at each step, thereby distorting the logits. This degradation is not observed for the activations-based regressor, since it only relies on hidden states from the final token.
Most techniques perform better on CoQA than GSM8K, suggesting that reasoning-intensive math problems might be more challenging for evaluators than context-grounded QA. Interestingly, GPT-4.1-mini shows significantly improved performance on GSM8K, reflecting its superior reasoning ability.
Table 1: Evaluation results of different techniques on GSM8K and CoQA. An FPR@0.9 Recall of 1.0 (mr: x) means that the recall does not exceed x without flagging everything as low confidence.
| Method | Granularity | GSM8K AUC-ROC | GSM8K FPR@0.9 Recall | CoQA AUC-ROC | CoQA FPR@0.9 Recall |
| --- | --- | --- | --- | --- | --- |
| Self-verbalized | response | 0.556 | 1.0 (mr: 0.13) | 0.502 | 1.0 (mr: 0.77) |
| Self-verbalized | step | 0.546 (-0.2%) | 1.0 (mr: 0.10) | 0.624 (+24%) | 0.587 |
| Llama-3.2-11B | response | 0.586 | 1.0 (mr: 0.52) | 0.522 | 1.0 (mr: 0.73) |
| Llama-3.2-11B | step | 0.676 (+15%) | 1.0 (mr: 0.75) | 0.613 (+12%) | 0.81 |
| GPT-4.1-mini | response | 0.880 | 1.0 (mr: 0.81) | 0.548 | 0.88 |
| GPT-4.1-mini | step | 0.670 (-24%) | 1.0 (mr: 0.48) | 0.665 (+21%) | 0.476 |
| Regression | response | 0.843 | 0.441 | 0.689 | 0.732 |
| Regression | step | 0.907 (+7%) | 0.314 | 0.952 (+38%) | 0.169 |
| PRM | response | 0.450 | 0.928 | 0.381 | 1.0 (mr: 0.57) |
| PRM | step | - | - | 0.493 (+30%) | 0.887 |
| Self-certainty | response | 0.649 | 0.812 | 0.523 | 0.95 |
| Self-certainty | step | 0.395 (-40%) | 0.945 | 0.849 (+62%) | 0.374 |
| Activations | response | 0.608 | 1 (mr: 0.77) | 0.792 | 0.643 |
| Activations | query | 0.750 (+23%) | 0.647 | 0.919 (+16%) | 0.169 |
Relation to final answer accuracy: For GSM8K, the agent reached the correct answer despite flawed intermediate reasoning in 60/879 test cases (Figure 2). Table 3 shows that step-level performance of all methods against final answers is slightly lower, while the response-level performance improves. This is expected since step-level scoring penalizes intermediate mistakes more strongly, while response-level scoring focuses on the overall outcome.
Identifying cases where the agent reaches the correct answer through flawed reasoning is crucial for trustworthy deployment. Table 2 shows that for most methods, step-level scoring is more effective at detecting such cases.
Table 2: Recall for cases with incorrect reasoning steps but correct answer. Higher is better.
| Granularity | | | | | | |
| --- | --- | --- | --- | --- | --- | --- |
| response | 0.05 | 0.133 | 0.50 | 0.367 | 0.133 | 0.167 |
| step | 0 | 0.40 | 0.30 | 0.60 | 0.217 | 0.267 |
Case study on real-world data: We also test the effectiveness of this approach on a private dataset with real clinical notes and questions. Consistent with the analysis on public datasets, a regression model generating step-level scores achieves the best performance, with an AUC-ROC of $\mathbf{0.940}$ and an FPR@0.9 recall of $\mathbf{0.152}$. We include further details in Appendix D.
6 Conclusion
We extended confidence estimation to multi-step tasks in dialogue and tool-assisted reasoning, where maintaining consistency across steps is especially challenging. Through experiments on two multi-step tasks, we find that step-level scoring, though harder to implement in some cases, generally improves error detection and reveals when correct answers emerge from faulty reasoning. Our study highlights the limits of current methods and provides a basis for developing confidence estimators better suited to multi-step reasoning.
Appendix A Training details
A.1 Confidence Scorers: Teacher Forcing
For methods requiring supervised training, we adopt teacher forcing [teacher-forcing]. During training, the model receives the gold history (i.e., corrected responses) when evaluating the next response. The learning objective is:
$$
p_{i}=\mathcal{F}(R_{i}\mid C,Q_{[1:i]},\hat{R}_{[1:i-1]})=\mathbb{I}\{R_{i}\neq\hat{R}_{i}\} \tag{3}
$$
where $\hat{R}$ is the list of ground truth responses and $\mathbb{I}\{\cdot\}$ is the indicator function. At inference time, no ground truth is available, and the evaluator must operate solely on the model's predictions.
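The difference between training and inference inputs can be sketched as below. The helper names are hypothetical, and the exact string comparison used for the label is a stand-in assumption for the actual correctness judgment, which in this work comes from model-assisted labeling (Appendix B.2).

```python
from typing import List, Sequence, Tuple

TrainExample = Tuple[str, List[str], List[str], str, int]
InferenceInput = Tuple[str, List[str], List[str], str]


def teacher_forced_examples(
    context: str,
    queries: Sequence[str],
    agent_responses: Sequence[str],
    gold_responses: Sequence[str],
) -> List[TrainExample]:
    """Training inputs: the history holds gold responses, and the target is the
    indicator of Equation 3 (1 if the agent's response differs from the gold one)."""
    examples = []
    for i, (response, gold) in enumerate(zip(agent_responses, gold_responses)):
        label = int(response != gold)  # stand-in for the real correctness check
        history = list(gold_responses[:i])
        examples.append((context, list(queries[: i + 1]), history, response, label))
    return examples


def inference_inputs(
    context: str, queries: Sequence[str], agent_responses: Sequence[str]
) -> List[InferenceInput]:
    """Inference inputs: the history holds the agent's own earlier responses."""
    return [
        (context, list(queries[: i + 1]), list(agent_responses[:i]), agent_responses[i])
        for i in range(len(agent_responses))
    ]


# Toy two-step interaction where the agent's first answer is wrong.
train = teacher_forced_examples("c", ["q1", "q2"], ["x", "b"], ["a", "b"])
infer = inference_inputs("c", ["q1", "q2"], ["x", "b"])
```

In the toy example, the second training input sees the gold first answer in its history, while the second inference input sees the agent's incorrect first answer.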
Appendix B Data preparation
B.1 Agent: Training and Inference
For both tasks, we fine-tune Llama-3.2-11B-Instruct [llama32] for two epochs. Because several confidence scoring methods require training, we hold out subsets of the train and test splits for confidence estimation, while using the remainder to train the LLM agent.
Performance varies across datasets and granularity:
- On CoQA, the agent achieves $\mathbf{81.2\%}$ step-level accuracy but only $\mathbf{16.1\%}$ response-level accuracy. The large gap is expected, since even a single incorrect step can propagate errors downstream.
- On GSM8K, the agent achieves $\mathbf{65.6\%}$ answer accuracy and $\mathbf{47.6\%}$ step-level accuracy. Here, answer accuracy is higher because the agent may arrive at correct final answers even if some intermediate steps are flawed (see Figure 2).
B.2 Labeling responses
We use GPT-5 to evaluate the agent's responses against ground truth answers and intermediate steps, producing labels at both the step level and the response level. To verify label quality, we manually reviewed $100$ samples from each dataset and found labeling accuracy above $96\%$ in both settings.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Educational Material: Age Calculation Problem
### Overview
The image presents a step-by-step solution to a word problem involving the ages of two individuals, Mico and Marco. The problem asks for the sum of their ages in 10 years, given that their current combined age is 20. The solution is presented as a series of statements, each building upon the previous one, with accompanying robot icons.
### Components/Axes
The image is structured as a series of rectangular blocks, each containing a statement related to the problem-solving process. There are no axes or traditional chart elements. The blocks are arranged vertically, representing the sequence of steps.
### Detailed Analysis or Content Details
The problem statement is: "Mico and Marco wanted to get to know each other. They realize that the sum of their ages is 20. What will be the sum of their ages in 10 years?"
The solution steps are as follows:
1. "In 10 years, Mico will be 10 + 5 = <<10+5>> years old."
2. "In 10 years, Mico will be 10 + 5 = 15 years old."
3. "In 10 years, Marco will be 10 + 15 = <<10+15>> years old."
4. "In 10 years, Marco will be 10 + 15 = 25 years old."
5. "In 10 years, the sum of their ages will be 15 + 25 = <<15+25>> years old."
6. "In 10 years, the sum of their ages will be 15 + 25 = 40 years old."
7. "The final answer is 40."
The values within the double angle brackets (<< >>) appear to be placeholders for the calculation result, but are not filled in.
### Key Observations
The solution demonstrates a straightforward application of arithmetic to solve the age-related word problem. Each step clearly shows the calculation performed. The steps implicitly take the current ages of Mico and Marco to be 5 and 15, respectively, which sum to the stated combined age of 20 but are not given in the problem.
### Interpretation
The material is designed to teach basic algebraic thinking and problem-solving skills. It breaks down a word problem into smaller, manageable steps, making it easier for a student to understand the logic behind the solution. The use of robot icons adds a visual element, potentially to engage younger learners. The placeholders suggest an interactive element where the student might be expected to fill in the missing calculations. The solution implicitly assumes that Mico is currently 5 years old and Marco is currently 15 years old; any pair of current ages summing to 20 leads to the same final answer of 40.
</details>
Figure 2: A case from GSM8K where the agent reaches the correct answer through incorrect reasoning steps. The agent assumes the current ages of Mico and Marco to be $5$ and $15$, although the question does not state them. The agent nonetheless arrives at the correct answer because the question concerns only the sum of their ages.
Appendix C Evaluated Confidence Estimation Methods
C.1 Black-box methods
C.1.1 Self-verbalized confidence
The LLM agent is prompted to verbalize its confidence in its own response. For step-level scoring, the agent outputs the confidence score at the end of each step.
C.1.2 Auxiliary evaluators
External models assess the agent's responses.
Pre-trained LLMs: Instruction-tuned LLMs are prompted to evaluate the agent's responses. We consider two evaluators: (a) Llama-3.2-11B (aligned with the agent's base model), and (b) OpenAI's GPT-4.1-mini [gpt-4.1] (independent of the agent).
Regression model: We fine-tune the instruction-tuned Llama-3.2-11B model with a sequence classification head to regress confidence scores in the range $[0,1]$ .
Preference-based reward model (PRM): We train a reward model on preference data, treating completions with correct answers as "chosen" and incorrect agent outputs as "rejected." For GSM8K, multiple valid reasoning paths to solve the same problem make generating step-level preference data infeasible, since each incorrect step in the interaction would require a corrected version. Hence, we evaluate PRMs only at the response level and leave step-level training and evaluation to future work.
Table 3: Answer label performance on GSM8K. An FPR@0.9 Recall of 1.0 (mr: x) means that the recall does not exceed x without flagging everything as low confidence.
| Method | Granularity | AUC-ROC | (unlabeled in source) | FPR@0.9 Recall |
| --- | --- | --- | --- | --- |
| Self-eval | response | 0.560 | 0.317 | 1.0 (mr: 0.15) |
| Self-eval | step | 0.559 (-0.2%) | 0.3125 | 1.0 (mr: 0.12) |
| Llama-3.2-11B | response | 0.590 | 0.291 | 1.0 (mr: 0.52) |
| Llama-3.2-11B | step | 0.669 (+13%) | 0.159 | 1.0 (mr: 0.76) |
| GPT-4.1-mini | response | 0.895 | 0.088 | 1.0 (mr: 0.88) |
| GPT-4.1-mini | step | 0.662 (-26%) | 0.280 | 1.0 (mr: 0.49) |
| Regression | response | 0.869 | 0.075 | 0.4385 |
| Regression | step | 0.872 (+1%) | 0.144 | 0.369 |
| PRM | response | 0.460 | 0.629 | 0.915 |
| PRM | step | - | - | - |
| Self-certainty | response | 0.658 | 0.219 | 0.773 |
| Self-certainty | step | 0.342 (-48%) | 0.320 | 0.958 |
| Activations | response | 0.605 | 0.339 | 1 (mr: 0.77) |
| Activations | query | 0.738 (+21%) | 0.279 | 0.655 |
C.2 White-box methods
Logits: Following Self-certainty [self-certainty], we compute the KL divergence of the agent's output logits from the uniform distribution as a measure of certainty. Since this approach consistently outperforms other logit-based methods, we use it as the representative logit-based white-box baseline. Self-certainty scores are normalized to fall within $[0,1]$ .
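A minimal sketch of a logit-based certainty score of this kind is given below. It computes, for each generated token, the KL divergence between the uniform distribution over the vocabulary and the softmax of the logits, and averages over tokens. The direction of the divergence, the token averaging, and the function names are assumptions of this sketch; the normalization to $[0,1]$ is not specified here and is therefore omitted.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl_from_uniform(logits: Sequence[float]) -> float:
    """KL(U || p) for one token, where U is uniform over the vocabulary and
    p = softmax(logits). Equals 0 for uniform p and grows as p becomes peaked."""
    vocab_size = len(logits)
    log_p = log_softmax(logits)
    return -math.log(vocab_size) - sum(log_p) / vocab_size


def self_certainty(token_logits: Sequence[Sequence[float]]) -> float:
    """Average the per-token divergence over all tokens of a response."""
    if not token_logits:
        raise ValueError("response must contain at least one token")
    return sum(token_kl_from_uniform(l) for l in token_logits) / len(token_logits)


# Toy logits over a four-token vocabulary.
uniform_score = token_kl_from_uniform([0.0, 0.0, 0.0, 0.0])
mild_score = token_kl_from_uniform([1.0, 0.0, 0.0, 0.0])
peaked_score = token_kl_from_uniform([10.0, 0.0, 0.0, 0.0])
response_score = self_certainty([[1.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]])
```

Because the raw score is unbounded above, any mapping to $[0,1]$ has to be chosen separately, for example by rescaling over a held-out set.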
Activations: Prior work [activations1, activations2, activations3] suggests that the hidden states of the model's final layer contain information on the model's behavior and can be used to extract its confidence in its response. Following this, we train a 5-layer MLP classifier on the model's final hidden states to predict a correctness score.
Appendix D Evaluating on private dataset
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Clinical Note Information Extraction Flow
### Overview
The image depicts a flow diagram illustrating a series of questions posed to a clinical note, likely by an automated system (represented by a robot icon), and the corresponding answers extracted. The diagram visually represents a conversational information extraction process from a clinical note.
### Components/Axes
The diagram consists of a series of rounded rectangular boxes connected by arrows, representing a question-answer flow. Each question is posed by a human figure icon (blue silhouette) and answered by a robot icon (black and teal). The top box represents the input "Clinical Note" and the subsequent boxes represent questions and answers.
### Detailed Analysis or Content Details
The flow proceeds as follows:
1. **Input:** "Clinical Note <redacted>".
2. **Question 1:** "Has the patient been diagnosed with hypertension?".
* **Answer 1:** "Yes" (Green box).
3. **Question 2:** "Is the hypertension resistant or uncontrolled?".
* **Answer 2:** "No relevant information is mentioned in the note" (Orange box).
4. **Question 3:** "When was the diagnosis first established?".
* **Answer 3:** "May, 2022" (Green box).
5. **Question 4:** "Is there a diagnosis of congestive heart failure?".
* **Answer 4:** "Yes" (Green box).
6. **Question 5:** "What is the most recent NYHA score?".
* **Answer 5:** "II-III" (Green box).
The questions are positioned vertically, one below the other, with arrows indicating the flow of information. The robot icon is consistently positioned to the left of each answer box. The human icon is consistently positioned to the right of each question box.
### Key Observations
The diagram demonstrates a successful information extraction process, with answers being provided for each question. The redaction of the clinical note suggests sensitivity of the data. The answers are concise and directly address the questions. The use of different box colors (green, orange) may indicate the type of information extracted (e.g., positive finding, negative finding/missing information).
### Interpretation
This diagram illustrates a workflow for automated extraction of key clinical information from unstructured text (clinical notes). The system appears to be capable of identifying diagnoses (hypertension, congestive heart failure), dates (May 2022), and scores (NYHA II-III). The "No relevant information" response suggests the system can also identify when information is absent from the note. This type of system could be used to streamline clinical workflows, improve data quality, and support clinical decision-making. The redaction of the clinical note highlights the importance of data privacy and security in healthcare applications. The diagram suggests a successful extraction process, but further evaluation would be needed to assess the system's accuracy and robustness.
</details>
Figure 3: An example from the private clinical data.
To evaluate the applicability of confidence estimation methods in real-world settings, we tested some of the approaches on a private dataset consisting of conversational question-answering interactions over real patient clinical notes. A redacted example from this dataset is provided in Figure 3. We do not publicly release the data due to conflict of interest as well as HIPAA compliance, and the results are therefore not reproducible. Nevertheless, the dataset provides a valuable demonstration in a domain where trustworthiness is critical. Consistent with the analysis on public datasets, a regression model generating step-level scores achieves the best performance, with an AUC-ROC of $\mathbf{0.940}$ and an FPR@0.9 recall of $\mathbf{0.152}$. These results indicate that step-level confidence scoring with a regression model remains effective in complex, real-world interactions.