2511.09381
# Self-Correcting Large Language Models: Generation vs. Multiple Choice
> Work was done prior to joining Amazon.
## Abstract
Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement, often referred to as self-consistency or self-reflection. However, the dynamics of this self-correction mechanism may differ substantially depending on whether the model is tasked with open-ended text generation or with selecting the most appropriate response from multiple predefined options. In this paper, we conduct a systematic investigation of these two paradigms by comparing performance trends and error-correction behaviors across various natural language understanding and reasoning tasks, covering language models of different scales and families. Our experimental results reveal distinct patterns of improvement and failure modes:
While open-ended generation often benefits from the flexibility of re-interpretation and compositional refinement, multiple-choice selection can leverage clearer solution boundaries but may be limited by the provided options. This contrast also reflects the dual demands faced by emerging agentic LLM applications: effective agents must not only generate and refine open-ended plans or explanations, but also make reliable discrete choices when operating within constrained action spaces. Our findings, therefore, highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space, with implications for both knowledge-intensive reasoning and decision-oriented applications of LLMs. Code and experiments are available at https://github.com/rahmanidashti/llm-self-correction
Hossein A. Rahmani$^{\dagger}$, Satyapriya Krishna$^{\ddagger}$, Xi Wang$^{\nabla}$, Mohammadmehdi Naghiaei$^{\diamondsuit}$, Emine Yilmaz$^{\dagger}$
$^{\dagger}$University College London, $^{\ddagger}$Amazon AGI, $^{\nabla}$University of Sheffield, $^{\diamondsuit}$University of Southern California
{hossein.rahmani.22, emine.yilmaz}@ucl.ac.uk, skrishna@g.harvard.edu, xi.wang@sheffield.ac.uk, naghiaei@usc.edu
## 1 Introduction
Recent advances in Large Language Models (LLMs) have illustrated that iterative self-correction, where a model re-examines and revises its output under a self-reflection framework, can lead to significant performance gains across a variety of tasks (Madaan et al., 2023; Cook et al., 2024; Shinn et al., 2023; Gou et al., 2024, inter alia). This emergent ability is often attributed to the models' capacity to integrate chain-of-thought reasoning (Kamoi et al., 2024; Chang et al., 2024; Wei et al., 2022), prompting them to refine their own outputs much as a human proofreader or mentor would. Regarding performance validation, existing studies on self-correction have generally focused on free-form text generation (Huang et al., 2023; Madaan et al., 2023; Zelikman et al., 2022; Ma et al., 2025; Kumar et al., 2025; Krishna et al., 2024, inter alia), such as dialogue response, code optimization, and acronym generation. These tasks align naturally with the next-token-prediction objective that language models are optimized for.
However, as LLM applications expand, evaluation restricted to free-form generation offers an incomplete picture. For instance, NVIDIA advocates the deployment of smaller language models in agentic systems for tasks such as API calls and orchestration with external tools, motivated by sustainability and efficiency considerations Belcak et al. (2025). This highlights the need to examine self-correction beyond open-ended generation. In this study, we categorize natural language modeling tasks into two broad paradigms: free-form text generation and multi-choice prediction. The former treats modeling as unconstrained sequence generation over the full vocabulary, while the latter frames it as classification over a fixed set of candidate answers. These paradigms are complementary: multi-choice tasks test precise discrimination under constraints, whereas free-form tasks assess expressive generation, and together they capture the main modes of LLM use in applications such as question answering, reasoning, and open-ended dialogue.
In this paper, we investigate how self-correction unfolds when comparing open-ended generation against multiple-choice question scenarios. We hypothesize that while open-ended generation may benefit from enhanced flexibility and creativity, it also faces a larger search space and the risk of compounding errors. By contrast, multiple-choice models operate in a constrained space, which can reduce semantic drift yet limit creative corrections. Our study explores how these respective factors interact with iterative refinement, shedding light on whether self-correction aligns more naturally with either unconstrained or constrained output space.
To address these questions, we conduct comprehensive experiments on two distinct datasets that differ in nature, one focusing on knowledge-intensive question answering and the other on reasoning-oriented problems. We perform iterative inference, giving the model multiple opportunities to reevaluate and revise. By comparing error rates, consistency across iterations, and eventual convergence in each paradigm, we expose nuanced trade-offs in how LLMs adapt to different output constraints under a self-correction regime. Our results provide practical insights for the design and deployment of LLM-based systems, highlighting opportunities to harness better or tailor self-correction behaviors for diverse application settings. Furthermore, we discuss how our findings inform the broader research agenda of aligning emergent capabilities in large-scale models with varied real-world task requirements.
## 2 Related Works
Iterative Reasoning and Self-correction in LLMs.
Large language models first showed an emergent ability to reason step-by-step when prompted with chain-of-thought (CoT) examples (Wei et al., 2022). Shortly after, Wang et al. (2023) demonstrated that sampling several independent reasoning traces and selecting the majority answer, dubbed self-consistency (SC), boosts accuracy on arithmetic and commonsense tasks. Follow-up studies made the correction loop explicit by asking the model to critique its own draft before rewriting it, leading to sizeable gains in factual QA and code generation (Madaan et al., 2023). Variants that call external tools such as Python or knowledge bases during the critique stage further reduce hallucinations in open-ended generation (Chen et al., 2023; Yao et al., 2023; Gou et al., 2024). These works collectively suggest that LLMs can act as both solver and reviewer, but they focus almost exclusively on free-form text outputs.
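The self-consistency procedure described above can be sketched in a few lines; `sample_answer` below is a hypothetical callable standing in for one independently sampled reasoning trace, not an API from any of the cited works:

```python
import random
from collections import Counter

def self_consistency(sample_answer, question, n_samples=5):
    """Sample several independent reasoning traces and return the
    majority final answer, as in self-consistency (SC) decoding."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    majority, _count = Counter(answers).most_common(1)[0]
    return majority

# Toy stand-in: a noisy sampler that usually, but not always, answers "4".
def toy_sampler(question):
    return random.choice(["4", "4", "4", "5", "3"])

print(self_consistency(toy_sampler, "What is 2 + 2?", n_samples=11))
```

With an odd number of samples and a sampler whose modal answer is correct, the majority vote is correct with higher probability than any single sample, which is the intuition behind SC's gains.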
Verification-based Refinement.
Instead of trusting the model's final token distribution, several papers add lightweight verifiers. Cobbe et al. (2021) attach unit tests to code synthesis; Dixit et al. (2023) use factuality checkers for summarization; Pryzant (2023) adopts entailment models for reading comprehension. The common pattern is a two-step pipeline where the LLM proposes an answer, then a cheaper or more precise module scores it. Our work keeps the entire loop inside the language model, isolating the effect of output format itself (generation vs. multiple-choice) from external verification.
Answer Selection and Multiple-Choice Prompting.
Tasks with a closed candidate set (e.g., MMLU (Hendrycks et al., 2021), ARC (Clark et al., 2018)) are typically solved by mapping each option to an independent prompt and picking the highest-logit answer (Brown et al., 2020). Several groups have tried to retrofit iterative reasoning onto this template. Zhu et al. (2024) prepend a self-explanation, rescore the options with the explanation as additional context, and report modest but consistent gains. Li et al. (2024) show that calibrating logits with contrastive rationales helps low-parameter models, while Pan et al. (2023) explore ensembling diverse rationales. Yet a systematic comparison between correction dynamics in open versus closed output spaces is missing; our study provides that head-to-head analysis.
Bridging the paradigms.
Contemporary benchmarks increasingly mix free-form and categorical sub-tasks; TruthfulQA, for example, has both short-answer and multiple-choice splits (Lin et al., 2022). Deployment settings such as tutoring agents or search assistants likewise alternate between generating explanations and selecting the best passages. Understanding whether self-correction behaves differently under these two regimes is therefore more than a methodological curiosity: it affects prompt engineering, compute budgeting, and safety guard-rail design. By re-implementing the main correction strategies from the literature under a unified experimental budget, we show that the shape of the output space itself controls how much an LLM can benefit from extra reflection rounds.
## 3 Open-ended Generation vs. Multiple-Choice Answer Selection
Large language models are increasingly expected to handle a wide spectrum of downstream tasks, ranging from unconstrained natural language generation, such as open-domain question answering, to highly structured classification problems, like sentiment analysis. Two of the most commonly encountered settings are (i) open-ended generation, where the model must produce a free-form text response, and (ii) multiple-choice answer selection, where it must select a single correct option from a predefined set of choices. While these two paradigms are often operationalized using the same model architecture and weights, they impose fundamentally different constraints on the output space and influence how self-correction unfolds over successive inference steps. This section formalizes these two paradigms, describes how self-correction mechanisms are instantiated within each, and presents qualitative differences that help explain the empirical patterns observed in Section 5.
Open-Ended Generation.
In the open-ended generation setting, the model is required to produce an output sequence $y^{(0)}=(y^{(0)}_{1},\ldots,y^{(0)}_{T})\in\mathcal{V}^{*}$ , where $\mathcal{V}$ denotes the vocabulary and $T$ is the (variable) sequence length. The generation is conditioned on an input $x$ , which may correspond to a question, prompt, or instruction, such that the model defines a conditional distribution:
$p(y^{(0)}\mid x)=\prod_{t=1}^{T}p(y^{(0)}_{t}\mid y^{(0)}_{<t},x)$
This formulation captures the standard auto-regressive decoding process for open-ended text generation. The generated sequence may consist of a sentence, paragraph, or longer passage, and there are no explicit structural constraints beyond syntactic plausibility and task relevance.
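The factorization above scores a sequence one token at a time. A minimal sketch, assuming a hypothetical `next_token_logprob(prompt, prefix, token)` function in place of a real model's conditional distribution:

```python
import math

def sequence_log_prob(next_token_logprob, tokens, prompt):
    """Score a sequence under the autoregressive factorization
    p(y | x) = prod_t p(y_t | y_<t, x), accumulated in log space
    to avoid underflow for long sequences."""
    total = 0.0
    prefix = []
    for tok in tokens:
        total += next_token_logprob(prompt, prefix, tok)  # log p(y_t | y_<t, x)
        prefix.append(tok)
    return total

# Toy model: uniform over a 4-token vocabulary, ignoring all context.
uniform = lambda prompt, prefix, tok: math.log(1 / 4)
print(sequence_log_prob(uniform, ["the", "cat", "sat"], "x"))  # ≈ -4.159
```

Working in log space is the standard design choice here: the product of per-token probabilities becomes a sum of log-probabilities.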
Self-correction in this paradigm typically proceeds by prompting the model to critique its initial output, either via explicit instructions ("identify any flaws") or implicit prompting strategies ("think step by step"), followed by a new generation $y^{(1)}$ . This iterative process can be repeated multiple times, resulting in a sequence $\{y^{(k)}\}_{k=0}^{K}$ , where each revised answer aims to improve upon the previous one. A final answer can be selected using majority voting, log-probability re-ranking, or verifier-based scoring. Because generation is unconstrained, each iteration can introduce new content, restructure previous arguments, or expand omitted details. While this offers flexibility and the potential for substantial improvements, it also opens the door to risks such as semantic drift Ji et al. (2023b, a), where the answer becomes misaligned with the original question over time, or hallucinations, where fictitious facts are introduced in an attempt to improve fluency or apparent coherence. These failure modes tend to accumulate if the model "over-corrects" by deviating from the initial context Spataru (2024).
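The critique-then-rewrite loop just described can be sketched as follows; `generate` and `critique` are hypothetical callables wrapping the same underlying model, not an API defined in this paper:

```python
def self_correct(generate, critique, x, K=3):
    """Iteratively revise an open-ended answer: draft, critique, rewrite.
    Returns the full revision history [y^(0), y^(1), ..., y^(K)], from
    which a final answer can be picked (e.g., last round or majority vote)."""
    history = [generate(x)]                        # initial draft y^(0)
    for _ in range(K):
        feedback = critique(x, history[-1])        # "identify any flaws"
        revised = generate(x, feedback=feedback)   # rewrite given the critique
        history.append(revised)
    return history
```

Keeping the whole history, rather than only the last revision, is what makes it possible to measure semantic drift across rounds or to fall back to an earlier answer if later ones degrade.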
Multiple-Choice Answer Selection.
By contrast, the multi-choice setting restricts the output space to a finite set of candidate answers $A=\{a_{1},a_{2},\ldots,a_{M}\}$ . For each question $x$ , the model computes a logit vector $\ell(x)\in\mathbb{R}^{M}$ , from which a softmax distribution is derived, and selects the most probable answer. Self-correction in this paradigm does not involve rewriting text but rather involves revisiting the initial logits after incorporating additional information. One common strategy is to generate a rationale $r^{(t)}$ for why a particular answer is correct, then concatenate this rationale to the original prompt and recompute the logits to obtain $\ell^{(t+1)}(x,r^{(t)})$ Huang et al. (2023); Liu et al. (2024). Over successive iterations, this allows the model to refine its beliefs based on its own reasoning. However, since the answer set is fixed, the model cannot explore novel hypotheses or restructure the space of answers; instead, it can only shift probability mass among existing options. This bounded nature of the output space makes multiple-choice settings more stable and less prone to semantic drift, but also potentially less effective at recovering from early errors, especially if the correct answer has low initial probability and the generated rationales fail to meaningfully influence the logits.
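Rationale-augmented rescoring, in the spirit of $\ell^{(t+1)}(x, r^{(t)})$ above, can be sketched as below. Both `option_logits` (scoring a fixed answer set against a prompt) and `generate_rationale` are hypothetical helpers assumed for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def rationale_rescoring(option_logits, generate_rationale, x, options, rounds=2):
    """Shift probability mass among a fixed answer set by recomputing
    logits with the model's own rationale appended to the prompt."""
    prompt = x
    logits = option_logits(prompt, options)
    for _ in range(rounds):
        best = options[int(np.argmax(logits))]
        rationale = generate_rationale(prompt, best)  # r^(t): why `best` seems right
        prompt = prompt + "\n" + rationale            # condition on the rationale
        logits = option_logits(prompt, options)       # l^(t+1)(x, r^(t))
    probs = softmax(logits)
    return options[int(np.argmax(logits))], probs
```

Note that the answer set `options` never changes across rounds: the loop can only redistribute probability mass, which is exactly the stability-versus-recoverability trade-off discussed above.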
Qualitative Differences.
The two paradigms, i.e., open-ended generation and multiple-choice selection, exhibit distinct self-correction dynamics due to their differing output constraints. In open-ended generation, performance gains are typically front-loaded, with the most significant improvements occurring in the first few iterations as the model repairs inconsistencies or fills in missing details Cook et al. (2024); Huang et al. (2023); Gou et al. (2024). However, this flexibility also increases the risk of semantic drift in later rounds Spataru (2024): if the model's revisions start to go off-topic or introduce inaccuracies, the session can degrade without external intervention. In contrast, multiple-choice tasks show steadier, more incremental improvements, benefiting from the stability of a fixed answer set. They may suffer, however, from logit inertia when the correct option is initially underweighted: it can be difficult to shift the model toward a low-probability answer unless a very compelling rationale tips the balance. Generation tends to be more compute-intensive due to longer outputs per iteration, while multiple-choice achieves better accuracy-to-token efficiency by focusing on short discriminative outputs. Additionally, model scale interacts differently across formats. Larger models can better mitigate drift in generation through coherent reasoning chains, while smaller models perform more reliably in multiple-choice settings due to the structured nature of the output space and the guidance provided by explicit options.
Understanding these qualitative and quantitative differences between the two paradigms is crucial for designing robust systems that use LLMs in iterative inference settings. Depending on the task requirements, whether correctness, stability, creativity, or inference budget is the primary constraint, one or the other format may be more appropriate, and self-correction strategies should be tailored accordingly.
## 4 Experimental Setup
Problem Statement.
In this study, we aim to evaluate the dynamics of iterative self-correction under open-ended generation and multiple-choice selection across representative tasks. Let $x\in\mathcal{X}$ denote an input instance (e.g., a question) with ground-truth answer $y^{\star}$ . An LLM parameterised by $\theta$ produces an initial response $y^{(0)}$ whose format depends on the task paradigm. For open-ended generation, the model outputs a sequence $y^{(0)}\in\mathcal{V}^{\ast}$ with $p_{\theta}\big(y^{(0)}\mid x\big)=\prod_{t=1}^{T}p_{\theta}\big(y^{(0)}_{t}\mid y^{(0)}_{<t},x\big)$ . In contrast, for multiple-choice selection, the model selects $y^{(0)}\in A=\{a_{1},\dots,a_{M}\}$ from logits $\ell(x)\in\mathbb{R}^{M}$ , i.e., $y^{(0)}=\arg\max_{a_{i}\in A}\ell_{i}(x)$ , with $\sigma_{i}^{(0)}(x)=\frac{e^{\ell_{i}(x)}}{\sum_{j=1}^{M}e^{\ell_{j}(x)}}$ . Applying iterative self-correction, given the history $\mathcal{H}^{(k-1)}=(x,y^{(0)},\dots,y^{(k-1)})$ , the model produces a revision $y^{(k)}\sim p_{\theta}\big(\cdot\mid\mathcal{H}^{(k-1)}\big)$ for $k=1,\dots,K$ .
We study the sequence $\mathcal{Y}(x)=\{y^{(k)}\}_{k=0}^{K}$ and aim to maximize task accuracy of the terminal output $y^{(K)}$ over $x\sim\mathcal{D}$ . We seek to observe how performance evolves with successive self-correction iterations and how error correction or degradation manifests in each paradigm. To that end, we set up experiments on two distinct question-answering benchmarks and examine multiple LLMs under various prompting strategies.
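The evaluation protocol above can be sketched as a loop that runs $K$ correction rounds conditioned on the full history and scores each round's answer against the gold label. `revise(x, history)` is a hypothetical callable wrapping the model; it returns $y^{(0)}$ when the history is empty and $y^{(k)}$ otherwise:

```python
def accuracy_at_k(revise, dataset, K=5):
    """Track accuracy of y^(k) at every iteration k = 0..K over a dataset
    of (x, y_star) pairs, to see how performance evolves with correction."""
    per_iteration = [0] * (K + 1)
    for x, y_star in dataset:
        history = [revise(x, [])]                  # initial answer y^(0)
        for k in range(1, K + 1):
            history.append(revise(x, history))     # y^(k) ~ p(. | H^(k-1))
        for k, y in enumerate(history):
            per_iteration[k] += int(y == y_star)
    n = len(dataset)
    return [count / n for count in per_iteration]
```

Plotting the returned list against the iteration index yields exactly the accuracy-vs-iteration curves analyzed in Section 5.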
[Figure x1.png: line chart, Accuracy (%) vs. Iteration (0-5), Generation vs. Multiple-choice with shaded confidence bands; Multiple-choice (~45% to ~55%) stays roughly 20 points above Generation (~25% to ~35%), both rising in parallel.]
(a) Baseline
[Figure x2.png: line chart, Accuracy (%) vs. Iteration (0-5), with shaded confidence bands; Multiple-choice rises from ~38% to ~60%, Generation from ~22% to ~37%, with the gap stable after iteration 2.]
(b) CoT
[Figure x3.png: line chart, Accuracy (%) vs. Iteration (0-5), with shaded confidence bands; Multiple-choice (~45% to ~55%) remains about 20 points above Generation (~25% to ~35%), both improving at a similar near-linear rate as bands narrow.]
(c) SC
[Figure x4.png: line chart, Accuracy (%) vs. Iteration (0-5), with shaded confidence bands; here Generation (~75% to ~85%) leads Multiple-choice (~55% to ~70%), with both bands narrowing over iterations.]
(d) Baseline
<details>
<summary>x5.png Details</summary>

Line chart: Accuracy (%) vs. iteration (0–5) for Generation (blue) and Multiple-choice (orange), with shaded confidence bands. Generation climbs from ~0.75 to a plateau near ~0.87 by iteration 3; Multiple-choice rises from ~0.55 to ~0.69. Both bands narrow over iterations, and Generation stays ~15–20 percentage points higher throughout.
</details>
(e) CoT
<details>
<summary>x6.png Details</summary>

Line chart: Accuracy (%) vs. iteration (0–5) for Generation (blue) and Multiple-choice (orange), with shaded confidence bands. Generation rises sharply from ~0.75 to ~0.80 and plateaus near ~0.82 from iteration 2; Multiple-choice rises from ~0.60 to a plateau near ~0.68. Both bands narrow over iterations, with the Multiple-choice band remaining the wider.
</details>
(f) SC
Figure 1: Average cumulative accuracy on the generation and multiple-choice tasks. (Top) Accuracy on the DisambiguationQA dataset shows that models perform better on the multiple-choice task when we iteratively self-correct the model's responses, while (bottom) accuracy on the tinyTruthfulQA dataset indicates that models perform better on the generation task.
Research Questions.
Our study is guided by the following three research questions:
- RQ1: How do self-correction dynamics differ between open-ended and multiple-choice tasks?
- RQ2: How do model scale and prompting strategy influence self-correction across the two paradigms?
- RQ3: How does iterative self-correction affect correctness, stability, and semantic drift, and what mechanisms explain these effects?
Datasets.
We evaluate on two benchmarks, DisambiguationQA and tinyTruthfulQA, that each provide parallel formulations for both multiple-choice questions and open-ended generation. This allows us to study self-correction dynamics under consistent task content but different output constraints.
- DisambiguationQA Kazemi et al. (2025) is typically phrased in multiple-choice form, where each question presents a pronoun or reference with referential ambiguity and provides four candidate referents. However, the same questions can also be cast into an open-ended format by asking models to generate the referent rather than choose among options. Thus, DisambiguationQA instantiates a scenario where the answer space is tightly constrained but also amenable to open-ended generation in a parallel setup.
- tinyTruthfulQA Polo et al. (2024) is a challenging subset of the TruthfulQA benchmark Lin et al. (2022) focused on short-form factual queries that tend to provoke false or misleading answers from LLMs. While TruthfulQA is usually evaluated via free-form generation, where models must produce a truthful answer, a multiple-choice variant has also been developed, offering for each question a small set of candidate answers drawn from the same reference answer pool. Therefore, tinyTruthfulQA inherits this dual-format nature, where the same questions support both open-ended and multiple-choice instantiations. This dataset exemplifies scenarios requiring knowledge retrieval and precision in generation.
By evaluating both tasks, we cover one case where the ground-truth answer lies within a closed set of options and one where the answer must be generated freely. We can therefore compare how iterative self-correction dynamics differ when the model's output is tightly constrained versus freely generated.
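The parallel formulation of each benchmark can be sketched as two renderings of the same item. This is a minimal illustration, not the benchmarks' actual schema: the field names (`question`, `options`, `answer`) and the example item are hypothetical.

```python
def to_multiple_choice(item: dict) -> str:
    """Render an item as a multiple-choice prompt with lettered options."""
    letters = "ABCD"
    lines = [item["question"]]
    for letter, option in zip(letters, item["options"]):
        lines.append(f"({letter}) {option}")
    return "\n".join(lines)


def to_generation(item: dict) -> str:
    """Render the same item as an open-ended prompt: no options are shown."""
    return item["question"]


# Hypothetical DisambiguationQA-style item for illustration only.
item = {
    "question": "In 'The nurse handed the doctor her chart', who does 'her' refer to?",
    "options": ["The nurse", "The doctor", "The patient", "Ambiguous"],
    "answer": "Ambiguous",
}
```

The point of the parallel setup is that only the output space changes: `to_multiple_choice(item)` constrains the answer to four candidates, while `to_generation(item)` leaves it open-ended.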
Models.
We evaluate the dynamics of iterative self-correction under unconstrained generation and multiple-choice selection using six pre-trained language models spanning small to large parameter counts. We evaluate SmolLM2-1.7B Allal et al. (2025), Qwen2.5-3B Qwen et al. (2025), Llama-3.1-8B Grattafiori et al. (2024), Qwen2.5-14B Qwen et al. (2025), DeepSeek-R1-Distill-Llama-8B Guo et al. (2025), and Gemini-2.0-Flash Comanici et al. (2025). These models represent diverse families and scales, from distilled smaller models to state-of-the-art large models. For each model and dataset, we compare the three aforementioned prompting strategies: a direct Baseline prompt, zero-shot chain-of-thought (CoT) prompting Kojima et al. (2022), and our iterative SC procedure, which reviews and refines the model's own previous response for up to five rounds. We run all models via HuggingFace except Gemini-2.0-Flash, which is accessed through its API.
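The iterative SC procedure can be sketched as a simple loop. This is a hedged sketch, not the paper's exact implementation: `generate` is a stand-in for any chat-completion call (e.g., a HuggingFace pipeline or the Gemini API), and the review-instruction wording is illustrative.

```python
MAX_ROUNDS = 5  # the SC procedure refines for up to five rounds


def self_correct(question: str, generate) -> list:
    """Run the initial attempt (iteration 0) plus MAX_ROUNDS refinement rounds.

    `generate` maps a list of chat messages to the model's next reply.
    Returns one answer per iteration.
    """
    messages = [{"role": "user", "content": question}]
    answers = []
    for _ in range(MAX_ROUNDS + 1):
        answer = generate(messages)
        answers.append(answer)
        # The model's own answer stays in the conversation context, and the
        # next turn asks it to review and refine it (wording is illustrative).
        messages.append({"role": "assistant", "content": answer})
        messages.append({
            "role": "user",
            "content": "Review your previous answer and refine it if needed.",
        })
    return answers
```

The key design point, mirrored from the text, is that earlier answers are kept in context so each round revises the model's own prior attempt rather than starting fresh.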
Prompts.
In our experiments, we use simplified prompts to minimize the impact of prompt design on performance across tasks, keeping the focus on the self-correction mechanism Huang et al. (2023). Specifically, we apply a basic prompt for the Baseline method and adopt zero-shot Chain-of-Thought (CoT) prompting Kojima et al. (2022) for both the CoT and self-correction (SC) approaches. The initial prompts are used for the first attempt (iteration 0) under each strategy; they differ only in whether the model is encouraged to produce an explicit chain of reasoning before the final answer. For iterations beyond the first, we prepend instructions to review the prior attempts. In both cases, the model is reminded of its earlier answers (which are included in the conversation context) and encouraged to refine them. The CoT variant additionally retains the directive to reason step by step during revision. Our full prompts can be found in Appendix A.2.
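How the strategies differ at the prompt level can be sketched as a small builder. The exact wording is illustrative (the paper's full prompts are given in its Appendix A.2); only the structure follows the text: a review preamble for later iterations, and a step-by-step directive for the CoT-based strategies.

```python
def build_prompt(question: str, strategy: str, prior_answers: list) -> str:
    """Assemble a prompt for one iteration.

    strategy: "baseline", "cot", or "sc" (CoT and SC share the zero-shot
    CoT directive). `prior_answers` is empty at iteration 0.
    """
    parts = []
    if prior_answers:
        # Later iterations: remind the model of its earlier answers and
        # ask it to review them (wording is illustrative).
        parts.append("Here are your previous answers:")
        parts.extend(f"- {answer}" for answer in prior_answers)
        parts.append("Review them and refine your answer if needed.")
    parts.append(question)
    if strategy in ("cot", "sc"):
        parts.append("Let's think step by step.")
    return "\n".join(parts)
```

At iteration 0 with `strategy="baseline"` this reduces to the bare question, while the CoT/SC variants append the zero-shot CoT trigger phrase.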
Final Answer Extraction.
For all of our problems, we append the suffix "The final answer is:" to the prompt text to encourage the model to produce its final answer in a format we can easily extract. More details are in Appendix A.1.
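A minimal sketch of the extraction step, assuming the model echoes the suffix before its answer; the trimming heuristics (first line only, trailing period stripped) are illustrative, not the paper's exact rules.

```python
from typing import Optional

ANSWER_PREFIX = "The final answer is:"


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text following the last occurrence of the answer prefix."""
    idx = response.rfind(ANSWER_PREFIX)
    if idx == -1:
        return None  # the model ignored the requested format
    tail = response[idx + len(ANSWER_PREFIX):].strip()
    if not tail:
        return None
    # Keep only the first line and drop a trailing period (heuristic).
    return tail.splitlines()[0].rstrip(".")
```

Searching from the end (`rfind`) matters because CoT responses may restate the prefix phrase mid-reasoning before the actual final answer.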
## 5 Results
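The result figures below also track per-iteration answer "flips". Inferring the metric from the axis labels, a correct flip turns a wrong answer right between consecutive iterations, and an incorrect flip does the reverse; the sketch below computes both rates under that assumption, with the data layout (one boolean correctness history per question) chosen for illustration.

```python
def flip_rates(histories: list) -> tuple:
    """Compute per-iteration flip rates averaged over questions.

    `histories` holds one list of booleans per question: whether the answer
    was correct at each iteration. Returns (correct_flips, incorrect_flips),
    one value per iteration t >= 1.
    """
    n_iterations = len(histories[0])
    correct_flips, incorrect_flips = [], []
    for t in range(1, n_iterations):
        # Wrong at t-1, right at t: the revision fixed the answer.
        fixed = sum(1 for h in histories if not h[t - 1] and h[t])
        # Right at t-1, wrong at t: the revision broke the answer.
        broken = sum(1 for h in histories if h[t - 1] and not h[t])
        correct_flips.append(fixed / len(histories))
        incorrect_flips.append(broken / len(histories))
    return correct_flips, incorrect_flips
```

Both rates shrinking over iterations, as in the figures, means the models change their answers less and less in later rounds, whether for better or worse.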
<details>
<summary>x7.png Details</summary>

Line chart: Average Correct Flips vs. iteration (1–5) for Generation (blue) and Multiple-choice (orange), with shaded confidence bands. Both decline: Generation from ~0.070 to ~0.028, Multiple-choice from ~0.080 to ~0.030, converging by iteration 3. The bands overlap substantially throughout, so the gap between methods is not clearly significant.
</details>
(a) Baseline
<details>
<summary>x8.png Details</summary>

Line chart: Average Correct Flips vs. iteration (1–5) for Generation (blue) and Multiple-choice (orange), with shaded confidence bands. Both decline from ~0.070–0.080 at iteration 1 to ~0.030 at iteration 5, with Generation plateauing between iterations 2 and 3. The bands overlap heavily at every iteration and narrow over time.
</details>
(b) CoT
<details>
<summary>x9.png Details</summary>

Line chart: Average Correct Flips vs. iteration (1–5) for Generation (blue) and Multiple-choice (orange), with shaded confidence bands. Generation falls from ~0.060 to ~0.020 with a small bump at iteration 3; Multiple-choice dips from ~0.050 to ~0.020 at iteration 3, then recovers to ~0.030. The lines cross twice, and the bands overlap through iterations 2–4.
</details>
(c) SC
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: Average Incorrect Flips vs. Iteration
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," over five iterations. The performance metric is the "Average Incorrect Flips," where a lower value indicates better performance. Each data series is represented by a line with circular markers and a shaded region indicating the confidence interval or variance around the mean.
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** "Iteration"
* **Scale:** Linear, with discrete integer markers at 1, 2, 3, 4, and 5.
* **Y-Axis (Vertical):**
* **Label:** "Average Incorrect Flips"
* **Scale:** Linear, ranging from 0.000 to 0.100, with major tick marks at 0.000, 0.025, 0.050, 0.075, and 0.100.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Series 1:** "Generation" - Represented by a blue line with blue circular markers and a light blue shaded confidence interval.
* **Series 2:** "Multiple-choice" - Represented by an orange line with orange circular markers and a light orange shaded confidence interval.
### Detailed Analysis
**Data Series: Generation (Blue Line)**
* **Trend:** The line shows a general downward trend from iteration 1 to 4, with a slight upward tick at iteration 5. The confidence interval is widest at the first iteration and narrows considerably by the fifth.
* **Approximate Data Points:**
* Iteration 1: ~0.060
* Iteration 2: ~0.040
* Iteration 3: ~0.040
* Iteration 4: ~0.030
* Iteration 5: ~0.040
**Data Series: Multiple-choice (Orange Line)**
* **Trend:** The line shows a steep initial decline from iteration 1 to 2, followed by a moderate increase at iteration 3, and then a steady decline through iterations 4 and 5. The confidence interval is very wide at iteration 1 and narrows significantly by iteration 5.
* **Approximate Data Points:**
* Iteration 1: ~0.090
* Iteration 2: ~0.040
* Iteration 3: ~0.050
* Iteration 4: ~0.040
* Iteration 5: ~0.020
### Key Observations
1. **Initial Performance Gap:** At iteration 1, the "Multiple-choice" method has a substantially higher average incorrect flips (~0.090) compared to the "Generation" method (~0.060).
2. **Convergence at Iteration 2:** Both methods converge to a similar performance level of approximately 0.040 at iteration 2.
3. **Final Performance Divergence:** By iteration 5, the "Multiple-choice" method achieves the lowest observed value (~0.020), outperforming the "Generation" method (~0.040).
4. **Confidence Interval Behavior:** For both series, the shaded confidence intervals are widest at the first iteration and become progressively narrower, suggesting that the variance in performance decreases as the number of iterations increases.
5. **Crossover Point:** The two lines converge at iteration 2 and cross between iterations 4 and 5, indicating that which method yields fewer incorrect flips shifts in the final stage of the process.
### Interpretation
The chart demonstrates the learning or optimization curves for two different approaches. The "Multiple-choice" method starts with poorer performance but exhibits a more dramatic improvement, ultimately achieving the best result by the final iteration. The "Generation" method shows more consistent, moderate improvement but plateaus at a higher error rate.
The narrowing confidence intervals suggest that both methods become more reliable and consistent in their outputs as they are applied iteratively. The crossover points are critical; they indicate that the optimal method depends on the stage of the process. If only a few iterations are possible, "Generation" may be preferable initially. However, for a process allowing for five or more iterations, "Multiple-choice" appears to be the more effective strategy for minimizing incorrect flips. The data suggests an initial phase of rapid learning for "Multiple-choice," followed by a refinement phase where it surpasses the more steadily improving "Generation" method.
</details>
(d) Baseline
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Chart: Average Incorrect Flips Over Iterations
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," across five iterations. The performance metric is the "Average Incorrect Flips," where a lower value indicates better performance. The chart includes shaded regions around each line, likely representing confidence intervals or variability.
### Components/Axes
* **Chart Type:** Line chart with two data series and shaded error bands.
* **X-Axis:**
* **Label:** "Iteration"
* **Scale:** Discrete, linear scale from 1 to 5.
* **Markers:** Ticks at integers 1, 2, 3, 4, 5.
* **Y-Axis:**
* **Label:** "Average Incorrect Flips"
* **Scale:** Linear scale from 0.000 to 0.100.
* **Markers:** Ticks at 0.000, 0.025, 0.050, 0.075, 0.100.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Series 1:** "Generation" - Represented by a blue dashed line with circular markers.
* **Series 2:** "Multiple-choice" - Represented by an orange dashed line with circular markers.
* **Data Series & Shading:**
* The "Generation" series has a blue shaded area around its line.
* The "Multiple-choice" series has an orange shaded area around its line.
* The shaded areas overlap significantly, particularly in later iterations.
### Detailed Analysis
**Trend Verification:**
* **Generation (Blue Line):** The line shows an overall downward trend from iteration 1 to 5, with a notable dip at iteration 3 and a slight rise at iteration 4 before falling again.
* **Multiple-choice (Orange Line):** The line shows a general downward trend, with a plateau between iterations 2 and 3, followed by a steeper decline.
**Data Point Extraction (Approximate Values):**
| Iteration | Generation (Avg. Incorrect Flips) | Multiple-choice (Avg. Incorrect Flips) |
| :--- | :--- | :--- |
| 1 | ~0.060 | ~0.080 |
| 2 | ~0.050 | ~0.060 |
| 3 | ~0.030 | ~0.060 |
| 4 | ~0.040 | ~0.030 |
| 5 | ~0.020 | ~0.030 |
**Shaded Region Analysis:**
* The shaded regions (likely confidence intervals) are widest at iteration 1 for both series, suggesting higher initial variability.
* The bands narrow considerably by iteration 5, indicating more consistent results as iterations progress.
* The blue and orange shaded areas overlap substantially from iteration 2 onward, suggesting the performance difference between the two methods may not be statistically significant at many points.
### Key Observations
1. **Initial Performance Gap:** At iteration 1, the "Multiple-choice" method has a higher average error (~0.080) compared to the "Generation" method (~0.060).
2. **Convergence:** By iteration 5, the performance of both methods converges to a similar low error rate (between ~0.020 and ~0.030).
3. **Non-Monotonic Improvement:** The "Generation" method does not improve linearly; its error rate increases slightly from iteration 3 to 4 before decreasing again.
4. **Plateau in Multiple-choice:** The "Multiple-choice" method shows no improvement between iterations 2 and 3, maintaining an error rate of ~0.060.
5. **Reducing Variability:** The narrowing of the shaded bands for both series indicates that the results become more precise and less variable with more iterations.
### Interpretation
The chart demonstrates that both the "Generation" and "Multiple-choice" methods are effective at reducing the "Average Incorrect Flips" over successive iterations, suggesting a learning or optimization process.
* **Relative Efficacy:** The "Generation" method starts with a performance advantage. However, the "Multiple-choice" method shows a steeper rate of improvement between iterations 3 and 5, ultimately catching up.
* **Convergence and Reliability:** The convergence of the lines and the narrowing of the confidence bands by iteration 5 suggest that given enough iterations, both methods achieve a similar, reliable, and low-error outcome. The initial higher variability diminishes, indicating the process stabilizes.
* **Practical Implication:** If the goal is to minimize errors quickly (in few iterations), the "Generation" method appears superior initially. If the process can run for more iterations (5 or more), the choice between methods may become less critical based on this final error metric alone. The overlapping confidence intervals caution against declaring one method definitively better than the other at most individual iteration points without further statistical analysis. The data suggests the underlying process for both methods becomes more consistent and accurate over time.
</details>
(e) CoT
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Chart: Average Incorrect Flips Over Iterations
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," across five iterations. The performance metric is the "Average Incorrect Flips," where a lower value indicates better performance. Each data series is represented by a dashed line with circular markers and includes a shaded region indicating the confidence interval or variability around the mean.
### Components/Axes
* **Chart Type:** Line chart with confidence intervals.
* **X-Axis (Horizontal):**
* **Label:** "Iteration"
* **Scale:** Discrete, linear scale from 1 to 5.
* **Markers:** 1, 2, 3, 4, 5.
* **Y-Axis (Vertical):**
* **Label:** "Average Incorrect Flips"
* **Scale:** Linear scale from 0.000 to 0.100.
* **Markers:** 0.000, 0.025, 0.050, 0.075, 0.100.
* **Legend:**
* **Position:** Top-right corner of the chart area.
* **Series 1:** "Generation" - Represented by a blue dashed line with blue circular markers.
* **Series 2:** "Multiple-choice" - Represented by an orange dashed line with orange circular markers.
* **Visual Elements:**
* **Shaded Regions:** A semi-transparent blue shaded area surrounds the "Generation" line, and a semi-transparent orange shaded area surrounds the "Multiple-choice" line. These represent the range of uncertainty (e.g., standard deviation or confidence interval) for each series.
### Detailed Analysis
**Data Series: Generation (Blue)**
* **Trend:** Shows a consistent downward trend across all five iterations.
* **Data Points (Approximate):**
* Iteration 1: ~0.060
* Iteration 2: ~0.040
* Iteration 3: ~0.030
* Iteration 4: ~0.030
* Iteration 5: ~0.020
* **Confidence Interval:** The blue shaded region is widest at Iteration 1 (spanning roughly 0.035 to 0.085) and narrows progressively, becoming tightest at Iteration 5.
**Data Series: Multiple-choice (Orange)**
* **Trend:** Shows an initial decrease, a slight increase at Iteration 3, and then stabilizes.
* **Data Points (Approximate):**
* Iteration 1: ~0.050
* Iteration 2: ~0.030
* Iteration 3: ~0.040
* Iteration 4: ~0.030
* Iteration 5: ~0.030
* **Confidence Interval:** The orange shaded region is also widest at the start (spanning roughly 0.000 to 0.075 at Iteration 1) and narrows over time, though it remains slightly wider than the blue region at Iteration 5.
### Key Observations
1. **Initial Performance:** At Iteration 1, the "Multiple-choice" method starts with a lower average incorrect flips value (~0.050) compared to the "Generation" method (~0.060).
2. **Convergence:** By Iteration 2, both methods have similar performance (~0.030 for Multiple-choice, ~0.040 for Generation). Their confidence intervals overlap significantly from Iteration 2 onward.
3. **Final Performance:** At Iteration 5, the "Generation" method achieves the lowest observed value (~0.020), while the "Multiple-choice" method plateaus at ~0.030.
4. **Variability:** Both methods show high initial variability (wide shaded areas), which decreases with more iterations, suggesting the results become more consistent over time.
5. **Crossover:** The "Generation" line crosses below the "Multiple-choice" line between Iteration 2 and Iteration 3 and remains below it for the rest of the charted iterations.
### Interpretation
The chart demonstrates that both the "Generation" and "Multiple-choice" methods are effective at reducing the "Average Incorrect Flips" over successive iterations. The "Generation" method shows a more consistent and ultimately greater improvement, starting from a worse position but ending with the best performance. The "Multiple-choice" method improves quickly but then hits a plateau.
The significant overlap in the confidence intervals, especially from Iteration 2 to 4, suggests that the performance difference between the two methods may not be statistically significant during those stages. The narrowing of the shaded regions indicates that the evaluation of both methods becomes more precise or stable with more iterations.
**Underlying Narrative:** This data likely comes from an iterative machine learning or optimization process (e.g., training a model, refining a prompt). The "Incorrect Flips" could refer to errors in classification, generation, or decision-making. The trend suggests that iterative refinement is beneficial, and the "Generation" approach, while potentially noisier initially, may have a higher ceiling for improvement in this specific task. The plateau of the "Multiple-choice" method could indicate it reaches its optimal performance faster but has less room for further refinement.
</details>
(f) SC
Figure 2: Average Correct and Incorrect Flips on DisambiguationQA
We now analyze the results in relation to our three research questions.
Improvement Patterns Across Iterations (RQ1).
To address RQ1, we first examine the aggregate performance reported in Figure 1, which compares accuracy across correction iterations for generation and multiple-choice formats. The generation paradigm improves rapidly in the first one or two iterations, showing that early revisions are effective at fixing obvious errors or adding missing information. However, after these early gains, performance often plateaus or declines, as additional revisions increase the risk of semantic drift and lead to new mistakes. In contrast, the multiple-choice paradigm improves more gradually and steadily. Accuracy rises incrementally with each round of self-correction, reflecting cautious re-weighting among fixed options. Yet this format struggles to recover from poor initial predictions: if the model's first choice is wrong, subsequent iterations rarely flip it to the correct option, showing the effects of logit inertia.
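The two output formats compared above differ only in how the prompt constrains the model's answer space, which can be sketched as a single iterative loop. In the sketch below, `ask_model` is a hypothetical prompt-to-response wrapper around an LLM API, and the prompt wording is illustrative rather than the paper's exact template:

```python
def self_correct(ask_model, question, options=None, iterations=5):
    """Iteratively ask a model to review and revise its own answer.

    ask_model: hypothetical callable mapping a prompt string to a response
               string (e.g. a wrapper around an LLM API) -- an assumption,
               not part of the paper's released code.
    options:   None for open-ended generation, or a list of fixed choices
               for the multiple-choice format.
    Returns the answer produced at each of the `iterations` rounds.
    """
    if options is not None:
        # Multiple-choice: constrain the answer space to fixed options.
        question += "\nOptions:\n" + "\n".join(
            f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)
        )
    answers = [ask_model(question)]
    for _ in range(iterations - 1):
        # Self-correction round: show the model its previous answer and
        # ask it to critique and revise it.
        prompt = (
            f"{question}\nYour previous answer: {answers[-1]}\n"
            "Review your answer for errors and give a corrected final answer."
        )
        answers.append(ask_model(prompt))
    return answers
```

The only structural difference between the paradigms is the `options` branch; everything downstream (accuracy and flip statistics per iteration) is computed from the returned `answers` trace.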
Figures 2 and 3 present the "flip" dynamics of self-correction on the two datasets, broken down into correct flips (a previously wrong answer corrected to right) and incorrect flips (a previously correct answer changed to wrong) over successive iterations. On DisambiguationQA (Figure 2), multiple-choice self-correction yields very few flips overall: correct answers are stably retained, but wrong initial guesses are seldom corrected. Generation, by contrast, produces more frequent flips, many of them beneficial in early iterations (e.g., correcting ambiguous references) but increasingly harmful in later ones, as the model begins to over-correct or drift and replaces correct answers with incorrect ones. On tinyTruthfulQA (Figure 3), the contrast is sharper: generation produces a high number of flips, with many early correct flips (replacing misconceptions with truths) but also a rising number of incorrect flips in later rounds, reflecting semantic drift. Multiple-choice again remains stable, with minimal incorrect flips but limited ability to recover from an early mistake.
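The flip statistics plotted in Figures 2 and 3 can be recovered from per-iteration answer logs. A minimal sketch, assuming answers are stored as one list of model answers per iteration (the `flip_rates` helper and its data layout are illustrative, not the released evaluation code):

```python
def flip_rates(answers, gold):
    """Per-transition correct/incorrect flip rates.

    answers: list of iterations, where answers[t][i] is the model's answer
             to question i at iteration t (assumed layout, for illustration).
    gold:    gold[i] is the reference answer to question i.

    A correct flip between iterations t-1 and t means question i was wrong
    at t-1 and right at t; an incorrect flip is the reverse. Rates are
    averaged over all questions.
    """
    n = len(gold)
    correct_flips, incorrect_flips = [], []
    for prev, curr in zip(answers, answers[1:]):
        c = sum(1 for p, q, g in zip(prev, curr, gold) if p != g and q == g) / n
        w = sum(1 for p, q, g in zip(prev, curr, gold) if p == g and q != g) / n
        correct_flips.append(c)
        incorrect_flips.append(w)
    return correct_flips, incorrect_flips
```

For instance, a question whose answer goes wrong→right between two iterations contributes to the correct-flip rate of that transition, while right→wrong contributes to the incorrect-flip rate.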
Taken together, these results show that open-ended generation offers adaptability and rapid early gains but suffers from instability in later iterations, whereas multiple-choice offers stability and incremental improvement but is hampered by inertia when the first choice is wrong. This confirms that self-correction effectiveness is strongly dependent on task format: open-ended generation can exploit its flexibility to correct errors but risks drift, while multiple-choice provides reliable retention of correct answers at the expense of recoverability; if the model's first answer is wrong, it rarely switches to the correct option in later iterations. This fundamental difference in dynamics directly answers RQ1: self-correction behaves very differently in open-ended versus fixed-option settings, with each paradigm exhibiting its own pattern of improvement and failure modes.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Chart: Average Correct Flips vs. Iteration
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," across five iterations. The performance metric is "Average Correct Flips." The chart includes shaded regions representing confidence intervals or variability around each line.
### Components/Axes
* **Chart Type:** Line chart with shaded confidence bands.
* **X-Axis (Horizontal):**
* **Label:** "Iteration"
* **Scale:** Linear, with discrete integer markers from 1 to 5.
* **Y-Axis (Vertical):**
* **Label:** "Average Correct Flips"
* **Scale:** Linear, ranging from 0.000 to 0.100, with major tick marks at 0.000, 0.025, 0.050, 0.075, and 0.100.
* **Legend:**
* **Position:** Top-center of the plot area.
* **Items:**
1. **Blue line with circle markers:** "Generation"
2. **Orange line with circle markers:** "Multiple-choice"
* **Data Series & Confidence Bands:**
* **Generation (Blue):** A solid blue line with circular data points. It is surrounded by a light blue shaded area.
* **Multiple-choice (Orange):** A solid orange line with circular data points. It is surrounded by a light orange shaded area.
### Detailed Analysis
**Data Point Extraction (Approximate Values):**
| Iteration | Generation (Blue Line) | Multiple-choice (Orange Line) |
| :--- | :--- | :--- |
| 1 | ~0.050 | ~0.060 |
| 2 | ~0.050 | ~0.050 |
| 3 | ~0.040 | ~0.030 |
| 4 | ~0.030 | ~0.010 |
| 5 | ~0.040 | ~0.020 |
**Trend Verification:**
* **Generation (Blue):** The line shows a slight overall downward trend from iteration 1 to 4, with a partial recovery at iteration 5. It starts at ~0.050, dips to a low of ~0.030 at iteration 4, and rises back to ~0.040.
* **Multiple-choice (Orange):** The line shows a steeper downward trend from iteration 1 to 4, followed by a small rebound at iteration 5. It starts higher than Generation at ~0.060, falls to a low of ~0.010 at iteration 4, and recovers slightly to ~0.020.
**Confidence Interval Observation:**
* The shaded blue area (Generation) is notably wide at iterations 1 and 5, suggesting higher variance or uncertainty in the data at the beginning and end of the measured sequence.
* The shaded orange area (Multiple-choice) is generally narrower but also shows increased width at iteration 1.
### Key Observations
1. **Performance Crossover:** The "Multiple-choice" method starts with a higher average correct flips score than "Generation" at iteration 1. However, its performance degrades more rapidly, falling below the "Generation" line by iteration 3 and remaining below it for the rest of the chart.
2. **Common Low Point:** Both methods experience their lowest measured performance at iteration 4.
3. **Differential Recovery:** While both methods show a performance increase from iteration 4 to 5, the "Generation" method recovers more strongly, returning to a level close to its starting point, whereas the "Multiple-choice" method shows only a modest rebound.
4. **Volatility:** The "Multiple-choice" series exhibits greater volatility, with a larger relative drop from its peak to its trough compared to the "Generation" series.
### Interpretation
The data suggests a comparative analysis of two iterative processes. The "Generation" method demonstrates more stable and resilient performance over the five iterations. Although it starts slightly lower, it maintains a more consistent output, with a less severe decline and a stronger recovery.
In contrast, the "Multiple-choice" method shows an initial advantage that is not sustained. Its performance deteriorates significantly, indicating it may be more sensitive to the iterative process or encounters a bottleneck around iteration 4. The partial recovery at iteration 5 for both methods could indicate an adaptive mechanism or a change in conditions.
The wide confidence interval for "Generation" at the start and end implies that while its average performance is stable, individual runs or instances may vary considerably. The chart implies that for tasks measured by "Average Correct Flips" over multiple iterations, the "Generation" approach may offer more predictable and robust long-term results, despite a potentially slower start.
</details>
(a) Baseline
<details>
<summary>x14.png Details</summary>

### Visual Description
## Line Chart: Average Correct Flips vs. Iteration
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," across five iterations. The performance metric is "Average Correct Flips." Both methods show a general downward trend, with "Generation" consistently outperforming "Multiple-choice" after the first iteration. Shaded regions around each line indicate variability or confidence intervals.
### Components/Axes
* **Chart Type:** Line chart with shaded confidence bands.
* **X-Axis (Horizontal):**
* **Label:** "Iteration"
* **Scale:** Discrete, linear scale from 1 to 5.
* **Markers:** Ticks at integers 1, 2, 3, 4, 5.
* **Y-Axis (Vertical):**
* **Label:** "Average Correct Flips"
* **Scale:** Linear scale from 0.000 to 0.100.
* **Markers:** Ticks at 0.000, 0.025, 0.050, 0.075, 0.100.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Series 1:** "Generation" - Represented by a blue line with solid circle markers.
* **Series 2:** "Multiple-choice" - Represented by an orange line with solid circle markers.
* **Data Series & Shading:**
* Each line is surrounded by a semi-transparent shaded band of the same color (light blue for Generation, light orange for Multiple-choice), representing the range of uncertainty or variance around the mean value.
### Detailed Analysis
**Trend Verification:**
* **Generation (Blue Line):** The line exhibits a steep downward slope from iteration 1 to 3, followed by a much gentler decline or near-plateau from iteration 3 to 5.
* **Multiple-choice (Orange Line):** The line shows a sharp initial drop from iteration 1 to 2, followed by a steady, gradual decline through iteration 5.
**Data Point Extraction (Approximate Values):**
| Iteration | Generation (Avg. Correct Flips) | Multiple-choice (Avg. Correct Flips) |
| :--- | :--- | :--- |
| 1 | ~0.070 | ~0.070 |
| 2 | ~0.065 | ~0.030 |
| 3 | ~0.050 | ~0.030 |
| 4 | ~0.040 | ~0.020 |
| 5 | ~0.040 | ~0.020 |
**Uncertainty Bands (Visual Estimate of Range):**
* **Generation:** The shaded band is widest at iteration 1 (spanning approx. 0.050 to 0.090) and narrows considerably by iteration 5 (spanning approx. 0.030 to 0.050).
* **Multiple-choice:** The shaded band is also widest at iteration 1 (approx. 0.050 to 0.090) and narrows by iteration 5 (approx. 0.010 to 0.030).
* **Overlap:** The confidence bands for the two methods overlap significantly at iteration 1 and show partial overlap at iterations 2 and 3. By iterations 4 and 5, the bands are distinct, with the Generation band positioned entirely above the Multiple-choice band.
### Key Observations
1. **Initial Parity:** Both methods start at an identical average performance level (~0.070) at iteration 1.
2. **Divergence:** A significant performance gap emerges immediately at iteration 2, with "Generation" maintaining a much higher value.
3. **Convergence of Trend:** While "Generation" remains superior, both methods follow a similar pattern of rapid initial performance loss followed by a slower rate of decline.
4. **Reducing Variance:** The narrowing of the shaded bands for both series suggests that the performance of each method becomes more consistent (less variable) as iterations progress.
5. **Clear Separation:** By the final two iterations (4 and 5), the performance of "Generation" is approximately double that of "Multiple-choice," and their confidence intervals no longer overlap.
### Interpretation
This chart demonstrates a comparative evaluation of two iterative processes. The key finding is that the "Generation" method is more robust or effective than the "Multiple-choice" method for the task measured by "Average Correct Flips."
* **Performance Decay:** The downward trend for both lines indicates that the task becomes more difficult or the methods become less effective with each successive iteration. This could be due to factors like increasing problem complexity, diminishing returns, or error accumulation.
* **Method Superiority:** The "Generation" method's ability to maintain a higher average score, especially after the first iteration, suggests it has a better strategy, more robust underlying model, or is less susceptible to the factors causing performance decay.
* **Reliability:** The narrowing confidence bands imply that as the process continues, the outcomes for each method become more predictable. The clear separation of the bands at the end provides strong visual evidence that the performance difference in later iterations is likely statistically significant.
* **Practical Implication:** If iterations represent a cost (time, computation), the chart suggests that using the "Generation" method yields a better return on investment, particularly in later stages. The "Multiple-choice" method suffers a more severe and immediate drop in performance.
</details>
(b) CoT
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Chart: Average Correct Flips Over Iterations
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," across five iterations. The performance metric is "Average Correct Flips." The chart includes shaded regions around each line, likely representing confidence intervals or standard deviation, indicating the variability of the data.
### Components/Axes
* **Chart Type:** Line chart with two data series and shaded error bands.
* **Y-Axis:**
* **Label:** "Average Correct Flips"
* **Scale:** Linear, ranging from 0.000 to 0.100.
* **Major Ticks:** 0.000, 0.025, 0.050, 0.075, 0.100.
* **X-Axis:**
* **Label:** "Iteration"
* **Scale:** Discrete, with integer values from 1 to 5.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Series 1:** "Generation" - Represented by a blue line with circular markers.
* **Series 2:** "Multiple-choice" - Represented by an orange line with circular markers.
* **Data Series & Shading:**
* The "Generation" (blue) line is surrounded by a light blue shaded area.
* The "Multiple-choice" (orange) line is surrounded by a light orange shaded area.
### Detailed Analysis
**Trend Verification & Data Points (Approximate):**
1. **Series: Generation (Blue Line)**
* **Visual Trend:** The line shows a fluctuating trend. It starts at a moderate level, dips, recovers slightly, and ends lower than its starting point. The shaded blue area is notably wide, especially at iterations 1 and 5, indicating high variance.
* **Data Points:**
* Iteration 1: ~0.050
* Iteration 2: ~0.030
* Iteration 3: ~0.030
* Iteration 4: ~0.040
* Iteration 5: ~0.030
2. **Series: Multiple-choice (Orange Line)**
* **Visual Trend:** The line shows a clear downward trend that plateaus. It starts as the higher-performing method, declines steadily until iteration 3, and then remains flat. The shaded orange area is narrower than the blue one, suggesting more consistent results.
* **Data Points:**
* Iteration 1: ~0.060
* Iteration 2: ~0.040
* Iteration 3: ~0.020
* Iteration 4: ~0.020
* Iteration 5: ~0.020
### Key Observations
* **Performance Crossover:** The "Multiple-choice" method starts with a higher average correct flips score (~0.060) than "Generation" (~0.050) at Iteration 1.
* **Convergence and Divergence:** The two methods converge at Iteration 2 (~0.030 vs ~0.040). By Iteration 3, "Generation" (~0.030) surpasses "Multiple-choice" (~0.020) and maintains a higher average score for the remaining iterations.
* **Stability vs. Variance:** The "Multiple-choice" method shows a stable, low plateau from Iteration 3 onward. The "Generation" method exhibits more fluctuation and has a significantly wider confidence band, indicating less predictable performance.
* **Overall Decline:** Both methods show a net decrease in "Average Correct Flips" from Iteration 1 to Iteration 5.
### Interpretation
The data suggests a trade-off between initial performance and long-term stability between the two methods.
* The **"Multiple-choice"** approach appears to be more effective initially but suffers a rapid and consistent decline, stabilizing at a low performance level. Its narrow error band implies this decline is a reliable outcome of the method under the tested conditions.
* The **"Generation"** approach starts slightly worse but demonstrates more resilience. While its performance fluctuates and is more variable (wide error band), it does not collapse to the same low baseline as the multiple-choice method. By the later iterations, it consistently outperforms the multiple-choice approach.
This pattern could indicate that the "Generation" method, while noisier, is better at maintaining a certain level of capability over repeated iterations, whereas the "Multiple-choice" method may be prone to a form of rapid degradation or saturation. The investigation would benefit from understanding the specific task ("Correct Flips") to contextualize why performance generally trends downward for both methods.
</details>
(c) SC
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Chart: Average Incorrect Flips Over Iterations
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," across five iterations. The performance metric is "Average Incorrect Flips," where a lower value indicates better performance. Both methods show a decreasing trend, indicating improvement over time.
### Components/Axes
* **Chart Type:** Line chart with shaded confidence intervals (or standard deviation bands).
* **X-Axis (Horizontal):**
* **Label:** "Iteration"
* **Scale:** Discrete, linear scale from 1 to 5.
* **Markers:** Ticks at integers 1, 2, 3, 4, 5.
* **Y-Axis (Vertical):**
* **Label:** "Average Incorrect Flips"
* **Scale:** Linear scale from 0.000 to 0.100.
* **Markers:** Ticks at 0.000, 0.025, 0.050, 0.075, 0.100.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Series 1:** "Generation" - Represented by a dark blue dashed line with circular markers.
* **Series 2:** "Multiple-choice" - Represented by an orange dashed line with circular markers.
* **Data Series & Shading:**
* Each line is accompanied by a semi-transparent shaded area of the same color, extending above and below the line. This likely represents a measure of variance (e.g., standard deviation, confidence interval).
### Detailed Analysis
**Trend Verification & Data Points (Approximate Values):**
1. **Generation (Blue Dashed Line):**
* **Visual Trend:** Starts high, experiences a sharp decrease between iterations 1 and 2, then continues a more gradual decline through iteration 5.
* **Data Points:**
* Iteration 1: ~0.070
* Iteration 2: ~0.040
* Iteration 3: ~0.040
* Iteration 4: ~0.030
* Iteration 5: ~0.030
* **Shading:** The blue shaded band is widest at iteration 1 (spanning approx. 0.050 to 0.090) and narrows considerably by iteration 5.
2. **Multiple-choice (Orange Dashed Line):**
* **Visual Trend:** Starts lower than Generation and shows a steady, consistent downward slope across all five iterations.
* **Data Points:**
* Iteration 1: ~0.040
* Iteration 2: ~0.030
* Iteration 3: ~0.020
* Iteration 4: ~0.020
* Iteration 5: ~0.010
* **Shading:** The orange shaded band is relatively consistent in width across iterations, spanning roughly ±0.015 from the central line.
### Key Observations
* **Performance Gap:** The "Multiple-choice" method consistently has a lower "Average Incorrect Flips" value than the "Generation" method at every iteration.
* **Rate of Improvement:** "Generation" shows a more dramatic initial improvement (a drop of ~0.030 between iterations 1 and 2) compared to the steadier improvement of "Multiple-choice."
* **Convergence:** By iteration 5, the performance gap between the two methods has narrowed compared to iteration 1, but "Multiple-choice" remains superior.
* **Variance:** The "Generation" method exhibits much higher variance (wider shaded area) in the early iterations, suggesting less consistent performance initially. Its variance decreases significantly over time. The "Multiple-choice" method shows more stable variance throughout.
### Interpretation
The chart demonstrates that both evaluated methods improve their accuracy (reduce incorrect flips) with repeated iterations. The "Multiple-choice" approach is not only more accurate from the outset but also maintains a more consistent performance (lower variance). The "Generation" approach starts with higher error and greater inconsistency but learns rapidly, particularly in the first step.
The data suggests a trade-off: "Multiple-choice" offers reliable, steady performance, while "Generation" may have a higher initial cost in errors but demonstrates a capacity for rapid early learning. The narrowing gap by iteration 5 indicates that with sufficient iterations, the performance of "Generation" approaches that of "Multiple-choice," though it does not surpass it within the observed timeframe. The investigation would benefit from knowing what happens beyond iteration 5 to see if the trends continue, plateau, or cross.
</details>
(d) Baseline
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line Chart: Average Incorrect Flips Over Iterations
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," across five iterations. The performance metric is the "Average Incorrect Flips." Each data series is represented by a dashed line with circular markers and includes a shaded region indicating a confidence interval or range of variability.
### Components/Axes
* **Chart Type:** Line chart with shaded confidence bands.
* **X-Axis (Horizontal):**
* **Label:** "Iteration"
* **Scale:** Discrete, linear scale from 1 to 5.
* **Markers:** 1, 2, 3, 4, 5.
* **Y-Axis (Vertical):**
* **Label:** "Average Incorrect Flips"
* **Scale:** Linear scale from 0.000 to 0.100.
* **Markers:** 0.000, 0.025, 0.050, 0.075, 0.100.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Series 1:** "Generation" - Represented by a blue dashed line with blue circular markers and a light blue shaded confidence band.
* **Series 2:** "Multiple-choice" - Represented by an orange dashed line with orange circular markers and a light orange shaded confidence band.
### Detailed Analysis
**Data Series: Generation (Blue)**
* **Trend:** The line shows a steep downward trend initially, followed by a plateau.
* **Data Points (Approximate):**
* Iteration 1: ~0.100
* Iteration 2: ~0.060
* Iteration 3: ~0.060
* Iteration 4: ~0.040
* Iteration 5: ~0.040
* **Confidence Band:** The blue shaded region is widest at Iteration 1 (spanning roughly 0.050 to 0.100+), narrows significantly by Iteration 2, and remains relatively narrow through Iterations 3-5.
**Data Series: Multiple-choice (Orange)**
* **Trend:** The line shows a slight initial increase, followed by a gentle decline and stabilization.
* **Data Points (Approximate):**
* Iteration 1: ~0.020
* Iteration 2: ~0.030
* Iteration 3: ~0.020
* Iteration 4: ~0.020
* Iteration 5: ~0.020
* **Confidence Band:** The orange shaded region is moderately wide at Iteration 1 (spanning roughly 0.000 to 0.040), peaks in width at Iteration 2, and then narrows slightly but remains consistently present through Iteration 5.
### Key Observations
1. **Initial Performance Gap:** At Iteration 1, the "Generation" method has a substantially higher average of incorrect flips (~0.100) compared to the "Multiple-choice" method (~0.020).
2. **Convergence:** The performance of the two methods converges significantly over time. By Iteration 5, the values are much closer (~0.040 vs. ~0.020) than at the start.
3. **Rate of Improvement:** The "Generation" method shows a dramatic improvement (steep decline) between Iterations 1 and 2. The "Multiple-choice" method shows a much flatter trend overall.
4. **Variability:** The "Generation" method exhibits very high initial variability (wide confidence band at Iteration 1), which decreases sharply. The "Multiple-choice" method shows more consistent, moderate variability throughout.
5. **Plateau:** Both methods appear to plateau after Iteration 3 or 4, showing little to no change in the average incorrect flips for the final iterations.
### Interpretation
The chart demonstrates a comparative learning or optimization process. The "Generation" method starts with poor performance (high incorrect flips) but improves rapidly, suggesting it may be learning from its errors effectively in the early stages. The "Multiple-choice" method starts with better performance but shows less dramatic improvement, indicating it may be a more stable but less adaptable approach from the outset.
The convergence of the two lines suggests that given enough iterations (5 in this case), the performance gap between the two methods narrows considerably. The narrowing confidence band for "Generation" implies that its performance becomes more predictable and consistent as iterations progress. The data suggests that while "Multiple-choice" is initially superior, "Generation" catches up significantly, and the choice between them might depend on the cost of early errors versus the final performance ceiling. The plateau indicates that further iterations beyond 5 may yield diminishing returns for both methods under the tested conditions.
</details>
(e) CoT
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Chart: Average Incorrect Flips Over Iterations
### Overview
The image is a line chart comparing the performance of two methods, "Generation" and "Multiple-choice," across five iterations. The performance metric is the "Average Incorrect Flips," where a lower value indicates better performance. Both methods show a general downward trend, suggesting improvement over successive iterations. The chart includes shaded regions around each line, representing confidence intervals or variability in the data.
### Components/Axes
* **Chart Type:** Line chart with shaded confidence bands.
* **X-Axis (Horizontal):**
* **Label:** "Iteration"
* **Scale:** Discrete, linear scale from 1 to 5.
* **Markers:** Major ticks at integers 1, 2, 3, 4, 5.
* **Y-Axis (Vertical):**
* **Label:** "Average Incorrect Flips"
* **Scale:** Linear scale from 0.000 to 0.100.
* **Markers:** Major ticks at 0.000, 0.025, 0.050, 0.075, 0.100.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Series 1:** "Generation" - Represented by a dark blue dashed line with circular markers.
* **Series 2:** "Multiple-choice" - Represented by an orange dashed line with circular markers.
* **Data Series & Confidence Bands:**
* Each line is surrounded by a semi-transparent shaded area of the corresponding color (blue for Generation, orange for Multiple-choice), indicating the range of uncertainty or variance around the mean value.
### Detailed Analysis
**Trend Verification:**
* **Generation (Blue Line):** The line exhibits a clear downward slope from iteration 1 to iteration 5, indicating a consistent reduction in average incorrect flips.
* **Multiple-choice (Orange Line):** The line also slopes downward from iteration 1 to iteration 4, showing improvement. Between iteration 4 and 5, the trend flattens or shows a very slight upward inflection.
**Data Point Extraction (Approximate Values):**
* **Iteration 1:**
* Generation: ~0.060
* Multiple-choice: ~0.050
* **Iteration 2:**
* Generation: ~0.050
* Multiple-choice: ~0.030
* **Iteration 3:**
* Generation: ~0.040
* Multiple-choice: ~0.020
* **Iteration 4:**
* Generation: ~0.030
* Multiple-choice: ~0.010
* **Iteration 5:**
* Generation: ~0.020
* Multiple-choice: ~0.015 (slight increase from iteration 4)
**Confidence Interval Observation:**
* The shaded confidence band for the "Generation" method is notably wider than that for "Multiple-choice," especially in the earlier iterations (1-3). This suggests greater variability or less certainty in the performance of the Generation method during the initial phases.
* The confidence bands for the two methods overlap significantly across all iterations, particularly from iteration 2 onward.
### Key Observations
1. **Initial Performance Gap:** At iteration 1, the "Multiple-choice" method starts with a lower average incorrect flip rate (~0.050) compared to the "Generation" method (~0.060).
2. **Rate of Improvement:** The "Generation" method shows a steeper initial decline between iterations 1 and 2. The "Multiple-choice" method improves steadily until iteration 4.
3. **Convergence and Divergence:** The performance of the two methods appears to converge around iteration 5, with both achieving low error rates (between 0.015 and 0.020). However, the "Multiple-choice" method shows a potential plateau or slight regression at the final step.
4. **Uncertainty:** The wide confidence interval for "Generation" implies that while its average performance improves, individual results may vary considerably. The "Multiple-choice" method's tighter band suggests more consistent performance.
### Interpretation
The chart demonstrates that both the "Generation" and "Multiple-choice" methods are effective at reducing errors (incorrect flips) over successive iterations, likely in a machine learning or optimization context. The "Multiple-choice" approach appears to offer a more consistent and initially superior performance, achieving lower error rates faster. However, the "Generation" method, despite higher initial error and variability, catches up by the fifth iteration.
The overlapping confidence intervals are a critical detail. They suggest that the observed differences in average performance between the two methods at any given iteration may not be statistically significant. A practitioner would need to consider this uncertainty; the apparent advantage of "Multiple-choice" might be less definitive than the mean lines alone suggest.
The slight uptick for "Multiple-choice" at iteration 5 could indicate a point of diminishing returns, potential overfitting, or simply noise within the confidence interval. This anomaly warrants further investigation to determine if it's a meaningful pattern or a random fluctuation.
**In summary:** The data suggests both methods are viable for reducing errors over time. "Multiple-choice" may be preferred for its consistency and faster initial gains, while "Generation" is a competitive alternative that achieves similar final performance, albeit with higher variability during the process. The choice between them might depend on the importance of early-stage performance versus final outcome, and the tolerance for result variability.
</details>
(f) SC
Figure 3: Average Correct and Incorrect Flips on tinyTruthfulQA
Effects of Model Scale and Prompting Strategy (RQ2).
<details>
<summary>x19.png Details</summary>

### Visual Description
## Line Charts: Model Accuracy Across Iterations on DisambiguationQA and tinyTruthfulQA
### Overview
The image displays a 2x3 grid of six line charts comparing the performance of various large language models on two question-answering benchmarks ("DisambiguationQA" and "tinyTruthfulQA") across three prompting methods ("Baseline", "CoT" (Chain-of-Thought), and "Self-Consistency"). Performance is measured as accuracy percentage over 6 iterations (0 to 5). Each chart plots multiple lines, each representing a specific model, with line color indicating the task type (Generation or Multiple-choice) and marker shape indicating the specific model.
### Components/Axes
* **Chart Grid:** 2 rows x 3 columns.
* **Top Row Title:** `DisambiguationQA`
* **Bottom Row Title:** `tinyTruthfulQA`
* **Column Titles (Left to Right):** `Baseline`, `CoT`, `Self-Consistency`
* **Axes (Identical for all charts):**
* **X-axis:** Label: `Iteration`. Ticks: `0, 1, 2, 3, 4, 5`.
* **Y-axis:** Label: `Accuracy (%)`.
* Top Row (DisambiguationQA) Scale: `0.00` to `0.40` (increments of 0.10).
* Bottom Row (tinyTruthfulQA) Scale: `0.0` to `0.8` (increments of 0.2).
* **Legend (Located at the bottom center of the entire image):**
* **Task Type (Color):**
* Blue Circle: `Generation`
* Orange Circle: `Multiple-choice`
* **Model (Marker Shape & Label):**
* Gray Square: `Gemini-2.0-Flash`
* Gray Upward Triangle: `Qwen2.5-14B`
* Gray Downward Triangle: `Llama-3.1-8B`
* Gray Diamond: `SmolLM2-1.7B`
* Gray Left Triangle: `DeepSeek-R1-Distill-Llama-8B`
* Gray Right Triangle: `Qwen2.5-3B`
* **Note:** The legend uses gray for all model markers. In the charts, the lines are colored blue or orange based on the task type, and the specific model is identified by its unique marker shape on that colored line.
### Detailed Analysis
**1. DisambiguationQA - Baseline (Top-Left Chart)**
* **Trend:** Performance is generally low and stable across iterations for most models. There is a clear separation between task types.
* **Multiple-choice (Orange Lines):** Clustered in the upper band (~0.25 to ~0.38). The highest performer appears to be `Gemini-2.0-Flash` (square marker), starting near 0.38 and ending near 0.35. `Qwen2.5-14B` (up triangle) is also high, around 0.35.
* **Generation (Blue Lines):** Clustered in the lower band (~0.05 to ~0.20). The highest blue line is likely `Qwen2.5-14B` (up triangle), hovering around 0.18-0.20. The lowest is `SmolLM2-1.7B` (diamond), near 0.05.
**2. DisambiguationQA - CoT (Top-Middle Chart)**
* **Trend:** More variability and some upward trends compared to Baseline. The gap between task types narrows slightly.
* **Multiple-choice (Orange Lines):** Still generally higher, but with more fluctuation. `Gemini-2.0-Flash` (square) shows a dip at iteration 2 before recovering. Several models converge around 0.30-0.35 by iteration 5.
* **Generation (Blue Lines):** Shows more improvement. The top blue line (likely `Qwen2.5-14B`, up triangle) rises from ~0.20 to ~0.28. Other models like `Llama-3.1-8B` (down triangle) also show upward movement.
**3. DisambiguationQA - Self-Consistency (Top-Right Chart)**
* **Trend:** The highest overall performance and most distinct separation between top and bottom performers.
* **Multiple-choice (Orange Lines):** `Gemini-2.0-Flash` (square) is the clear leader, starting above 0.40 and maintaining a high level. `Qwen2.5-14B` (up triangle) is also strong, around 0.35-0.38.
* **Generation (Blue Lines):** The top blue line (`Qwen2.5-14B`, up triangle) performs well, around 0.30-0.32. A significant outlier is the lowest blue line (`SmolLM2-1.7B`, diamond), which remains very low, near 0.05-0.08.
**4. tinyTruthfulQA - Baseline (Bottom-Left Chart)**
* **Trend:** Extremely wide spread in performance. Some models excel, while others fail almost completely.
* **Multiple-choice (Orange Lines):** Two distinct clusters. Top cluster (`Gemini-2.0-Flash`, `Qwen2.5-14B`) is very high, ~0.75-0.80. Bottom cluster (`SmolLM2-1.7B`, `Qwen2.5-3B`) is very low, ~0.15-0.20.
* **Generation (Blue Lines):** Similar wide spread. Top blue line (`Qwen2.5-14B`, up triangle) is high, ~0.70. Bottom blue line (`SmolLM2-1.7B`, diamond) is near 0.10.
**5. tinyTruthfulQA - CoT (Bottom-Middle Chart)**
* **Trend:** Performance for top models remains high but shows more volatility. The low-performing cluster remains consistently poor.
* **Multiple-choice (Orange Lines):** Top models (`Gemini-2.0-Flash`, `Qwen2.5-14B`) fluctuate between 0.70 and 0.80. The low cluster (`SmolLM2-1.7B`, `Qwen2.5-3B`) stays flat near 0.20.
* **Generation (Blue Lines):** The top blue line (`Qwen2.5-14B`, up triangle) shows a notable dip at iteration 2 before recovering to ~0.70. The lowest blue line remains near 0.10.
**6. tinyTruthfulQA - Self-Consistency (Bottom-Right Chart)**
* **Trend:** Similar pattern to CoT, with high performers maintaining a lead and low performers stagnant.
* **Multiple-choice (Orange Lines):** `Gemini-2.0-Flash` (square) and `Qwen2.5-14B` (up triangle) are again top, ~0.75-0.80. The low cluster is unchanged.
* **Generation (Blue Lines):** The top blue line (`Qwen2.5-14B`, up triangle) is stable around 0.70. The lowest blue line (`SmolLM2-1.7B`, diamond) shows a slight upward tick at iteration 5 but remains below 0.20.
### Key Observations
1. **Task Type Dominance:** Across all charts and methods, **Multiple-choice** tasks (orange lines) consistently yield higher accuracy than **Generation** tasks (blue lines) for the same model.
2. **Model Performance Hierarchy:** A clear hierarchy exists. `Gemini-2.0-Flash` and `Qwen2.5-14B` are consistently top performers. `SmolLM2-1.7B` and `Qwen2.5-3B` are consistently the lowest performers, especially on tinyTruthfulQA.
3. **Benchmark Difficulty:** Models achieve significantly higher accuracy on **tinyTruthfulQA** (up to ~80%) compared to **DisambiguationQA** (max ~40%), suggesting the latter is a more challenging benchmark for these models.
4. **Prompting Method Impact:** Moving from **Baseline** to **CoT** and **Self-Consistency** generally improves performance, particularly for Generation tasks on DisambiguationQA. The effect is less pronounced on tinyTruthfulQA for the top models, as they are already near a performance ceiling.
5. **Stability:** Performance is relatively stable across iterations for most models, with some notable fluctuations (e.g., `Qwen2.5-14B` on tinyTruthfulQA-CoT at iteration 2).
### Interpretation
This data demonstrates the significant impact of both **task formulation** (Multiple-choice vs. Generation) and **prompting strategy** (Baseline, CoT, Self-Consistency) on LLM performance. The consistent superiority of Multiple-choice formats suggests that constrained output spaces are easier for models to handle accurately than open-ended generation for these QA tasks.
The stark performance gap between models like `Gemini-2.0-Flash` and `SmolLM2-1.7B` highlights the importance of model scale and capability. The fact that advanced prompting (CoT, Self-Consistency) provides a larger relative boost to weaker models on the harder benchmark (DisambiguationQA) indicates these techniques are most valuable for bridging capability gaps in complex reasoning tasks. Conversely, on the easier benchmark (tinyTruthfulQA), top models are already proficient, so advanced prompting yields diminishing returns.
The charts collectively argue that for reliable QA performance, one should consider: 1) using a capable base model, 2) framing the task as multiple-choice if possible, and 3) employing advanced prompting techniques like Self-Consistency, especially for challenging, ambiguous problems.
</details>
Figure 4: Accuracy per iteration per model on generation and multiple-choice.
Here, we investigate how a model's size and the prompting strategy influence self-correction, and whether these effects differ between the two output paradigms. Figure 4 provides a detailed view of accuracy per iteration for various models under different prompting methods. A clear finding is that task difficulty moderates these effects. On the challenging DisambiguationQA benchmark, accuracy is low for all models: even the largest models (e.g., Gemini-2.0-Flash, Qwen2.5-14B) plateau around 50% in multiple-choice and below 20% in generation, while smaller models perform far worse. In contrast, on the easier tinyTruthfulQA, generative accuracy ranges from 60–90% and multiple-choice from 50–80%, with even small models performing well. Thus, model scale yields clear benefits on harder tasks, but differences narrow considerably on simpler ones.
The prompting strategy has a modest but noticeable effect, more so on the difficult task. On DisambiguationQA, using an explicit CoT prompt or an SC approach yields slight accuracy improvements over the Baseline direct prompting. For example, prompting the model to "think step by step" or to consider multiple reasoning paths sometimes helps it disambiguate the question better, nudging up the accuracy by a few percentage points. These gains, while not dramatic, suggest that reasoning-oriented prompts can aid the model on ambiguous, challenging questions. In contrast, on tinyTruthfulQA, all three prompting strategies lead to very similar performance. The accuracy curves for different prompts on this task are nearly overlapping (Figure 4), indicating that when a question is relatively straightforward or the model already knows the domain (e.g., common truths vs. misconceptions), an elaborate prompt does not provide much benefit. In summary, prompting variations have a task-dependent impact: they can be slightly beneficial for resolving difficult queries (DisambiguationQA) but are mostly redundant for simpler factual questions (tinyTruthfulQA). This aligns with findings in the literature (Sprague et al., 2025).
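The SC strategy discussed above amounts to a majority vote over independently sampled answers. The following is an illustrative sketch, not our exact implementation; the `model` callable is a hypothetical stand-in for a temperature-sampled LLM call, and parsing the final answer out of a chain-of-thought trace is omitted for brevity:

```python
from collections import Counter
from typing import Callable

def self_consistency(model: Callable[[str], str], prompt: str, n_samples: int = 5) -> str:
    """Majority vote over independently sampled answers (the SC idea).

    `model` is assumed to return one final answer string per call when
    sampled with temperature > 0.
    """
    answers = [model(prompt) for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

With a deterministic model this reduces to a single call's answer; the vote only helps when sampling produces diverse reasoning paths.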
Model scale shows a similarly nuanced effect. Larger models generally outperform smaller ones, especially on DisambiguationQA, where 14B+ models clearly surpass 1–3B models. On tinyTruthfulQA, however, the performance gap narrows, with small models often approaching large-model accuracy. In some cases, scaling produces diminishing returns, indicating that size matters more for difficult tasks but offers limited advantage once a task is already within reach.
Notably, repeated iterations of self-correction do not consistently boost accuracy for either paradigm, regardless of model size or prompt strategy. Across our experiments, most performance curves over iterations (spanning iteration 0 through 5) are relatively flat after the initial step. As highlighted by Figure 4, it is rare to see a clear upward trajectory beyond the first one or two iterations; instead, accuracy often oscillates with minor gains or losses. For example, a model might correct a mistake at iteration 1, only to introduce a different mistake at iteration 3, ending up with accuracy similar to where it started. This plateauing behavior implies that giving the model many chances to self-correct yields diminishing returns. Neither larger scale nor advanced prompting fundamentally changes this outcome: their benefits tend to manifest in the first attempt or two, but they do not drive continual improvement with more iterations. In some cases, we even observed slight performance degradation with too many iterations (echoing the drift issues from RQ1). In summary, the impact of model scale and prompting strategy on self-correction is real but nuanced: larger models and CoT-style prompts can improve initial accuracy, especially on hard tasks, but these factors are task-dependent and ultimately insufficient to guarantee ongoing improvements through iterative self-correction alone. Multiple-choice and generation formats alike see their gains saturate early, and improvements from scaling or better prompting taper off without addressing the core limitations of each paradigm. We also found that the multiple-choice paradigm often reaped slightly more benefit from increased model size and reasoning prompts than the generation paradigm did (especially on DisambiguationQA), reinforcing the idea that constrained decision tasks can more readily capitalize on those enhancements.
Still, neither paradigm achieves a dramatically upward performance trend with iteration, a key insight for understanding the boundaries of current self-correction capabilities.
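The iterative protocol these per-iteration curves come from can be summarized in a minimal sketch. This is illustrative only: the `model` callable and the prompt wording are placeholders, not the exact prompts used in our experiments, and iteration 0 corresponds to the initial answer before any revision.

```python
from typing import Callable, List, Optional, Sequence

def self_correct(
    model: Callable[[str], str],
    question: str,
    options: Optional[Sequence[str]] = None,
    max_iters: int = 5,
) -> List[str]:
    """Minimal sketch of iterative self-correction.

    With `options`, the model must re-select from a fixed answer set
    (multiple-choice); without, it revises free-form text (generation).
    Returns the answer at every iteration, 0 through max_iters.
    """
    if options is not None:
        prompt = f"{question}\nOptions: {', '.join(options)}\nAnswer with one option."
    else:
        prompt = question
    history = [model(prompt)]  # iteration 0: initial answer
    for _ in range(max_iters):
        revision_prompt = (
            f"{prompt}\nYour previous answer: {history[-1]}\n"
            "Review your answer and give your final (possibly corrected) answer."
        )
        history.append(model(revision_prompt))
    return history
```

Accuracy per iteration (as in Figure 4) is then the fraction of questions whose answer at that iteration matches the gold label.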
Trade-offs Between Adaptability and Stability (RQ3).
RQ3 examines how iterative self-correction influences correctness, stability, and semantic drift across unconstrained and constrained outputs. In the generation setting, flexibility allows models to revise and often improve answers in the first one or two iterations, but this same flexibility leads to semantic drift in later rounds. As shown in Figures 2 and 3, as well as in the detailed per-model plots in Appendix C.1, generation produces many flips: early ones are often correct (e.g., resolving an ambiguity or correcting a misconception), but over time, incorrect flips dominate as the model over-edits or drifts away from the question. This suggests that while generation supports adaptability, it lacks effective internal checks to prevent harmful revisions. By contrast, in the multiple-choice setting, the output space is restricted to fixed options, which prevents drift altogether. Correct answers remain locked in across iterations, reflecting high stability. However, this comes with logit inertia: wrong initial answers persist, with very few corrective flips observed in Figures 2 and 3. The mechanism here is that once a wrong option is selected, the model rarely shifts its ranking enough to choose the correct one later, even when revisiting its reasoning.
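The flip statistics in Figures 2 and 3 can be computed from per-iteration correctness labels. The sketch below assumes a flip is any change in correctness between consecutive iterations; the paper's exact bookkeeping (e.g., whether wrong-to-wrong answer changes are tracked separately) may differ.

```python
from typing import List, Tuple

def flip_counts(correct_by_iter: List[List[bool]]) -> Tuple[List[float], List[float]]:
    """Average correct and incorrect flips per iteration step.

    correct_by_iter[q][t] is True iff question q is answered correctly
    at iteration t. A "correct flip" turns a wrong answer right between
    iterations t-1 and t; an "incorrect flip" turns a right answer wrong.
    """
    n_q = len(correct_by_iter)
    n_iter = len(correct_by_iter[0])
    correct = [0] * (n_iter - 1)
    incorrect = [0] * (n_iter - 1)
    for labels in correct_by_iter:
        for t in range(1, n_iter):
            if not labels[t - 1] and labels[t]:
                correct[t - 1] += 1      # wrong -> right
            elif labels[t - 1] and not labels[t]:
                incorrect[t - 1] += 1    # right -> wrong
    return ([c / n_q for c in correct], [c / n_q for c in incorrect])
```

Under this accounting, logit inertia in multiple-choice shows up as near-zero values in both series, while generative drift shows up as incorrect flips growing relative to correct ones in later iterations.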
These patterns reveal a fundamental adaptability–stability trade-off. Generation is exploratory and can recover from initial mistakes, but risks undermining correctness as iterations accumulate. Multiple-choice ensures consistency once correct, but limits opportunities to fix errors. For system design, this implies that neither paradigm is universally optimal. Applications requiring stable outputs, such as safety-critical domains, benefit from constrained correction, though additional mechanisms may be needed to overcome inertia (e.g., external verification or re-ranking). Conversely, tasks where capturing every possible correction is crucial may favor open-ended revision, provided that safeguards against drift are implemented. Promising directions include hybrid strategies that combine paradigms, using generation to explore candidate answers followed by constrained verification to anchor correctness, and dynamic stopping rules that halt iteration once improvements saturate or harmful drift is detected. Addressing these trade-offs directly, by mitigating semantic drift in generation and reducing inertia in multiple-choice, will be key to making iterative self-correction a reliable capability of LLM systems.
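A dynamic stopping rule of the kind proposed above could be sketched as follows. This is a hypothetical illustration, not an implemented component of our experiments; it uses exact-match stabilization of the answer as the halting signal, where a drift detector or external verifier could be substituted.

```python
from typing import Callable

def iterate_with_stopping(
    model: Callable[[str], str],
    prompt: str,
    max_iters: int = 5,
    patience: int = 2,
) -> str:
    """Halt revision once the answer has stopped changing for `patience`
    consecutive rounds, so extra iterations cannot introduce late-stage
    drift. `model` is an assumed prompt -> answer callable.
    """
    answer = model(prompt)
    stable_rounds = 0
    for _ in range(max_iters):
        revised = model(f"{prompt}\nPrevious answer: {answer}\nRevise if needed.")
        if revised.strip() == answer.strip():
            stable_rounds += 1
            if stable_rounds >= patience:
                break  # converged: stop before harmful drift can accumulate
        else:
            stable_rounds = 0
            answer = revised
    return answer
```

For generation, a softer stabilization criterion (e.g., semantic similarity between consecutive revisions rather than exact string match) would be the natural substitute.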
## 6 Conclusion
This study compared iterative self-correction in large language models across open-ended generation and multiple-choice question answering. Results show that the structure of the output space fundamentally shapes correction dynamics. Generation achieves rapid early gains by correcting errors in the first few iterations, but suffers from semantic drift as revisions accumulate, resulting in increasing rates of incorrect flips. Multiple-choice responses remain highly stable and avoid drift, but exhibit logit inertia: wrong initial answers are rarely overturned, and improvements are incremental at best. Model scale and prompting strategy modulate performance but do not alter these core patterns. Larger models and reasoning-oriented prompts (CoT, SC) yield slight improvements, especially on the harder DisambiguationQA task, but their effects are modest and task-dependent. Across both paradigms, accuracy generally plateaus after the first one or two iterations, showing that repeated self-correction brings limited benefit.
These findings highlight an inherent adaptability–stability trade-off. Open-ended generation enables recovery from errors but risks instability, while multiple-choice ensures reliability but limits correction. Future work should explore hybrid strategies, such as using generation for exploration and constrained formats for verification, as well as dynamic stopping criteria to prevent late drift. Addressing drift and inertia directly will be essential for building reliable self-correcting LLM systems.
## Limitations
This study focuses on benchmarks that provide parallel formulations for both open-ended generation and multiple-choice questions. While this setup enables a controlled analysis of self-correction across task formats, it also limits the number of datasets available for evaluation, as few benchmarks support both types of tasks. Moreover, our experiments are conducted using currently available models of moderate scale. Recent larger models, which may exhibit different self-correction dynamics and reasoning behaviors, are not included in our analysis. Future work could extend our study to such models to provide a more comprehensive understanding of scaling effects.
## Ethical Considerations
We have carefully verified that the software, model checkpoints and existing datasets utilised in this work are permitted for access, distribution and, where relevant, modification. Our use and purpose comply with those terms.
## Acknowledgments
This research is supported by the Engineering and Physical Sciences Research Council [EP/S021566/1] and the EPSRC Fellowship titled "Task Based Information Retrieval" [EP/P024289/1].
## References
- A2i (2025) A2i. 2025. TruthfulQA Truth Judge. Accessed: 2025.
- Allal et al. (2025) Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martin Blazquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Agustín Piqueres Lajarín, Hynek Kydlíček, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan Son NGUYEN, Ben Burtenshaw, Clémentine Fourrier, Haojun Zhao, Hugo Larcher, Mathieu Morlon, Cyril Zakka, and 3 others. 2025. SmolLM2: When smol goes big – data-centric training of a fully open small language model. In Second Conference on Language Modeling.
- Belcak et al. (2025) Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. 2025. Small language models are the future of agentic ai. arXiv preprint arXiv:2506.02153.
- Brown and et al. (2020) Tom B Brown and et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Chang et al. (2024) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, and 1 others. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
- Chen et al. (2023) Mark Y Chen, Chia-Wei Liu, Xuezhi Wang, Quoc V Le, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Program-aided language models: Language models as programs. arXiv preprint arXiv:2303.11366.
- Clark and et al. (2018) Peter Clark and et al. 2018. Think you have reasoning solved? evaluating the arc challenge. arXiv preprint arXiv:1803.05457.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Anish Madaan, and et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Comanici et al. (2025) Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- Cook et al. (2024) Jonathan Cook, Tim Rocktäschel, Jakob Nicolaus Foerster, Dennis Aumiller, and Alex Wang. 2024. TICKing all the boxes: Generated checklists improve LLM evaluation and generation. In Language Gamification - NeurIPS 2024 Workshop.
- Dixit et al. (2023) Tanay Dixit, Fei Wang, Muhao Chen, and et al. 2023. Improving factuality of abstractive summarization without sacrificing summary quality. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 902–913.
- Gou et al. (2024) Zhibin Gou, Zhihong Shao, Yeyun Gong, yelong shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2024. CRITIC: Large language models can self-correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations.
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Santi Basart, and et al. 2021. Measuring massive multitask language understanding. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10013–10023.
- Huang et al. (2023) Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2023. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
- Ji et al. (2023a) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023a. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
- Ji et al. (2023b) Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023b. Towards mitigating llm hallucination via self-reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843.
- Kamoi et al. (2024) Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, and Rui Zhang. 2024. When can llms actually correct their own mistakes? a critical survey of self-correction of llms. Transactions of the Association for Computational Linguistics, 12:1417–1440.
- Kazemi et al. (2025) Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K Jain, Virginia Aglietti, Disha Jindal, Peter Chen, and 1 others. 2025. Big-bench extra hard. arXiv preprint arXiv:2502.19187.
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.
- Krishna et al. (2024) Satyapriya Krishna, Chirag Agarwal, and Himabindu Lakkaraju. 2024. Understanding the effects of iterative prompting on truthfulness. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.
- Kumar et al. (2025) Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. 2025. Training language models to self-correct via reinforcement learning. In The Thirteenth International Conference on Learning Representations.
- Li and et al. (2024) Wei Li and et al. 2024. Logitlens: Calibrating reasoning in language models with internal consistency. NeurIPS 2024.
- Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), pages 2129ā2144.
- Liu et al. (2024) Dancheng Liu, Amir Nassereldine, Ziming Yang, Chenhui Xu, Yuting Hu, Jiajie Li, Utkarsh Kumar, Changjae Lee, Ruiyang Qin, Yiyu Shi, and 1 others. 2024. Large language models have intrinsic self-correction ability. arXiv preprint arXiv:2406.15673.
- Ma et al. (2025) Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, and Jia Li. 2025. S 2 R: Teaching LLMs to self-verify and self-correct via reinforcement learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22632ā22654.
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 others. 2023. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534ā46594.
- Pan and et al. (2023) Xinyu Pan and et al. 2023. Multiple rationales for multiple-choice question answering. arXiv preprint arXiv:2305.03495.
- Polo et al. (2024) Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. 2024. tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992.
- Pryzant (2023) Ryan Pryzant. 2023. Automatic prompt optimization with "gradient descent" for language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 494ā507.
- Qwen et al. (2025) Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634ā8652.
- Spataru (2024) Ava Spataru. 2024. Know when to stop: A study of semantic drift in text generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3656ā3671.
- Sprague et al. (2025) Zayne Rea Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. 2025. To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning. In The Thirteenth International Conference on Learning Representations.
- Suzgun and Kalai (2024) Mirac Suzgun and Adam Tauman Kalai. 2024. Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954.
- Suzgun et al. (2025) Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. 2025. Dynamic cheatsheet: Test-time learning with adaptive memory. arXiv preprint arXiv:2504.07952.
- Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on Neural Information Processing Systems.
- Zelikman et al. (2022) Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. STar: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and 1 others. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595ā46623.
- Zhu and et al. (2024) Xue Zhu and et al. 2024. Mcrepair: Enhancing multiple-choice reasoning with self-explanation and rescoring. arXiv preprint arXiv:2405.18711.
## Appendix A Details on Experimental Setup
### A.1 Details on Final Answer Extraction
For all of our problems, we added a short phrase to the text of the question to guide the model to give the final answer in a clear format: ā provide your final answer after the āThe final answer is: ā.ā To extract the answer, we split the output of the model using this phrase and take what comes after it. Since models sometimes change the phrase slightly, we also check for different variations until one is found: āThe answer is: ā, āThe answer is ā. āThe final answer is: ā, āThe final answer is ā. Once we get the final answer, we clean it up with a few simple steps:
1. If the answer is inside symbols like boxed, text, texttt, or wrapped in **, we remove those and keep only the text inside.
1. For multiple-choice questions, if the model adds extra text after the final answer (for example, by putting a newline \n), we split on \n and keep only the first part. We then lowercase both the final answer and the label, and then check the correctness with the following rules:
- If the final answer and label are identical, we consider the final answer correct.
- If they only differ by quotes or brackets around the answer, we consider it to be correct.
- For multiple-choice questions, the label is in the format (<LETTER>). If the model only gives the letter (like A instead of (A)), we still count it as correct.
### A.2 Prompts
#### A.2.1 Start Prompts
Baseline
Question: {question}. Provide your final answer after the āThe final answer is: ā.
Chain-of-Thought (CoT)
Question: {question}. Think step by step, and provide your final answer after the āThe final answer is: ā.
#### A.2.2 Iterative (Self-Correction) Prompts
Baseline
Question: {question}. Review your previous responses, and provide your final answer after the āThe final answer is: ā.
Chain-of-Thought (CoT)
Question: {question}. Review your previous responses, think step by step and provide your final answer after the āThe final answer is: ā.
## Appendix B Evaluation Protocol
Given the differences between task formats, we adopt distinct evaluation strategies tailored to the characteristics of each settingāopen-ended generation and multiple-choice questions. For multiple-choice questions, we use Soft Match (SM) Suzgun and Kalai (2024); Suzgun et al. (2025), a lenient metric that considers an answer correct if the ground-truth label appears in the modelās output, disregarding minor formatting variations such as punctuation or whitespace.
For open-ended generation, we employ the LLM-as-a-Judge Zheng et al. (2023) approach to assess the correctness of the generated answers relative to the ground-truth responses for each dataset. Specifically, we use the fine-tuned model https://github.com/yizhongw/truthfulqa_reeval introduced by A2i for evaluating generations on tinyTruthfulQA. For DisambiguationQA, we prompt a large model, GPT-4o, by providing the question, the model-generated answer, and the reference answer, asking it to determine whether the generated answer is correct. The exact prompt used for DisambiguationQA evaluation is shown below:
Evaluation Prompt
You are an expert in answer correctness evaluation. Given a question, its reference answer, and a generated answer, please evaluate the correctness of the generated answer based on the question and the reference answer. Here are the question, reference answer, and generated answer: - Question: {question} - Reference Answer: {gold answer} - Generated Answer: {generated answer} Please assess the correctness of the generated answer by considering the question and comparing it against the reference answer. Return yes if the generated answer is completely correct, otherwise, return ānoā. The final answer must only be āyesā or ānoā, corresponding to the correctness of the generated answer.
## Appendix C Additional Experiments and Results
<details>
<summary>x20.png Details</summary>

### Visual Description
## Bar Chart: Model Accuracy Comparison (Generation vs. Multiple-choice)
### Overview
The image is a vertical bar chart comparing the accuracy of seven different large language models on two distinct task types: "Generation" and "Multiple-choice." The chart visually demonstrates a consistent performance gap between the two evaluation methods across all models shown.
### Components/Axes
* **Chart Type:** Grouped bar chart.
* **X-axis (Horizontal):** Lists seven model names. From left to right:
1. DeepSeek-V3
2. Llama-3.1-405B
3. Qwen2-110B
4. Qwen2-72B
5. SmolLM2-1.7B
6. Llama-3.1-70B
7. Qwen2-7B-Plain
* **Y-axis (Vertical):** Labeled "Accuracy (%)". The scale runs from 0 to 0.5 (representing 0% to 50%), with major tick marks at 0.1 intervals (0.1, 0.2, 0.3, 0.4, 0.5).
* **Legend:** Located in the top-right corner of the chart area.
* A blue square corresponds to the label "Generation".
* An orange square corresponds to the label "Multiple-choice".
* **Data Series:** Two series of bars are plotted for each model on the x-axis.
* **Blue Bars (Left):** Represent "Generation" accuracy.
* **Orange Bars (Right):** Represent "Multiple-choice" accuracy.
### Detailed Analysis
For each model, the "Multiple-choice" (orange) bar is significantly taller than the "Generation" (blue) bar. Approximate accuracy values, estimated from the bar heights relative to the y-axis, are as follows:
| Model Name | Generation Accuracy (Blue, Approx.) | Multiple-choice Accuracy (Orange, Approx.) |
| :--- | :--- | :--- |
| DeepSeek-V3 | ~0.28 (28%) | ~0.38 (38%) |
| Llama-3.1-405B | ~0.30 (30%) | ~0.50 (50%) |
| Qwen2-110B | ~0.48 (48%) | ~0.50 (50%) |
| Qwen2-72B | ~0.32 (32%) | ~0.42 (42%) |
| SmolLM2-1.7B | ~0.05 (5%) | ~0.35 (35%) |
| Llama-3.1-70B | ~0.08 (8%) | ~0.35 (35%) |
| Qwen2-7B-Plain | ~0.42 (42%) | ~0.50 (50%) |
**Trend Verification:**
* **Generation Series (Blue):** The trend is highly variable. It starts moderate (~28%), rises to a peak with Qwen2-110B (~48%), then drops sharply for SmolLM2-1.7B and Llama-3.1-70B (both below 10%), before rising again for Qwen2-7B-Plain (~42%).
* **Multiple-choice Series (Orange):** The trend is more stable and consistently high. All models achieve between ~35% and ~50% accuracy. The lowest values are for SmolLM2-1.7B and Llama-3.1-70B (~35%), while three models (Llama-3.1-405B, Qwen2-110B, Qwen2-7B-Plain) reach or approach the 50% mark.
### Key Observations
1. **Universal Performance Gap:** Every single model performs substantially better on the "Multiple-choice" task than on the "Generation" task. The gap is often 20 percentage points or more.
2. **Outlier in Generation Performance:** The "SmolLM2-1.7B" model shows an extremely low "Generation" accuracy (~5%), which is a dramatic outlier compared to its "Multiple-choice" performance (~35%) and the generation scores of other models.
3. **Top Performers:** "Qwen2-110B" and "Qwen2-7B-Plain" show the strongest combined performance, with high scores in both categories, though multiple-choice remains superior.
4. **Scale vs. Performance:** There is no clear, linear correlation between model size (as implied by the names, e.g., 405B vs. 1.7B) and accuracy in this chart. For example, the largest model (Llama-3.1-405B) does not have the highest generation score, and a smaller model (Qwen2-7B-Plain) outperforms several larger ones in generation.
### Interpretation
This chart provides a clear, data-driven insight into a fundamental challenge in evaluating large language models. The consistent and large disparity between "Multiple-choice" and "Generation" accuracy suggests that **the format of the evaluation task dramatically influences the measured performance of a model.**
* **What the data suggests:** Models are significantly more proficient at selecting a correct answer from a predefined set (multiple-choice) than they are at generating a correct answer from scratch (generation). This implies that the cognitive or computational load of open-ended generation is much higher, or that models are better optimized for recognition-based tasks than creation-based ones.
* **How elements relate:** The side-by-side bars for each model force a direct comparison, highlighting that the task type is a more dominant factor in the accuracy score than the specific model architecture or size in this particular evaluation.
* **Notable implications:** This has critical implications for AI benchmarking. If a model's capability is primarily reported using multiple-choice benchmarks, it may present an overly optimistic view of its ability to perform real-world tasks that require generating novel text, code, or solutions. The outlier performance of SmolLM2-1.7B in generation could indicate a specific weakness in that model's training or architecture for generative tasks, despite having reasonable recognition abilities. The chart argues for the necessity of using diverse evaluation methodologies to build a complete picture of a model's capabilities.
</details>
(a) Baseline
<details>
<summary>x21.png Details</summary>

### Visual Description
## Bar Chart: Model Accuracy Comparison (Generation vs. Multiple-choice)
### Overview
The image is a vertical bar chart comparing the accuracy (in percentage) of various language models on two distinct task types: "Generation" and "Multiple-choice". The chart uses a dual-bar format for each model, with blue bars representing Generation accuracy and orange bars representing Multiple-choice accuracy.
### Components/Axes
* **Chart Title:** "Accuracy (%)" (positioned at the top-left of the chart area).
* **Y-Axis:** Labeled "Accuracy (%)". The scale runs from 0 to 60, with major tick marks at intervals of 10 (0, 10, 20, 30, 40, 50, 60).
* **X-Axis:** Lists six distinct model names or categories. From left to right:
1. `Qwen2-0.5B`
2. `Llama-3-8B`
3. `Qwen2-14B`
4. `Qwen2-72B`
5. `Small-1.7B-70B` (Note: This label appears to be a composite or specific variant name).
6. `Qwen2-7B-Chat`
* **Legend:** Positioned at the bottom center of the chart. It contains two entries:
* A blue square labeled "Generation".
* An orange square labeled "Multiple-choice".
### Detailed Analysis
The following data points are approximate values extracted by visually aligning the top of each bar with the y-axis scale.
| Model Name | Generation Accuracy (Blue Bar) | Multiple-choice Accuracy (Orange Bar) |
| :--- | :--- | :--- |
| **Qwen2-0.5B** | ~20% | ~35% |
| **Llama-3-8B** | ~35% | ~55% |
| **Qwen2-14B** | ~45% | ~55% |
| **Qwen2-72B** | ~30% | ~40% |
| **Small-1.7B-70B** | ~5% | ~40% |
| **Qwen2-7B-Chat** | ~50% | ~55% |
**Visual Trend Verification:**
* **Generation (Blue Bars):** The trend is non-linear. Accuracy starts low (~20%), rises to a peak at `Qwen2-14B` (~45%), then dips significantly for `Qwen2-72B` (~30%) and plummets for `Small-1.7B-70B` (~5%), before rising sharply again to its highest point at `Qwen2-7B-Chat` (~50%).
* **Multiple-choice (Orange Bars):** The trend is more consistently high. It starts at ~35%, jumps to ~55% for `Llama-3-8B` and `Qwen2-14B`, dips to ~40% for `Qwen2-72B` and `Small-1.7B-70B`, and returns to ~55% for `Qwen2-7B-Chat`.
### Key Observations
1. **Consistent Performance Gap:** For every single model listed, the accuracy on Multiple-choice tasks (orange) is higher than on Generation tasks (blue). The gap is smallest for `Qwen2-7B-Chat` (~5 percentage points) and largest for `Small-1.7B-70B` (~35 percentage points).
2. **Highest and Lowest Performers:**
* The highest accuracy for **Generation** is achieved by `Qwen2-7B-Chat` (~50%).
* The highest accuracy for **Multiple-choice** is shared by `Llama-3-8B`, `Qwen2-14B`, and `Qwen2-7B-Chat` (all ~55%).
* The lowest accuracy for **Generation** is by `Small-1.7B-70B` (~5%).
* The lowest accuracy for **Multiple-choice** is by `Qwen2-0.5B` (~35%).
3. **Notable Anomaly:** The model labeled `Small-1.7B-70B` shows a dramatic disparity. It has the worst performance on Generation tasks by a large margin but performs moderately well on Multiple-choice tasks (~40%), comparable to the much larger `Qwen2-72B` model on the same task.
### Interpretation
This chart demonstrates a clear and consistent trend: the evaluated language models find "Multiple-choice" tasks significantly easier than "Generation" tasks. This is expected, as multiple-choice questions provide a constrained answer space and test recognition/recall, while generation requires open-ended synthesis and production of novel text.
The data suggests that model size (as implied by names like 0.5B, 8B, 14B, 72B) is not the sole determinant of performance, especially on generation tasks. For instance, `Qwen2-7B-Chat` outperforms the much larger `Qwen2-72B` on generation. This highlights the importance of model architecture, training data, and fine-tuning (as suggested by the "-Chat" suffix) for specific task types.
The outlier `Small-1.7B-70B` is particularly interesting. Its name is ambiguous, but its performance profileācatastrophic on generation, decent on multiple-choiceācould indicate a model heavily optimized or specialized for discriminative tasks, or perhaps a model that has undergone a form of distillation or pruning that severely impacted its generative capabilities while preserving its ability to select correct answers from a list. This chart effectively visualizes the fundamental difference in difficulty between these two core NLP task paradigms across a range of model architectures.
</details>
(b) CoT
<details>
<summary>x22.png Details</summary>

### Visual Description
## Bar Chart: Model Accuracy Comparison (Generation vs. Multiple-choice)
### Overview
The image is a vertical bar chart comparing the accuracy of six different language models on two distinct task types: "Generation" and "Multiple-choice". The chart displays performance on a scale from 0 to 0.5 (50% accuracy). The models are listed on the x-axis, and their corresponding accuracy scores are represented by paired bars.
### Components/Axes
* **Chart Title:** Partially visible at the top, appears to be "Accuracy (0-0.5)".
* **Y-Axis:**
* **Label:** "Accuracy (0-0.5)"
* **Scale:** Linear, ranging from 0.0 to 0.5 with major tick marks at 0.1 intervals (0.0, 0.1, 0.2, 0.3, 0.4, 0.5).
* **X-Axis:**
* **Label:** None explicitly stated, but contains model names.
* **Categories (from left to right):**
1. Qwen2.5-0.5B-Instruct
2. Llama-3-8B
3. Qwen2.5-14B
4. Qwen2.5-7B
5. SmallThinker-3B-1.7B
6. Qwen2.5-7B-Plain
* **Legend:**
* **Position:** Bottom center of the chart.
* **Items:**
* **Blue Square:** "Generation"
* **Orange Square:** "Multiple-choice"
### Detailed Analysis
The chart presents paired bars for each model. The blue bar represents "Generation" accuracy, and the orange bar represents "Multiple-choice" accuracy. All values are approximate, estimated from the visual height of the bars relative to the y-axis.
| Model Name | Generation Accuracy (Blue Bar, Approx.) | Multiple-choice Accuracy (Orange Bar, Approx.) |
| :--- | :--- | :--- |
| Qwen2.5-0.5B-Instruct | ~0.22 | ~0.45 |
| Llama-3-8B | ~0.39 | ~0.47 |
| Qwen2.5-14B | ~0.40 | ~0.52 |
| Qwen2.5-7B | ~0.33 | ~0.48 |
| SmallThinker-3B-1.7B | ~0.05 | ~0.24 |
| Qwen2.5-7B-Plain | ~0.45 | ~0.52 |
**Trend Verification:**
* **Generation (Blue Bars):** The trend is generally upward from left to right, with a significant dip for the "SmallThinker" model. The highest value is for "Qwen2.5-7B-Plain" (~0.45), and the lowest is for "SmallThinker-3B-1.7B" (~0.05).
* **Multiple-choice (Orange Bars):** The trend is more stable and consistently higher than the Generation scores. Values range from ~0.24 (SmallThinker) to ~0.52 (Qwen2.5-14B and Qwen2.5-7B-Plain).
### Key Observations
1. **Consistent Performance Gap:** For every model shown, the accuracy on "Multiple-choice" tasks is higher than on "Generation" tasks. The gap is most pronounced for the "Qwen2.5-0.5B-Instruct" and "SmallThinker-3B-1.7B" models.
2. **Model Performance Hierarchy:** The "Qwen2.5-14B" and "Qwen2.5-7B-Plain" models achieve the highest scores in both categories, with near-identical performance (~0.52) on the multiple-choice task.
3. **Significant Outlier:** The "SmallThinker-3B-1.7B" model is a clear outlier, performing substantially worse than all other models on both task types, especially on the generation task where its accuracy is near zero.
4. **Converging Performance:** The performance gap between the two task types narrows for the higher-performing models. For "Qwen2.5-7B-Plain", the scores are very close (~0.45 vs. ~0.52).
### Interpretation
This chart demonstrates a clear and consistent trend: the evaluated language models find "Multiple-choice" tasks significantly easier than open-ended "Generation" tasks. This suggests that constrained, recognition-based tasks (selecting from options) are less challenging for current model architectures than generative tasks requiring the creation of novel, coherent text.
The data implies that model scale and training (as seen in the progression from 0.5B to 14B parameters in the Qwen series) generally improve performance on both task types. However, the "SmallThinker" model's poor performance indicates that not all small models are equal; its specific architecture or training may be ill-suited for these benchmarks.
The near-parity in performance between "Qwen2.5-14B" and "Qwen2.5-7B-Plain" on the multiple-choice task is notable. It suggests that for this specific task type, a well-tuned 7B model can match a larger 14B model, highlighting the importance of model configuration and fine-tuning over raw parameter count alone. The chart ultimately serves as a comparative benchmark, illustrating the current state of model capabilities across different cognitive tasks.
</details>
(c) SC
<details>
<summary>x23.png Details</summary>

### Visual Description
\n
## Bar Chart: Model Accuracy Comparison (Generation vs. Multiple-choice)
### Overview
The image is a vertical bar chart comparing the accuracy percentages of various large language models on two distinct task types: "Generation" and "Multiple-choice." The chart presents a side-by-side comparison for each model, highlighting performance differences between the two evaluation paradigms.
### Components/Axes
* **Chart Title:** `Accuracy (%)` (Positioned at the top-left of the chart area).
* **Y-Axis:**
* **Label:** `Accuracy (%)` (Vertical text along the left axis).
* **Scale:** Linear scale from `0.0` to `1.0`, with major tick marks at `0.0`, `0.2`, `0.4`, `0.6`, `0.8`, and `1.0`.
* **X-Axis:**
* **Label:** None explicitly stated. The axis contains categorical labels for different models/configurations.
* **Categories (from left to right):**
1. `Qwen2.5-72B (Chat)`
2. `Llama-3.1-405B`
3. `Qwen2-72B-14B`
4. `Qwen2-7B-3B`
5. `Small-1.7B-1.7B`
6. `Qwen2-7B-Plain`
* **Legend:**
* **Position:** Centered at the bottom of the chart.
* **Items:**
* **Blue Square:** `Generation`
* **Orange Square:** `Multiple-choice`
### Detailed Analysis
The chart displays paired bars for each of the six model categories. The blue bar (Generation) is consistently positioned to the left of the orange bar (Multiple-choice) for each pair.
**Trend Verification & Data Points (Approximate Values):**
1. **Qwen2.5-72B (Chat):**
* **Generation (Blue):** The bar reaches the `1.0` line. **Trend:** Maximum value.
* **Multiple-choice (Orange):** The bar is slightly above the `0.8` line. **Approximate Value:** ~0.82.
2. **Llama-3.1-405B:**
* **Generation (Blue):** The bar is at the `1.0` line. **Trend:** Maximum value.
* **Multiple-choice (Orange):** The bar is slightly below the `0.8` line. **Approximate Value:** ~0.78.
3. **Qwen2-72B-14B:**
* **Generation (Blue):** The bar is at the `1.0` line. **Trend:** Maximum value.
* **Multiple-choice (Orange):** The bar is slightly above the `0.8` line. **Approximate Value:** ~0.82.
4. **Qwen2-7B-3B:**
* **Generation (Blue):** The bar is at the `1.0` line. **Trend:** Maximum value.
* **Multiple-choice (Orange):** The bar is slightly below the `0.8` line. **Approximate Value:** ~0.78.
5. **Small-1.7B-1.7B:**
* **Generation (Blue):** The bar is slightly below the `0.6` line. **Approximate Value:** ~0.58.
* **Multiple-choice (Orange):** The bar is slightly above the `0.8` line. **Approximate Value:** ~0.82.
6. **Qwen2-7B-Plain:**
* **Generation (Blue):** The bar is at the `1.0` line. **Trend:** Maximum value.
* **Multiple-choice (Orange):** The bar is also at the `1.0` line. **Trend:** Maximum value.
### Key Observations
1. **Performance Gap:** For the first four models listed, there is a consistent and notable performance gap. The "Generation" task accuracy is at or near the maximum (1.0 or 100%), while the "Multiple-choice" task accuracy is lower, hovering around 0.78-0.82 (78-82%).
2. **Significant Outlier:** The model labeled `Small-1.7B-1.7B` shows a complete reversal of the general trend. Its "Generation" accuracy (~0.58) is significantly lower than its "Multiple-choice" accuracy (~0.82). This is the only instance where the Multiple-choice bar is taller than the Generation bar.
3. **Perfect Parity:** The final model, `Qwen2-7B-Plain`, achieves perfect accuracy (1.0) on both task types, showing no performance gap.
4. **Scale Consistency:** The y-axis scale from 0.0 to 1.0 suggests these are normalized accuracy scores, likely representing 0% to 100%.
### Interpretation
The data suggests a fundamental difference in how these models perform on generative versus discriminative (multiple-choice) tasks. For most of the larger or chat-tuned models shown, generating correct text appears to be an easier task than selecting the correct option from a predefined set. This could indicate that the models' generative capabilities are more robust or better aligned with the evaluation metric for the "Generation" task.
The stark outlier, `Small-1.7B-1.7B`, implies that model size, architecture, or training regimen dramatically affects this balance. Its poor generative performance relative to its multiple-choice performance might point to limitations in its ability to produce coherent, correct text from scratch, even if it can recognize correct answers.
The perfect scores for `Qwen2-7B-Plain` are notable and could indicate either a very simple evaluation task for that specific model configuration or a potential ceiling effect in the benchmark used. The chart effectively communicates that model performance is not monolithic; it varies significantly based on the type of cognitive task required.
</details>
(d) Baseline
<details>
<summary>x24.png Details</summary>

### Visual Description
## Bar Chart: Model Performance Comparison (Generation vs. Multiple-choice)
### Overview
The image displays a grouped bar chart comparing the performance of six different language models on two distinct task types: "Generation" and "Multiple-choice." The performance is measured as a percentage, likely representing accuracy or a similar success metric. The chart uses a dark background with blue and orange bars for clear contrast.
### Components/Axes
* **Chart Type:** Grouped bar chart.
* **Title:** Not explicitly stated in the image. The chart's purpose is inferred from its content.
* **Y-Axis:**
* **Label:** "Percentage (%)"
* **Scale:** Linear scale from 0.0 to 1.0 (representing 0% to 100%).
* **Markers:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **X-Axis:**
* **Label:** Not explicitly labeled, but contains categorical model names.
* **Categories (from left to right):**
1. `Qwen2.5-72B`
2. `Llama-3.1-405B`
3. `Qwen2-72B`
4. `Qwen2-7B`
5. `Small-1.7B`
6. `Qwen2-5-72B`
* **Legend:**
* **Position:** Bottom center of the chart area.
* **Items:**
* **Blue Square:** "Generation"
* **Orange Square:** "Multiple-choice"
### Detailed Analysis
The chart presents performance data for six models across two tasks. Below is an extraction of the approximate values for each bar, based on visual alignment with the y-axis grid lines.
| Model Name | Generation (Blue Bar) | Multiple-choice (Orange Bar) |
| :--- | :--- | :--- |
| **Qwen2.5-72B** | ~0.95 (95%) | ~0.60 (60%) |
| **Llama-3.1-405B** | ~0.85 (85%) | ~0.80 (80%) |
| **Qwen2-72B** | ~0.85 (85%) | ~0.80 (80%) |
| **Qwen2-7B** | ~0.95 (95%) | ~0.80 (80%) |
| **Small-1.7B** | ~0.75 (75%) | ~0.20 (20%) |
| **Qwen2-5-72B** | ~0.95 (95%) | ~0.85 (85%) |
**Trend Verification per Data Series:**
* **Generation (Blue Bars):** The performance is consistently high across all models, with most scoring between 85% and 95%. The `Small-1.7B` model is the lowest performer in this category at approximately 75%. The trend is one of generally strong performance with a single notable dip.
* **Multiple-choice (Orange Bars):** Performance varies significantly more. It ranges from a low of ~20% (`Small-1.7B`) to a high of ~85% (`Qwen2-5-72B`). There is no uniform trend; performance is model-dependent.
### Key Observations
1. **Performance Gap:** A significant performance gap exists between the two tasks for the `Qwen2.5-72B` model (95% vs. 60%) and the `Small-1.7B` model (75% vs. 20%).
2. **Model Consistency:** The `Llama-3.1-405B` and `Qwen2-72B` models show the most balanced performance, with less than a 5% difference between their Generation and Multiple-choice scores.
3. **Outlier:** The `Small-1.7B` model is a clear outlier, showing the lowest performance in both categories, with a particularly drastic drop in Multiple-choice capability.
4. **Top Performer:** The `Qwen2-5-72B` model appears to be the top overall performer, achieving the highest score in Multiple-choice (~85%) while maintaining a top-tier Generation score (~95%).
5. **Task Difficulty:** For most models shown, the "Generation" task appears to be easier (yielding higher scores) than the "Multiple-choice" task, with the exception of the balanced `Llama-3.1-405B` and `Qwen2-72B`.
### Interpretation
This chart suggests that the evaluated language models possess significantly different strengths. The "Generation" task, which likely involves open-ended text creation, appears to be a more consistent strength across models of varying sizes (from 1.7B to 72B+ parameters). In contrast, "Multiple-choice" performance, which may require precise knowledge retrieval or reasoning within constrained options, is more volatile and model-specific.
The data implies that model size alone (e.g., 72B parameters) does not guarantee superior performance on all task types, as seen with `Qwen2.5-72B`'s lower Multiple-choice score. Conversely, the `Small-1.7B` model's poor performance, especially on Multiple-choice, highlights potential limitations in smaller models for tasks requiring precise factual recall or complex discrimination.
The most notable finding is the existence of models like `Qwen2-5-72B` and `Qwen2-7B` that achieve high scores in both categories, suggesting a more robust and versatile architecture or training regimen. This comparison is crucial for selecting the right model for a specific application: a model excelling in Generation may be preferred for creative writing assistants, while one with balanced or superior Multiple-choice performance might be better suited for QA systems or exam engines.
</details>
(e) CoT
<details>
<summary>x25.png Details</summary>

### Visual Description
## Grouped Bar Chart: Model Accuracy Comparison (Generation vs. Multiple-choice)
### Overview
The image is a grouped bar chart comparing the accuracy of seven different large language models on two distinct task types: "Generation" and "Multiple-choice". The chart uses blue bars for Generation tasks and orange bars for Multiple-choice tasks. The overall visual trend shows that most models perform better on Generation tasks than on Multiple-choice tasks, with one notable exception.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:**
* **Label:** `Accuracy (%)`
* **Scale:** Linear, ranging from 0.0 to 1.0 (representing 0% to 100%).
* **Major Ticks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **X-Axis:**
* **Label:** Model names.
* **Categories (from left to right):**
1. `Qwen2.5-72B-Instruct`
2. `Llama-3.1-405B`
3. `Qwen2-72B`
4. `Qwen2.5-32B`
5. `Qwen2.5-7B`
6. `Small-1.7B`
7. `Qwen2-7B-Plain`
* **Legend:**
* **Position:** Centered at the bottom of the chart.
* **Items:**
* Blue Square: `Generation`
* Orange Square: `Multiple-choice`
### Detailed Analysis
Below is the extracted data for each model, with approximate accuracy values read from the chart. The visual trend for each model is noted first.
1. **Qwen2.5-72B-Instruct**
* **Trend:** Generation accuracy is significantly higher than Multiple-choice.
* **Generation (Blue):** ~0.95 (95%)
* **Multiple-choice (Orange):** ~0.80 (80%)
2. **Llama-3.1-405B**
* **Trend:** Generation and Multiple-choice accuracies are very close, with Generation slightly higher.
* **Generation (Blue):** ~0.82 (82%)
* **Multiple-choice (Orange):** ~0.80 (80%)
3. **Qwen2-72B**
* **Trend:** Generation accuracy is higher than Multiple-choice.
* **Generation (Blue):** ~0.88 (88%)
* **Multiple-choice (Orange):** ~0.80 (80%)
4. **Qwen2.5-32B**
* **Trend:** Generation accuracy is notably higher than Multiple-choice.
* **Generation (Blue):** ~0.92 (92%)
* **Multiple-choice (Orange):** ~0.80 (80%)
5. **Qwen2.5-7B**
* **Trend:** Generation accuracy is substantially higher than Multiple-choice.
* **Generation (Blue):** ~0.50 (50%)
* **Multiple-choice (Orange):** ~0.18 (18%)
6. **Small-1.7B**
* **Trend:** Generation accuracy is higher than Multiple-choice.
* **Generation (Blue):** ~0.18 (18%)
* **Multiple-choice (Orange):** ~0.08 (8%)
7. **Qwen2-7B-Plain**
* **Trend:** **This is the only model where Multiple-choice accuracy is higher than Generation.**
* **Generation (Blue):** ~0.78 (78%)
* **Multiple-choice (Orange):** ~0.88 (88%)
### Key Observations
* **Performance Hierarchy:** The `Qwen2.5-72B-Instruct` model achieves the highest Generation accuracy (~95%). The `Qwen2-7B-Plain` model achieves the highest Multiple-choice accuracy (~88%).
* **Consistent Multiple-choice Baseline:** Five of the seven models (the first four and the last one) cluster around 80% accuracy on Multiple-choice tasks, suggesting a common performance ceiling or benchmark for this task type among these models.
* **Significant Performance Drop:** There is a dramatic drop in accuracy for both task types for the `Qwen2.5-7B` and `Small-1.7B` models, indicating a strong correlation between model size/capability and performance on these benchmarks.
* **Notable Anomaly:** `Qwen2-7B-Plain` is the sole outlier where the Multiple-choice score (~88%) exceeds the Generation score (~78%). This contrasts with the pattern seen in all other models.
### Interpretation
This chart provides a comparative snapshot of model capabilities across two fundamental NLP task paradigms: open-ended generation and constrained multiple-choice selection.
* **Task Difficulty Implication:** The general trend of higher Generation scores suggests that, for these specific models and benchmarks, the evaluated Generation tasks may be less challenging or better aligned with the models' pre-training than the Multiple-choice tasks. The consistent ~80% Multiple-choice score for larger models might indicate a specific type of reasoning or knowledge retrieval that is equally challenging for them.
* **Model Specialization:** The anomaly of `Qwen2-7B-Plain` performing better on Multiple-choice could imply a difference in its training data, fine-tuning procedure, or architecture that favors discriminative tasks over generative ones. The "-Plain" suffix might denote a base model without instruction tuning, which could explain this reversal.
* **Scale Matters:** The steep decline in performance for the 7B and 1.7B models underscores the importance of model scale for achieving high accuracy on these benchmarks. The performance gap between `Qwen2.5-7B` and `Qwen2.5-32B` is particularly stark.
* **Benchmark Insight:** The chart likely represents results from a specific evaluation suite. The data suggests that "Generation" and "Multiple-choice" are not monolithic categories; their relative difficulty is model-dependent. A model's strength in one does not perfectly predict its strength in the other, as evidenced by the `Qwen2-7B-Plain` case.
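The per-model Generation vs. Multiple-choice gap noted above can be tabulated with a short script. The values below are the approximate readings listed in this description, not exact experimental results:

```python
# Approximate accuracies read from the chart: (Generation, Multiple-choice).
approx = {
    "Qwen2.5-72B-Instruct": (0.95, 0.80),
    "Llama-3.1-405B": (0.82, 0.80),
    "Qwen2-72B": (0.88, 0.80),
    "Qwen2.5-32B": (0.92, 0.80),
    "Qwen2.5-7B": (0.50, 0.18),
    "Small-1.7B": (0.18, 0.08),
    "Qwen2-7B-Plain": (0.78, 0.88),
}

# Positive gap: Generation beats Multiple-choice; negative gap: the reverse.
gaps = {model: round(gen - mc, 2) for model, (gen, mc) in approx.items()}
outliers = [model for model, gap in gaps.items() if gap < 0]
print(gaps)
print(outliers)  # only Qwen2-7B-Plain shows Multiple-choice > Generation
```

Running this confirms the anomaly observation: `Qwen2-7B-Plain` is the single model with a negative gap.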
</details>
(f) SC
Figure 5: Cumulative accuracy (after final self-correction iteration) using different models on (top) DisambiguationQA and (bottom) tinyTruthfulQA. The results indicate that models perform completely differently on self-correction of generation and multiple-choice questions, depending on the dataset.
### C.1 Results on Correct and Incorrect Flips
Figures 6-11 show the correct and incorrect flips on different datasets and models.
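The flip proportions plotted in these figures can be computed with a minimal sketch, under the assumption that a "flip" at iteration *t* is any answer that differs from the one given at iteration *t-1*, counted as correct if the new answer matches the reference:

```python
def flip_proportions(answers, gold):
    """Per-iteration (correct_flip, incorrect_flip) proportions.

    answers: list of per-iteration answer lists, shape [iterations][questions]
    gold: reference answer for each question
    """
    n = len(gold)
    stats = []
    for prev, curr in zip(answers, answers[1:]):
        # A flip is any changed answer; correct if it now matches gold.
        correct = sum(1 for p, c, g in zip(prev, curr, gold) if c != p and c == g)
        incorrect = sum(1 for p, c, g in zip(prev, curr, gold) if c != p and c != g)
        stats.append((correct / n, incorrect / n))
    return stats

# Toy example: two iterations over four questions.
history = [["A", "B", "C", "D"],
           ["A", "C", "C", "A"]]
gold = ["A", "C", "C", "C"]
print(flip_proportions(history, gold))  # [(0.25, 0.25)]
```

Here one flip (B→C) fixes an answer while another (D→A) breaks one, giving equal correct and incorrect flip proportions.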
<details>
<summary>x26.png Details</summary>

### Visual Description
## Line Chart: SmolLM2-1.7B - Proportion of Flips Over Iterations
### Overview
This is a line chart titled "SmolLM2-1.7B" that plots the "Proportion of Flips" against "Iterations" (from 1 to 5). It compares four different metrics or conditions, represented by distinct lines. The chart appears to track changes in model behavior or output characteristics across sequential iterations.
### Components/Axes
* **Title:** SmolLM2-1.7B (Top center)
* **Y-Axis:**
* **Label:** Proportion of Flips (Left side, vertical)
* **Scale:** Linear, ranging from 0.00 to 0.10, with major tick marks at 0.00, 0.02, 0.04, 0.06, 0.08, and 0.10.
* **X-Axis:**
* **Label:** Iterations (Bottom center)
* **Scale:** Discrete, with values 1, 2, 3, 4, and 5.
* **Legend:** Located in the top-right corner of the plot area. It defines four data series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Black dashed line with circular markers.
4. **Incorrect Flip:** Black dashed line with square markers.
### Detailed Analysis
The following data points are approximate values extracted from the chart. The visual trend for each series is described first as a logic check.
1. **Generation (Blue Solid Line):**
* **Trend:** Fluctuates at a low level, dipping to near zero at iteration 3.
* **Data Points:**
* Iteration 1: ~0.01
* Iteration 2: ~0.02
* Iteration 3: ~0.00
* Iteration 4: ~0.01
* Iteration 5: ~0.02
2. **Multiple-Choice (Orange Solid Line):**
* **Trend:** Starts as the highest value, drops sharply, then shows a partial recovery with fluctuation.
* **Data Points:**
* Iteration 1: ~0.08
* Iteration 2: ~0.03
* Iteration 3: ~0.04
* Iteration 4: ~0.02
* Iteration 5: ~0.04
3. **Correct Flip (Black Dashed Line, Circle Markers):**
* **Trend:** Starts at zero, increases to a peak at iterations 3-4, then slightly decreases.
* **Data Points:**
* Iteration 1: 0.00
* Iteration 2: ~0.01
* Iteration 3: ~0.02
* Iteration 4: ~0.02
* Iteration 5: ~0.01
4. **Incorrect Flip (Black Dashed Line, Square Markers):**
* **Trend:** Starts at zero, remains low, and shows a very gradual increase with a peak at iteration 4.
* **Data Points:**
* Iteration 1: 0.00
* Iteration 2: 0.00
* Iteration 3: ~0.01
* Iteration 4: ~0.02
* Iteration 5: ~0.01
### Key Observations
* The **Multiple-Choice** proportion is dominant at the start (Iteration 1) but decreases significantly.
* The **Generation** proportion remains consistently low (≤0.02) across all iterations.
* Both **Correct Flip** and **Incorrect Flip** proportions start at zero and show a general, though modest, upward trend over the first four iterations before a slight dip at iteration 5.
* At Iteration 3, the **Generation** proportion drops to its minimum (~0.00), while the **Correct Flip** proportion reaches its peak (~0.02).
* The values for **Correct Flip** and **Incorrect Flip** are very similar, especially at iterations 4 and 5, where they are nearly indistinguishable on the chart.
### Interpretation
The chart likely illustrates the performance or behavioral evolution of the "SmolLM2-1.7B" model over five iterative steps. The "Proportion of Flips" could refer to changes in model outputs, such as flipping a prediction or answer choice.
* The high initial **Multiple-Choice** value suggests the model's outputs or errors were heavily concentrated in a multiple-choice context at the start. The subsequent drop indicates a shift away from this pattern.
* The consistently low **Generation** proportion suggests that "flips" related to open-ended generation tasks are rare throughout the process.
* The rise of **Correct Flip** and **Incorrect Flip** from zero implies that the iterative process introduces or increases the phenomenon of "flipping." The fact that correct and incorrect flips track closely together suggests the model's tendency to flip is not strongly correlated with the correctness of the flip at this scale.
* The overall low proportions (all below 0.10) indicate that "flipping" is a relatively infrequent event across all measured conditions. The most significant change is the redistribution from a high initial Multiple-Choice proportion to a more balanced, albeit low-level, distribution among the other metrics by iteration 5.
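The claim that flipping is nearly uncorrelated with correctness can be checked numerically. Using the approximate SmolLM2-1.7B flip values read from this chart (approximate readings, not exact results), the net per-iteration benefit of flipping is:

```python
# Approximate SmolLM2-1.7B flip proportions read from the chart.
correct_flip = [0.00, 0.01, 0.02, 0.02, 0.01]
incorrect_flip = [0.00, 0.00, 0.01, 0.02, 0.01]

# Net benefit per iteration: proportion of flips that help minus those that hurt.
net = [round(c - i, 2) for c, i in zip(correct_flip, incorrect_flip)]
print(net)                 # [0.0, 0.01, 0.01, 0.0, 0.0]
print(round(sum(net), 2))  # 0.02
```

The cumulative benefit (~0.02) is marginal, consistent with correct and incorrect flips tracking each other closely.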
</details>
(a) SmolLM2-1.7B
<details>
<summary>x27.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-3B Performance Across Iterations
### Overview
This is a line chart titled "Qwen2.5-3B" that plots the "Proportion of Correct" responses against the number of "Iterations" (from 1 to 5). It compares the performance of four different methods or conditions: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart shows how the accuracy of each method changes over five sequential iterations.
### Components/Axes
* **Chart Title:** Qwen2.5-3B (located at the top center).
* **X-Axis:** Labeled "Iterations". It has discrete markers at values 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of Correct". The scale ranges from 0.02 to 0.14, with major tick marks at intervals of 0.02 (0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14).
* **Legend:** Located in the top-right corner of the plot area. It defines four data series:
* **Generation:** Solid blue line with circular markers.
* **Multiple-Choice:** Solid orange line with circular markers.
* **Correct Flip:** Dashed blue line with circular markers.
* **Incorrect Flip:** Dashed orange line with circular markers.
### Detailed Analysis
The following data points are approximate values extracted from the chart. The trend for each series is described first, followed by the estimated values per iteration.
1. **Generation (Solid Blue Line):**
* **Trend:** Starts high, experiences a sharp drop at iteration 2, then shows a partial recovery and fluctuation.
* **Data Points (Approximate):**
* Iteration 1: 0.10
* Iteration 2: 0.03
* Iteration 3: 0.07
* Iteration 4: 0.07
* Iteration 5: 0.06
2. **Multiple-Choice (Solid Orange Line):**
* **Trend:** Starts moderately high, dips at iteration 2, rises to a peak at iteration 4, then declines sharply.
* **Data Points (Approximate):**
* Iteration 1: 0.09
* Iteration 2: 0.06
* Iteration 3: 0.06
* Iteration 4: 0.08
* Iteration 5: 0.02
3. **Correct Flip (Dashed Blue Line):**
* **Trend:** Starts at a similar level to Generation, drops significantly at iteration 2, then shows a steady upward trend.
* **Data Points (Approximate):**
* Iteration 1: 0.09
* Iteration 2: 0.03
* Iteration 3: 0.05
* Iteration 4: 0.05
* Iteration 5: 0.06
4. **Incorrect Flip (Dashed Orange Line):**
* **Trend:** Starts at the same point as Multiple-Choice and Correct Flip, drops at iteration 2, then fluctuates with a slight downward trend overall.
* **Data Points (Approximate):**
* Iteration 1: 0.09
* Iteration 2: 0.06
* Iteration 3: 0.04
* Iteration 4: 0.05
* Iteration 5: 0.05
### Key Observations
* **Universal Dip at Iteration 2:** All four methods show a decrease in the proportion of correct answers at the second iteration compared to the first.
* **Convergence at Iteration 5:** By the final iteration, the performance of three methods (Generation, Correct Flip, Incorrect Flip) converges within a narrow band between 0.05 and 0.06. The Multiple-Choice method is a significant outlier, dropping to the lowest point on the chart (0.02).
* **Diverging Paths:** After the initial dip, the methods follow different trajectories. "Correct Flip" shows the most consistent improvement from iteration 2 onward. "Multiple-Choice" is the most volatile, with a notable peak at iteration 4 before its final drop.
* **Initial Similarity:** At iteration 1, three of the four methods (Multiple-Choice, Correct Flip, Incorrect Flip) start at nearly the same performance level (~0.09), while "Generation" starts slightly higher (~0.10).
### Interpretation
The chart suggests that the Qwen2.5-3B model's performance on a given task is highly sensitive to both the method used (Generation vs. Multiple-Choice vs. Flip strategies) and the iteration number. The universal dip at iteration 2 could indicate a common point of difficulty, a change in task parameters, or a phase where the model is "re-learning" or adjusting.
The "Correct Flip" strategy demonstrates the most robust recovery and improvement after the initial setback, suggesting it may be a more stable method for this specific task over multiple iterations. In contrast, the "Multiple-Choice" method, while peaking at iteration 4, ends as the worst performer, indicating it may be less reliable or more prone to degradation over repeated trials.
The convergence of three methods at iteration 5 implies that, given enough iterations, different approaches may lead to a similar, albeit modest, level of accuracy. The data highlights the importance of evaluating model performance across multiple iterations and methods, as a single snapshot (e.g., only at iteration 1 or 4) could give a misleading impression of overall capability. The low absolute values (all below 0.14) suggest the underlying task is challenging for this model.
</details>
(b) Qwen2.5-3B
<details>
<summary>x28.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B - Proportion of Flips Over Iterations
### Overview
This is a line chart titled "Llama-3.1-8B" that plots the "Proportion of Flips" against "Iterations" for four distinct data series. The chart tracks changes in flip proportions across five discrete iterations, comparing two primary methods (Generation and Multiple-Choice) and two flip outcomes (Correct and Incorrect).
### Components/Axes
* **Chart Title:** "Llama-3.1-8B" (centered at the top).
* **X-Axis:** Labeled "Iterations". It has five major tick marks labeled 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of Flips". The scale ranges from 0.04 to 0.14, with major tick marks at 0.04, 0.06, 0.08, 0.10, 0.12, and 0.14.
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Black dashed line with circular markers.
4. **Incorrect Flip:** Black dashed line with square markers.
### Detailed Analysis
The following data points are approximate values extracted from the chart.
**1. Generation (Blue Solid Line):**
* **Trend:** Shows an overall downward trend with a significant dip at iteration 4.
* **Data Points:**
* Iteration 1: ~0.11
* Iteration 2: ~0.08
* Iteration 3: ~0.10
* Iteration 4: ~0.05 (lowest point)
* Iteration 5: ~0.07
**2. Multiple-Choice (Orange Solid Line):**
* **Trend:** Exhibits high volatility, with two peaks (iterations 2 and 4) and a sharp decline at iteration 5.
* **Data Points:**
* Iteration 1: ~0.10
* Iteration 2: ~0.12 (first peak)
* Iteration 3: ~0.08
* Iteration 4: ~0.11 (second peak)
* Iteration 5: ~0.08
**3. Correct Flip (Black Dashed Line, Circle Markers):**
* **Trend:** Features a dramatic, isolated spike at iteration 3, which is the highest value on the entire chart. Otherwise, it follows a pattern similar to the Generation line.
* **Data Points:**
* Iteration 1: ~0.11
* Iteration 2: ~0.07
* Iteration 3: ~0.14 (global maximum)
* Iteration 4: ~0.06
* Iteration 5: ~0.07
**4. Incorrect Flip (Black Dashed Line, Square Markers):**
* **Trend:** Follows a pattern very closely aligned with the Generation line, suggesting a strong correlation.
* **Data Points:**
* Iteration 1: ~0.11
* Iteration 2: ~0.08
* Iteration 3: ~0.10
* Iteration 4: ~0.05
* Iteration 5: ~0.06
### Key Observations
1. **Iteration 3 Anomaly:** The "Correct Flip" series experiences a massive, singular spike to ~0.14 at iteration 3, while all other series either dip or remain stable at that point. This is the most salient feature of the chart.
2. **Correlation:** The "Incorrect Flip" and "Generation" lines track each other almost perfectly across all iterations, indicating their proportions are tightly linked.
3. **Divergence at Iteration 4:** At iteration 4, the "Multiple-Choice" proportion rises to a peak (~0.11) while the "Generation" and "Incorrect Flip" proportions hit their lowest points (~0.05). This suggests an inverse relationship between these methods at this stage.
4. **Convergence at Start and End:** At iteration 1, three of the four series (Generation, Correct Flip, Incorrect Flip) start at approximately the same value (~0.11). By iteration 5, all four series converge within a narrow band between ~0.06 and ~0.08.
### Interpretation
The chart appears to analyze the behavior of the Llama-3.1-8B model over a series of iterative steps, likely during a training, fine-tuning, or evaluation process involving "flips" (which could refer to changes in model predictions, outputs, or states).
* **The Iteration 3 Spike:** The dramatic spike in "Correct Flip" at iteration 3 is a critical event. It suggests a specific intervention, data batch, or learning phase at that step caused a significant increase in desirable (correct) changes, without a corresponding increase in incorrect changes. This could indicate a successful learning milestone or the effect of a targeted optimization.
* **Method Comparison:** The "Multiple-Choice" method shows more volatile performance than the "Generation" method. Its peaks do not align with the "Correct Flip" spike, implying that the conditions leading to high flip rates in multiple-choice tasks are different from those that produce correct flips overall.
* **System Dynamics:** The tight coupling between "Incorrect Flip" and "Generation" proportions suggests that the generation process inherently carries a proportional risk of incorrect outcomes. The system's behavior stabilizes by iteration 5, with all metrics settling into a lower, more consistent range, possibly indicating convergence or the end of an active learning phase. The data demonstrates that flip proportions are highly sensitive to the iteration step, with specific steps (like 3 and 4) acting as pivotal points for different metrics.
</details>
(c) Llama-3.1-8B
<details>
<summary>x29.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B
### Overview
The image is a line chart titled "Qwen2.5-14B". It plots the "Proportion of Flips" on the y-axis against "Iterations" on the x-axis for four different data series. The chart compares the performance or behavior of different methods or conditions over a sequence of five iterations.
### Components/Axes
* **Title:** "Qwen2.5-14B" (centered at the top).
* **Y-Axis:**
* **Label:** "Proportion of Flips"
* **Scale:** Linear, ranging from 0.00 to 0.10.
* **Major Ticks:** 0.00, 0.02, 0.04, 0.06, 0.08, 0.10.
* **X-Axis:**
* **Label:** "Iterations"
* **Scale:** Discrete, with integer values.
* **Major Ticks:** 1, 2, 3, 4, 5.
* **Legend:** Located in the top-left corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed orange line with square markers.
### Detailed Analysis
The following table reconstructs the approximate data points for each series across the five iterations. Values are estimated from the chart's gridlines.
| Iteration | Generation (Blue Solid) | Multiple-Choice (Orange Solid) | Correct Flip (Blue Dashed, Circle) | Incorrect Flip (Orange Dashed, Square) |
| :--- | :--- | :--- | :--- | :--- |
| **1** | ~0.095 | ~0.050 | ~0.075 | ~0.040 |
| **2** | ~0.015 | ~0.025 | ~0.065 | ~0.020 |
| **3** | ~0.025 | ~0.010 | ~0.030 | ~0.000 |
| **4** | ~0.030 | ~0.015 | ~0.030 | ~0.010 |
| **5** | ~0.025 | ~0.000 | ~0.015 | ~0.000 |
**Trend Verification per Series:**
* **Generation (Blue Solid):** Starts as the highest value at Iteration 1. It experiences a **sharp, steep decline** between Iterations 1 and 2, then fluctuates at a low level (between ~0.015 and ~0.030) for the remaining iterations.
* **Multiple-Choice (Orange Solid):** Starts at a moderate level. It shows a **general downward trend** across all iterations, decreasing from ~0.050 to 0.000, with a slight increase at Iteration 4.
* **Correct Flip (Blue Dashed):** Starts as the second-highest value. It follows a **steady, consistent downward trend** from Iteration 1 to 5, with a notable drop between Iterations 2 and 3.
* **Incorrect Flip (Orange Dashed):** Starts as the lowest value. It shows a **declining trend**, reaching near zero by Iteration 3 and remaining at or near zero for Iterations 4 and 5.
### Key Observations
1. **Initial Dominance:** At Iteration 1, the "Generation" method has the highest proportion of flips (~0.095), significantly above the others.
2. **Convergence at Low Values:** By Iteration 5, all four series have converged to very low proportions of flips (≤0.025), with "Multiple-Choice" and "Incorrect Flip" reaching 0.000.
3. **Divergent Paths:** The "Generation" series exhibits the most volatile behavior, with a dramatic drop followed by minor fluctuations. In contrast, the "Correct Flip" series shows the smoothest, most monotonic decline.
4. **Relationship between Dashed Lines:** The "Correct Flip" (blue dashed) proportion is consistently higher than the "Incorrect Flip" (orange dashed) proportion at every iteration, suggesting a higher rate of correct flips versus incorrect ones throughout the process.
5. **Crossover Point:** Between Iterations 2 and 3, the "Generation" line drops below the "Correct Flip" line and remains below it for the rest of the chart.
### Interpretation
This chart likely visualizes the results of an experiment or evaluation involving the "Qwen2.5-14B" model. The "Proportion of Flips" metric suggests a process where outputs or answers are being changed ("flipped") from an initial state over successive iterations.
* **What the data suggests:** The process becomes more stable over time, as evidenced by the decreasing proportion of flips across all methods. The initial high rate for "Generation" indicates it was the most unstable or change-prone method at the start.
* **How elements relate:** The dashed lines ("Correct Flip" and "Incorrect Flip") may represent sub-categories or specific types of flips occurring within the broader "Generation" and "Multiple-Choice" methods. The fact that the "Correct Flip" line is always above the "Incorrect Flip" line is a positive indicator, showing that when flips occur, they are more likely to be corrections.
* **Notable trends/anomalies:** The most striking trend is the rapid stabilization of the "Generation" method after the first iteration. The near-zero values for "Incorrect Flip" and "Multiple-Choice" by the end suggest the process has reached a point of minimal change or error. The chart effectively demonstrates that iterative refinement reduces the need for flips, with different methods exhibiting distinct stabilization profiles.
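Observation 4 above (correct flips dominating incorrect flips at every iteration) can be verified directly from the reconstructed table. The values below are the approximate readings, not exact results:

```python
# Approximate Qwen2.5-14B values from the reconstructed table.
correct_flip = [0.075, 0.065, 0.030, 0.030, 0.015]
incorrect_flip = [0.040, 0.020, 0.000, 0.010, 0.000]

# Check: correct-flip proportion exceeds incorrect-flip at every iteration.
dominates = all(c > i for c, i in zip(correct_flip, incorrect_flip))
print(dominates)  # True
```

This strict per-iteration dominance distinguishes Qwen2.5-14B from the smaller models above, where the two flip types track each other closely.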
</details>
(d) Qwen2.5-14B
<details>
<summary>x30.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B Flip Proportions
### Overview
This is a line chart displaying the "Proportion of Flips" over five iterations for a model named "DeepSeek-R1-Distill-Llama-8B". The chart compares two primary categories ("Generation" and "Multiple-Choice"), each subdivided into "Correct Flip" and "Incorrect Flip" events. The data suggests an analysis of model behavior changes or corrections during a sequential process.
### Components/Axes
* **Chart Title:** "DeepSeek-R1-Distill-Llama-8B" (centered at the top).
* **Y-Axis:** Labeled "Proportion of Flips". Scale ranges from 0.00 to 0.08, with major tick marks at 0.00, 0.02, 0.04, 0.06, and 0.08.
* **X-Axis:** Labeled "Iterations". Discrete integer values from 1 to 5.
* **Legend:** Located in the top-right corner of the plot area. It defines four data series:
* **Generation - Correct Flip:** Solid blue line.
* **Generation - Incorrect Flip:** Dashed blue line.
* **Multiple-Choice - Correct Flip:** Solid orange line.
* **Multiple-Choice - Incorrect Flip:** Dashed orange line.
### Detailed Analysis
**Trend Verification & Data Point Extraction:**
1. **Generation - Correct Flip (Solid Blue Line):**
* **Trend:** Shows a general downward trend with a dip at iteration 2 and a slight recovery at iteration 3 before declining again.
* **Approximate Values:**
* Iteration 1: ~0.035
* Iteration 2: ~0.018
* Iteration 3: ~0.025
* Iteration 4: ~0.025
* Iteration 5: ~0.018
2. **Generation - Incorrect Flip (Dashed Blue Line):**
* **Trend:** Highly volatile. Starts mid-range, dips, rises to a peak at iteration 3, plummets to near zero at iteration 4, then spikes sharply to its highest point at iteration 5.
* **Approximate Values:**
* Iteration 1: ~0.025
* Iteration 2: ~0.022
* Iteration 3: ~0.035
* Iteration 4: ~0.000
* Iteration 5: ~0.050
3. **Multiple-Choice - Correct Flip (Solid Orange Line):**
* **Trend:** Relatively stable and high for the first four iterations, then drops significantly at the final iteration.
* **Approximate Values:**
* Iteration 1: ~0.060
* Iteration 2: ~0.060
* Iteration 3: ~0.050
* Iteration 4: ~0.060
* Iteration 5: ~0.025
4. **Multiple-Choice - Incorrect Flip (Dashed Orange Line):**
* **Trend:** Follows an almost identical path to its "Correct Flip" counterpart for the first four iterations, then diverges slightly at iteration 5, ending lower.
* **Approximate Values:**
* Iteration 1: ~0.060
* Iteration 2: ~0.060
* Iteration 3: ~0.050
* Iteration 4: ~0.060
* Iteration 5: ~0.022
### Key Observations
1. **Category Dominance:** The "Multiple-Choice" category (orange lines) consistently shows a higher proportion of flips than the "Generation" category (blue lines) for the first four iterations.
2. **Convergence at Iteration 5:** At the final iteration, the proportions for all series converge into a narrower range (between ~0.018 and ~0.050), with the "Generation - Incorrect Flip" series becoming the highest value.
3. **Anomalous Point:** The "Generation - Incorrect Flip" value at Iteration 4 is approximately 0.000, a dramatic outlier compared to its values at other iterations.
4. **Parallel Behavior:** The two "Multiple-Choice" lines (solid and dashed orange) track each other extremely closely until the final iteration, suggesting a strong correlation between correct and incorrect flip events in that context for most of the process.
### Interpretation
The chart likely visualizes the stability or correction behavior of the "DeepSeek-R1-Distill-Llama-8B" model during a multi-step evaluation or training process. "Flips" may refer to changes in the model's output or decision between iterations.
* **What the data suggests:** The model exhibits different flip dynamics depending on the task type. For "Multiple-Choice" tasks, flips (both correct and incorrect) are frequent and stable initially, then drop off. For "Generation" tasks, flips are less frequent overall but show more erratic behavior, culminating in a surge of incorrect flips at the end.
* **How elements relate:** The parallel trends in the Multiple-Choice lines imply that the factors driving correct and incorrect flips in that setting are similar until the final step. The divergence of the Generation lines, especially the spike in incorrect flips at iteration 5, indicates a potential breakdown or a specific challenge encountered in generative tasks at that stage.
* **Notable anomaly:** The near-zero value for "Generation - Incorrect Flip" at iteration 4 is a critical point. It could indicate a moment of perfect stability (no incorrect flips) or, more likely, a data collection anomaly or a specific phase in the process where incorrect flips were suppressed or not measured.
* **Overall implication:** The process does not lead to a monotonic decrease in flips. Instead, it reveals complex, task-dependent patterns. The final iteration shows a significant shift, with generative tasks becoming more prone to incorrect flips, while multiple-choice tasks become more stable. This could inform where to focus debugging or refinement efforts for the model.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x31.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Proportions Over Iterations
### Overview
The image is a line chart titled "Gemini-2.0-Flash" that plots the "Proportion of Flips" against "Iterations" (from 1 to 5). It compares two primary methods, "Generation" and "Multiple-Choice," each broken down into "Correct Flip" and "Incorrect Flip" sub-categories. The chart visualizes how the frequency of these flip events changes over five iterative steps.
### Components/Axes
* **Title:** "Gemini-2.0-Flash" (centered at the top).
* **Y-Axis:** Label is "Proportion of Flips." Scale ranges from 0.00 to 0.07, with major tick marks at 0.01 intervals.
* **X-Axis:** Label is "Iterations." Discrete integer markers from 1 to 5.
* **Legend:** Located in the top-right corner of the plot area. It defines four data series:
* **Generation - Correct Flip:** Solid blue line with circular markers.
* **Generation - Incorrect Flip:** Dashed blue line with square markers.
* **Multiple-Choice - Correct Flip:** Solid orange line with circular markers.
* **Multiple-Choice - Incorrect Flip:** Dashed orange line with square markers.
### Detailed Analysis
The chart tracks four distinct data series across five iterations. Values are approximate, read from the chart's grid.
**1. Generation - Correct Flip (Solid Blue Line, Circles)**
* **Trend:** Rises to a peak at iteration 2, then declines steadily.
* **Data Points (Approx.):**
* Iteration 1: 0.033
* Iteration 2: 0.042 (Peak)
* Iteration 3: 0.025
* Iteration 4: 0.025
* Iteration 5: 0.025
**2. Generation - Incorrect Flip (Dashed Blue Line, Squares)**
* **Trend:** Shows a general downward trend, reaching zero by the final iteration.
* **Data Points (Approx.):**
* Iteration 1: 0.045
* Iteration 2: 0.033
* Iteration 3: 0.017
* Iteration 4: 0.017
* Iteration 5: 0.000
**3. Multiple-Choice - Correct Flip (Solid Orange Line, Circles)**
* **Trend:** Declines sharply from the start, reaching zero by iteration 5.
* **Data Points (Approx.):**
* Iteration 1: 0.045
* Iteration 2: 0.033
* Iteration 3: 0.008
* Iteration 4: 0.008
* Iteration 5: 0.000
**4. Multiple-Choice - Incorrect Flip (Dashed Orange Line, Squares)**
* **Trend:** Decreases initially, hits a low at iteration 4, then shows a sharp increase at iteration 5.
* **Data Points (Approx.):**
* Iteration 1: 0.045
* Iteration 2: 0.033
* Iteration 3: 0.025
* Iteration 4: 0.017 (Lowest Point)
* Iteration 5: 0.042 (Sharp Increase)
### Key Observations
1. **Convergence to Zero:** The "Multiple-Choice - Correct Flip" and "Generation - Incorrect Flip" series both reach a proportion of 0.000 by iteration 5, while "Generation - Correct Flip" declines and then plateaus at roughly 0.025.
2. **Divergent Final Behavior:** The "Multiple-Choice - Incorrect Flip" series is the only one that does not end at or near zero. Instead, it exhibits a significant upward spike between iterations 4 and 5, nearly returning to its starting value.
3. **Peak Timing:** The "Generation - Correct Flip" series peaks early (iteration 2), while the "Multiple-Choice - Incorrect Flip" series has its lowest point at iteration 4 before spiking.
4. **Initial Similarity:** At iteration 1, three of the four series (Generation Incorrect, Multiple-Choice Correct, Multiple-Choice Incorrect) start at approximately the same proportion (~0.045).
### Interpretation
This chart likely illustrates the performance or behavior of a model (Gemini-2.0-Flash) during an iterative process, such as refinement, training, or a multi-step evaluation. "Flips" may refer to changes in model output, predictions, or decisions between steps.
* **What the data suggests:** The general downward trend for "Correct Flips" indicates that as iterations progress, the model makes fewer *correct* changes to its state or outputs. This could imply stabilization or convergence. The trend for "Incorrect Flips" is more complex. For the "Generation" method, incorrect changes also diminish to zero, suggesting the process becomes stable and error-free. However, for the "Multiple-Choice" method, the late spike in incorrect flips is a critical anomaly. It suggests that in the final iteration, this method experiences a resurgence of erroneous changes, potentially indicating instability, over-correction, or a failure mode specific to that method's logic in later stages.
* **Relationship between elements:** The chart directly compares two methodologies ("Generation" vs. "Multiple-Choice") across two outcome types ("Correct" vs. "Incorrect"). The key relationship is the divergent final behavior of the "Multiple-Choice - Incorrect Flip" series compared to all others, highlighting a potential weakness or different characteristic of that approach.
* **Notable anomaly:** The sharp increase in "Multiple-Choice - Incorrect Flip" from ~0.017 at iteration 4 to ~0.042 at iteration 5 is the most significant outlier. This reversal of trend warrants investigation into what occurs in the final step of the Multiple-Choice process.
</details>
(f) Gemini-2.0-Flash
Figure 6: Models Correct and Incorrect Flips on Baseline on DisambiguationQA
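The "Correct Flip" / "Incorrect Flip" proportions plotted in these panels can be made concrete: a flip is a question whose answer changes between consecutive self-correction rounds, counted as correct if it moves onto the gold answer and incorrect if it moves off it. A minimal sketch of that bookkeeping (the function name `flip_stats` is illustrative, not taken from the released code):

```python
def flip_stats(answers, gold):
    """Classify answer changes ("flips") between consecutive iterations.

    answers: per-iteration answer lists, shape [n_iterations][n_questions]
    gold:    ground-truth answers, length n_questions
    Returns, for each transition i -> i+1, the proportion of questions that
    flipped wrong -> right (correct flip) and right -> wrong (incorrect flip).
    """
    n = len(gold)
    stats = []
    for prev, curr in zip(answers, answers[1:]):
        correct = sum(1 for p, c, g in zip(prev, curr, gold) if p != g and c == g)
        incorrect = sum(1 for p, c, g in zip(prev, curr, gold) if p == g and c != g)
        stats.append({"correct_flip": correct / n, "incorrect_flip": incorrect / n})
    return stats

# Toy run: 3 questions, 3 iterations
gold = ["A", "B", "C"]
answers = [["A", "A", "A"],   # iteration 1
           ["A", "B", "A"],   # iteration 2: Q2 flips wrong -> right
           ["B", "B", "A"]]   # iteration 3: Q1 flips right -> wrong
print(flip_stats(answers, gold))
```

Flips that move between two wrong answers fall into neither bucket, which is one reason the correct and incorrect curves need not sum to the overall flip rate.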
<details>
<summary>x32.png Details</summary>

### Visual Description
## Line Chart: SmolLM2-1.7B - Proportion of Flips Over Iterations
### Overview
The image is a line chart titled "SmolLM2-1.7B". It plots the "Proportion of Flips" against the number of "Iterations" (from 1 to 5) for four distinct data series. The chart compares the performance or behavior of different methods or conditions over a sequence of iterations.
### Components/Axes
* **Chart Title:** SmolLM2-1.7B (Top Center)
* **Y-Axis:**
* **Label:** Proportion of Flips (Left side, vertical)
* **Scale:** Linear, ranging from 0.00 to 0.07, with major tick marks at 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07.
* **X-Axis:**
* **Label:** Iterations (Bottom, horizontal)
* **Scale:** Discrete, with markers at 1, 2, 3, 4, and 5.
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed black line with square markers.
### Detailed Analysis
**Trend Verification & Data Point Extraction:**
1. **Generation (Solid Blue Line):**
* **Trend:** Volatile. Starts low, dips, spikes sharply at iteration 3, then plummets before a slight recovery.
* **Data Points (Approximate):**
* Iteration 1: ~0.018
* Iteration 2: ~0.010
* Iteration 3: ~0.035 (Peak)
* Iteration 4: ~0.002 (Lowest point)
* Iteration 5: ~0.010
2. **Multiple-Choice (Solid Orange Line):**
* **Trend:** Overall downward trend after an initial sharp drop. Starts as the highest value, decreases, has a secondary peak at iteration 3, then declines steadily.
* **Data Points (Approximate):**
* Iteration 1: ~0.065 (Highest value on the chart)
* Iteration 2: ~0.025
* Iteration 3: ~0.040 (Secondary peak)
* Iteration 4: ~0.025
* Iteration 5: ~0.005
3. **Correct Flip (Dashed Blue Line with Circles):**
* **Trend:** Generally decreasing with a small rebound at iteration 4.
* **Data Points (Approximate):**
* Iteration 1: ~0.025
* Iteration 2: ~0.018
* Iteration 3: ~0.009
* Iteration 4: ~0.018
* Iteration 5: ~0.008
4. **Incorrect Flip (Dashed Black Line with Squares):**
* **Trend:** Fluctuates at a low level, generally below the "Correct Flip" line except at iteration 3.
* **Data Points (Approximate):**
* Iteration 1: ~0.018
* Iteration 2: ~0.009
* Iteration 3: ~0.018
* Iteration 4: ~0.009
* Iteration 5: ~0.008
### Key Observations
1. **Dominant Series:** The "Multiple-Choice" method begins with a significantly higher proportion of flips (~0.065) than all other series at iteration 1.
2. **Convergence at Iteration 3:** All four data series show a local peak or notable change at iteration 3. "Generation" reaches its maximum, "Multiple-Choice" has a secondary peak, and "Incorrect Flip" rises to meet "Correct Flip".
3. **Final Convergence:** By iteration 5, the proportion of flips for all four series converges to a very low range, between approximately 0.005 and 0.010.
4. **Correct vs. Incorrect Flips:** The "Correct Flip" proportion is generally higher than or equal to the "Incorrect Flip" proportion across iterations, except at iteration 3 where they are approximately equal (~0.018).
5. **Volatility:** The "Generation" series exhibits the most dramatic volatility, with the largest single-iteration change (a drop from ~0.035 to ~0.002 between iterations 3 and 4).
### Interpretation
This chart likely tracks the stability or error-correction behavior of the "SmolLM2-1.7B" model across iterative refinement steps. "Flips" probably refer to changes in the model's output or predictions between iterations.
* **Method Comparison:** The "Multiple-Choice" approach starts with high instability (many flips) but improves rapidly. The "Generation" approach is highly unstable mid-process (iteration 3) but achieves the lowest flip rate by iteration 4.
* **Process Dynamics:** The synchronized activity at iteration 3 suggests a critical point in the iterative process where all methods undergo significant adjustment. The general downward trend indicates that, regardless of the method, the model's outputs stabilize as iterations progress.
* **Error Quality:** The fact that "Correct Flip" rates are typically higher than "Incorrect Flip" rates suggests that when the model changes its output, it is more often making a correction towards a better answer than introducing a new error. The convergence of all metrics to a low value by iteration 5 implies the process reaches a stable state.
* **Underlying Question:** The data prompts investigation into why the "Generation" method has such a pronounced spike and subsequent crash at iterations 3-4, and whether the "Multiple-Choice" method's high initial flip rate is a necessary cost for its later stability.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x33.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-3B - Proportion of Flips Over Iterations
### Overview
The image is a line chart titled "Qwen2.5-3B". It plots the "Proportion of Flips" against "Iterations" for two distinct methods: "Generation" and "Multiple-Choice". The chart tracks how the proportion of flips changes over five iterations for each method, with line style (solid vs. dashed) indicating whether the flip was "Correct" or "Incorrect".
### Components/Axes
* **Title:** "Qwen2.5-3B" (centered at the top).
* **Y-Axis:**
* **Label:** "Proportion of Flips" (vertical text on the left).
* **Scale:** Linear scale from 0.02 to 0.10, with major tick marks at 0.02, 0.04, 0.06, 0.08, and 0.10.
* **X-Axis:**
* **Label:** "Iterations" (horizontal text at the bottom).
* **Scale:** Discrete integer values from 1 to 5.
* **Legend:** Positioned in the top-right corner of the plot area.
* **Series 1:** "Generation" - Represented by a blue line with circular markers.
* **Series 2:** "Multiple-Choice" - Represented by an orange line with square markers.
* **Line Style Key:**
* "Correct Flip" - Indicated by a solid line segment.
* "Incorrect Flip" - Indicated by a dashed line segment.
* **Data Series:** Two lines, each composed of four segments connecting five data points. The style of each segment (solid/dashed) corresponds to the flip correctness for that interval.
### Detailed Analysis
**1. Generation (Blue Line, Circular Markers):**
* **Trend:** The line shows high volatility. It rises sharply to a peak, then declines steeply before a final partial recovery.
* **Data Points & Segment Analysis:**
* Iteration 1: Value ≈ 0.050.
* Segment 1→2: **Solid line** (Correct Flip). Value rises to a peak at Iteration 2 ≈ 0.080.
* Segment 2→3: **Dashed line** (Incorrect Flip). Value drops sharply to Iteration 3 ≈ 0.025.
* Segment 3→4: **Solid line** (Correct Flip). Value continues to drop to its lowest point at Iteration 4 ≈ 0.010.
* Segment 4→5: **Dashed line** (Incorrect Flip). Value recovers to Iteration 5 ≈ 0.050.
**2. Multiple-Choice (Orange Line, Square Markers):**
* **Trend:** The line shows a more moderate, undulating pattern with a mid-chart dip and a later peak.
* **Data Points & Segment Analysis:**
* Iteration 1: Value ≈ 0.065.
* Segment 1→2: **Solid line** (Correct Flip). Value drops to Iteration 2 ≈ 0.040.
* Segment 2→3: **Dashed line** (Incorrect Flip). Value remains stable at Iteration 3 ≈ 0.040.
* Segment 3→4: **Solid line** (Correct Flip). Value rises to a peak at Iteration 4 ≈ 0.065.
* Segment 4→5: **Dashed line** (Incorrect Flip). Value drops to Iteration 5 ≈ 0.035.
### Key Observations
1. **Volatility Contrast:** The "Generation" method exhibits significantly higher volatility, with a range of approximately 0.070 (from ~0.010 to ~0.080). The "Multiple-Choice" method is more stable, with a range of approximately 0.030 (from ~0.035 to ~0.065).
2. **Peak Timing:** The two methods peak at different iterations. "Generation" peaks early at Iteration 2, while "Multiple-Choice" peaks later at Iteration 4.
3. **Final Convergence:** By Iteration 5, both methods converge to a similar proportion of flips (~0.050 for Generation, ~0.035 for Multiple-Choice), though Generation ends on an upward trend and Multiple-Choice on a downward one.
4. **Flip Correctness Pattern:** For both methods, the segments between iterations alternate between solid (Correct) and dashed (Incorrect). This suggests a pattern where a correct flip is followed by an incorrect one, and vice-versa, across the measured intervals.
### Interpretation
This chart likely evaluates the behavior of the Qwen2.5-3B model under two different prompting or evaluation strategies ("Generation" vs. "Multiple-Choice") over a series of iterative steps. The "Proportion of Flips" probably measures the rate at which the model changes its output or answer between iterations.
* **Method Comparison:** The "Generation" approach appears to induce more dramatic changes in the model's outputs, leading to both higher peaks of change and deeper troughs of stability. This could indicate a more exploratory or unstable process. The "Multiple-Choice" approach results in more measured, consistent changes, suggesting a more constrained or conservative evaluation process.
* **Correctness Oscillation:** The alternating solid/dashed pattern is a critical finding. It implies that for both methods, the model's tendency to make a *correct* flip versus an *incorrect* flip is not random but follows a sequential pattern over these iterations. A correct adjustment is often followed by an incorrect one, and vice-versa, which may point to an underlying oscillatory dynamic in the model's refinement process.
* **Convergence vs. Divergence:** While both methods start and end at somewhat similar levels, their paths are distinct. The final upward tick for "Generation" versus the downward tick for "Multiple-Choice" suggests their long-term trajectories might continue to diverge if iterations were extended. The data does not show a clear convergence to a stable, low flip rate for either method within five iterations.
</details>
(b) Qwen2.5-3B
<details>
<summary>x34.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B - Proportion of Flips Over Iterations
### Overview
This is a line chart titled "Llama-3.1-8B" that plots the "Proportion of Flips" against "Iterations" for four distinct data series. The chart appears to track the performance or behavior of a model (likely a large language model) across five sequential iterations, measuring different types of "flips" or changes.
### Components/Axes
* **Chart Title:** "Llama-3.1-8B" (centered at the top).
* **X-Axis:** Labeled "Iterations". It has five discrete, equally spaced tick marks labeled 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of Flips". The scale is linear, ranging from 0.04 to 0.18, with major tick marks at intervals of 0.02 (0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18).
* **Legend:** Positioned in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line.
4. **Incorrect Flip:** Dashed orange line.
### Detailed Analysis
The following data points are approximate values extracted by visually aligning each line's position at each iteration tick with the y-axis scale.
**1. Generation (Solid Blue Line)**
* **Trend:** Shows a steady, gradual downward trend across all five iterations.
* **Data Points:**
* Iteration 1: ~0.10
* Iteration 2: ~0.09
* Iteration 3: ~0.085
* Iteration 4: ~0.08
* Iteration 5: ~0.05
**2. Multiple-Choice (Solid Orange Line)**
* **Trend:** Starts high, dips at iteration 3, recovers slightly at iteration 4, then drops again. Overall trend is downward.
* **Data Points:**
* Iteration 1: ~0.15
* Iteration 2: ~0.14
* Iteration 3: ~0.12
* Iteration 4: ~0.14
* Iteration 5: ~0.08
**3. Correct Flip (Dashed Blue Line)**
* **Trend:** Follows a broadly similar downward trajectory to the "Generation" line, generally at slightly lower values, with a small rebound at iteration 4.
* **Data Points:**
* Iteration 1: ~0.09
* Iteration 2: ~0.085
* Iteration 3: ~0.07
* Iteration 4: ~0.085
* Iteration 5: ~0.04
**4. Incorrect Flip (Dashed Orange Line)**
* **Trend:** Exhibits the most volatile behavior. It starts high, dips slightly, spikes to the highest point on the chart at iteration 3, then drops sharply.
* **Data Points:**
* Iteration 1: ~0.15
* Iteration 2: ~0.14
* Iteration 3: ~0.15 (Peak value for this series and the entire chart)
* Iteration 4: ~0.07
* Iteration 5: ~0.10
### Key Observations
1. **General Decline:** All four metrics show a lower proportion of flips at iteration 5 compared to iteration 1, suggesting an overall reduction in the measured phenomenon over time.
2. **Volatility of Incorrect Flips:** The "Incorrect Flip" series is an outlier in its pattern. It does not follow a smooth decline, instead showing a significant spike at iteration 3 before falling.
3. **Correlation of Line Styles:** The two dashed lines ("Correct Flip" and "Incorrect Flip") generally show more volatility than their solid-line counterparts ("Generation" and "Multiple-Choice").
4. **Convergence at Iteration 5:** By the final iteration, the values for "Generation" and "Correct Flip" are very close (~0.05 and ~0.04), as are the values for "Multiple-Choice" and "Incorrect Flip" (~0.08 and ~0.10), though the latter pair remains higher.
### Interpretation
This chart likely visualizes the results of an iterative evaluation or training process for the Llama-3.1-8B model. The "Proportion of Flips" could refer to changes in model outputs, such as flipping an answer choice or altering a generated response between iterations.
* **What the data suggests:** The general downward trend indicates that as the process iterates, the model's outputs become more stable (fewer flips). The distinction between "Correct" and "Incorrect" flips implies an evaluation against a ground truth. The spike in "Incorrect Flips" at iteration 3 is a critical anomaly. It suggests that at this specific stage, the model underwent a period of significant but erroneous change, potentially indicating a problematic update or a challenging evaluation batch.
* **How elements relate:** The solid lines ("Generation", "Multiple-Choice") may represent aggregate flip rates for different task types, while the dashed lines break these down into correct vs. incorrect changes. The fact that "Incorrect Flip" can spike independently (as at iteration 3) shows that the volume of changes is not perfectly correlated with the quality of those changes.
* **Notable anomaly:** The iteration 3 spike in "Incorrect Flip" is the most salient feature. An investigator would want to examine what occurred at this iterationāwas there a change in the model's parameters, the evaluation dataset, or the prompting strategy? This point represents a moment of high instability and error introduction. The subsequent drop by iteration 4 suggests a recovery or correction.
</details>
(c) Llama-3.1-8B
<details>
<summary>x35.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B Flip Proportions Over Iterations
### Overview
This is a line chart titled "Qwen2.5-14B" that plots the "Proportion of Flips" against "Iterations" (from 1 to 5). It compares four different metrics or conditions, represented by distinct line styles and colors, showing how their values change over five sequential iterations.
### Components/Axes
* **Title:** "Qwen2.5-14B" (centered at the top).
* **Y-Axis:** Label is "Proportion of Flips". Scale ranges from 0.00 to 0.08, with major tick marks at 0.00, 0.02, 0.04, 0.06, and 0.08.
* **X-Axis:** Label is "Iterations". Discrete tick marks at integer values 1, 2, 3, 4, and 5.
* **Legend:** Located in the top-left corner of the plot area. It defines four data series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Dashed orange line.
3. **Correct Flip:** Dotted green line with circular markers.
4. **Incorrect Flip:** Dash-dot black line with square markers.
### Detailed Analysis
The chart tracks the proportion of "flips" (likely a change in model output or decision) across five iterations for four categories. All series show a general downward trend, converging toward zero by iteration 5.
**1. Generation (Solid Blue Line):**
* **Trend:** Starts highest, experiences a sharp drop, plateaus, then plummets to near zero before a slight final rise.
* **Data Points (Approximate):**
* Iteration 1: ~0.078
* Iteration 2: ~0.042
* Iteration 3: ~0.042 (plateau)
* Iteration 4: ~0.000 (sharp drop)
* Iteration 5: ~0.010
**2. Multiple-Choice (Dashed Orange Line):**
* **Trend:** Shows a steady, near-linear decline from the second-highest starting point.
* **Data Points (Approximate):**
* Iteration 1: ~0.060
* Iteration 2: ~0.025
* Iteration 3: ~0.015
* Iteration 4: ~0.000
* Iteration 5: ~0.010
**3. Correct Flip (Dotted Green Line with Circles):**
* **Trend:** Declines steadily from a moderate starting point.
* **Data Points (Approximate):**
* Iteration 1: ~0.040
* Iteration 2: ~0.020
* Iteration 3: ~0.010
* Iteration 4: ~0.000
* Iteration 5: ~0.010
**4. Incorrect Flip (Dash-Dot Black Line with Squares):**
* **Trend:** Follows a path very similar to "Correct Flip," declining steadily.
* **Data Points (Approximate):**
* Iteration 1: ~0.040
* Iteration 2: ~0.020
* Iteration 3: ~0.010
* Iteration 4: ~0.000
* Iteration 5: ~0.010
### Key Observations
1. **Convergence:** All four metrics converge to a very low proportion (approximately 0.00 to 0.01) by Iteration 5.
2. **Initial Hierarchy:** At Iteration 1, the "Generation" condition has the highest flip proportion, followed by "Multiple-Choice," with "Correct Flip" and "Incorrect Flip" tied at the lowest starting point.
3. **Dramatic Drop in Generation:** The "Generation" series exhibits the most volatile behavior, with a significant plateau between iterations 2 and 3 followed by a near-total collapse at iteration 4.
4. **Similar Trajectories for Flip Types:** The "Correct Flip" and "Incorrect Flip" series are nearly identical in value and trend throughout all iterations, suggesting the proportion of flips does not distinguish between correct and incorrect outcomes in this experiment.
5. **Iteration 4 Minimum:** Three of the four series ("Generation," "Multiple-Choice," "Correct/Incorrect Flip") reach their minimum value (≈0.00) at Iteration 4.
### Interpretation
The data suggests that for the Qwen2.5-14B model under the tested conditions, the tendency to "flip" its output or decision decreases substantially with repeated iterations. This could indicate a stabilization of the model's responses or a reduction in uncertainty as it processes the same task multiple times.
The stark difference between the "Generation" line and the others implies that the flip behavior is highly dependent on the task or prompting method. The "Generation" task starts with high instability but achieves near-perfect stability (zero flips) by iteration 4, albeit with a minor rebound. The "Multiple-Choice" task shows a more predictable, gradual stabilization.
The most notable finding is the indistinguishable behavior of "Correct Flip" and "Incorrect Flip." This implies that the model's flips are not biased toward correctness; they occur at the same rate regardless of whether the flip leads to a correct or incorrect final answer. This could point to a random or systematic noise factor in the flipping mechanism rather than a targeted correction process.
Overall, the chart demonstrates that iterative processing reduces output volatility for this model, but the path to stability varies significantly by task type.
</details>
(d) Qwen2.5-14B
<details>
<summary>x36.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B
### Overview
This is a line chart comparing the "Proportion of Flips" across five iterations for two different methods: "Generation" and "Multiple-Choice." Each method is further broken down into "Correct Flip" and "Incorrect Flip" categories, represented by solid and dashed lines, respectively. The chart shows significant volatility in the flip proportions for both methods over the measured iterations.
### Components/Axes
* **Chart Title:** "DeepSeek-R1-Distill-Llama-8B" (centered at the top).
* **X-Axis:** Labeled "Iterations." It has five discrete markers: 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of Flips." The scale ranges from 0.01 to 0.06, with major tick marks at 0.01, 0.02, 0.03, 0.04, 0.05, and 0.06.
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
* **Generation (Blue):**
* Solid blue line: "Correct Flip"
* Dashed blue line: "Incorrect Flip"
* **Multiple-Choice (Orange):**
* Solid orange line: "Correct Flip"
* Dashed orange line: "Incorrect Flip"
### Detailed Analysis
**1. Generation (Blue Lines)**
* **Correct Flip (Solid Blue):** The trend is a sharp decline followed by a partial recovery.
* Iteration 1: ~0.042
* Iteration 2: ~0.042 (plateau)
* Iteration 3: ~0.015 (sharp drop)
* Iteration 4: ~0.033 (recovery)
* Iteration 5: ~0.025 (slight decline)
* **Incorrect Flip (Dashed Blue):** The trend is highly volatile, with a major peak at iteration 3.
* Iteration 1: ~0.025
* Iteration 2: ~0.015 (drop)
* Iteration 3: ~0.042 (peak)
* Iteration 4: ~0.015 (drop)
* Iteration 5: ~0.033 (rise)
**2. Multiple-Choice (Orange Lines)**
* **Correct Flip (Solid Orange):** The trend shows extreme volatility, with the highest peak on the chart.
* Iteration 1: ~0.042
* Iteration 2: ~0.008 (sharp drop, lowest point on chart)
* Iteration 3: ~0.060 (peak, highest point on chart)
* Iteration 4: ~0.015 (sharp drop)
* Iteration 5: ~0.025 (rise)
* **Incorrect Flip (Dashed Orange):** The trend mirrors the "Generation - Incorrect Flip" line closely.
* Iteration 1: ~0.042
* Iteration 2: ~0.015 (drop)
* Iteration 3: ~0.042 (peak)
* Iteration 4: ~0.015 (drop)
* Iteration 5: ~0.033 (rise)
### Key Observations
1. **Synchronized Peak at Iteration 3:** All four data series show a significant local peak or trough at iteration 3. The "Multiple-Choice - Correct Flip" reaches the chart's maximum value (~0.06), while the "Generation - Correct Flip" reaches its minimum (~0.015).
2. **Convergence at Start and End:** At iteration 1, both "Correct Flip" lines start at the same value (~0.042). At iteration 5, three of the four lines (all except "Generation - Correct Flip") converge at approximately 0.025-0.033.
3. **High Volatility:** The "Multiple-Choice - Correct Flip" series exhibits the most extreme swing, from ~0.008 to ~0.060 within two iterations.
4. **Correlation of Incorrect Flips:** The "Incorrect Flip" lines for both Generation and Multiple-Choice follow nearly identical paths, suggesting the rate of incorrect flips may be independent of the method used.
### Interpretation
The chart demonstrates that the "Proportion of Flips" for the DeepSeek-R1-Distill-Llama-8B model is highly sensitive to the iteration step, showing no stable trend. The dramatic spike in "Correct Flips" for the Multiple-Choice method at iteration 3 suggests a specific condition or event at that stage that significantly increased the model's propensity to change its answer correctly. Conversely, the same iteration saw a collapse in correct flips for the Generation method, indicating a divergent response between the two approaches.
The near-identical behavior of the "Incorrect Flip" lines implies that the underlying mechanism or error rate leading to incorrect answer changes is consistent across both methods. The overall pattern suggests an unstable training or evaluation process where performance metrics fluctuate widely between steps, making it difficult to ascertain a clear improvement trajectory from this data alone. The convergence of values at the final iteration might indicate a return to a baseline state after a period of high instability.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x37.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Flip Proportions Over Iterations
### Overview
This is a line chart titled "Gemini-2.0-Flash" that plots the "Proportion of Flips" against "Iterations" (1 through 5). It compares four distinct data series, differentiated by line style, color, and marker shape, representing different conditions or outcomes related to a "flip" event.
### Components/Axes
* **Chart Title:** "Gemini-2.0-Flash" (centered at the top).
* **X-Axis:** Labeled "Iterations". It has five discrete, equally spaced tick marks labeled 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of Flips". The scale runs from 0.00 to 0.10, with major tick marks at intervals of 0.02 (0.00, 0.02, 0.04, 0.06, 0.08, 0.10).
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers (●).
4. **Incorrect Flip:** Dashed orange line with square markers (■).
### Detailed Analysis
The following data points are approximate values extracted from the chart. The trend for each series is described first, followed by the estimated values per iteration.
**1. Generation (Solid Blue Line)**
* **Trend:** Shows high variability. It starts high, dips sharply at iteration 2, peaks at iteration 3, then declines before a slight rise at iteration 5.
* **Data Points (Approximate):**
* Iteration 1: 0.078
* Iteration 2: 0.025
* Iteration 3: 0.082
* Iteration 4: 0.045
* Iteration 5: 0.050
**2. Multiple-Choice (Solid Orange Line)**
* **Trend:** Shows a general downward trend with a small peak at iteration 3. It is consistently lower than the "Generation" line except at iteration 2.
* **Data Points (Approximate):**
* Iteration 1: 0.055
* Iteration 2: 0.018
* Iteration 3: 0.050
* Iteration 4: 0.010
* Iteration 5: 0.025
**3. Correct Flip (Dashed Blue Line with ● Markers)**
* **Trend:** Broadly tracks the "Generation" line but diverges mid-chart: it stays high through iteration 2, drops sharply to its minimum at iteration 3, rebounds at iteration 4, then declines.
* **Data Points (Approximate):**
* Iteration 1: 0.075
* Iteration 2: 0.078
* Iteration 3: 0.010
* Iteration 4: 0.058
* Iteration 5: 0.035
**4. Incorrect Flip (Dashed Orange Line with ■ Markers)**
* **Trend:** Remains relatively low and stable across all iterations, with a slight peak at iteration 3. It is consistently the lowest or second-lowest series.
* **Data Points (Approximate):**
* Iteration 1: 0.048
* Iteration 2: 0.020
* Iteration 3: 0.030
* Iteration 4: 0.018
* Iteration 5: 0.015
### Key Observations
1. **Iteration 3 is a Critical Point:** All four series show a significant change at iteration 3. "Generation" reaches its maximum value while "Correct Flip" falls to its minimum; "Multiple-Choice" shows its only notable peak, and "Incorrect Flip" also has a minor peak here.
2. **Partial Correlation:** The "Generation" (solid blue) and "Correct Flip" (dashed blue) lines track each other at iterations 1, 4, and 5, but move in opposite directions across iterations 2 and 3.
3. **Divergence at Iteration 2:** At iteration 2, the "Correct Flip" proportion remains high (~0.078) while the "Generation" proportion drops sharply (~0.025). This is the most significant point of divergence between these two correlated series.
4. **Low Incorrect Flip Rate:** The "Incorrect Flip" series is consistently the lowest or near-lowest, indicating that the proportion of incorrect flips is generally small compared to other measured proportions.
### Interpretation
The chart appears to analyze the performance or behavior of a system named "Gemini-2.0-Flash" over five iterative steps. The data suggests a comparison between two primary methods or conditions: "Generation" and "Multiple-Choice."
* The "Generation" method exhibits higher volatility and achieves higher peak proportions of flips, particularly at iteration 3. Its strong correlation with the "Correct Flip" series implies that a significant portion of the flips occurring under the "Generation" condition are correct.
* The "Multiple-Choice" method shows a more stable, generally declining trend, suggesting it may be a more conservative or consistent approach that yields fewer flips overall.
* The "Correct Flip" and "Incorrect Flip" series likely represent a breakdown of the outcomes from the "Generation" and/or "Multiple-Choice" processes. The fact that "Correct Flip" peaks dramatically at iteration 3 while "Incorrect Flip" only has a minor peak suggests that iteration 3 was a particularly successful step for generating correct outcomes, not just more outcomes in general.
* The anomaly at iteration 2āwhere "Correct Flip" stays high but "Generation" dropsācould indicate that while the overall generation rate fell, the *accuracy* (proportion of correct flips within that generation) was exceptionally high for that step.
In summary, the data demonstrates that the "Generation" approach is more dynamic and can lead to high rates of correct flips, especially at a specific iteration (3), while the "Multiple-Choice" approach is more subdued. The system's performance is not linear across iterations, with iteration 3 being a clear focal point of activity.
</details>
(f) Gemini-2.0-Flash
Figure 7: Models Correct and Incorrect Flips on CoT on DisambiguationQA
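A recurring pattern across these panels is stabilization: in most models the flip rate decays toward zero by iteration 5, while a few (e.g., the Multiple-Choice spike in panel f of Figure 6) rebound late. That behavior can be checked programmatically; the sketch below uses hypothetical helper names (`flip_proportion`, `converged`) that are not from the paper's codebase:

```python
def flip_proportion(prev, curr):
    """Fraction of questions whose answer changed between two consecutive
    iterations, regardless of correctness (the overall flip rate)."""
    if len(prev) != len(curr):
        raise ValueError("iterations must cover the same question set")
    return sum(p != c for p, c in zip(prev, curr)) / len(prev)

def converged(flip_rates, tol=0.01, window=2):
    """Heuristic stability check: the last `window` transitions all have
    flip rates at or below `tol` (the near-zero tails seen by iteration 5)."""
    tail = flip_rates[-window:]
    return len(tail) == window and all(r <= tol for r in tail)

# Toy run: answers for 4 questions over 4 iterations
iters = [["A", "B", "C", "D"],
         ["A", "B", "C", "A"],   # one flip
         ["A", "B", "C", "A"],   # no flips
         ["A", "B", "C", "A"]]   # no flips
rates = [flip_proportion(p, c) for p, c in zip(iters, iters[1:])]
print(rates)             # [0.25, 0.0, 0.0]
print(converged(rates))  # True
```

A late rebound like the one in panel (f) would make `converged` return False even after an earlier quiet stretch, which is exactly the failure mode the charts flag.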
<details>
<summary>x38.png Details</summary>

### Visual Description
## Line Chart: SmolLM2-1.7B - Proportion of Flips Over Iterations
### Overview
This is a line chart titled "SmolLM2-1.7B" that plots the "Proportion of Flips" against the number of "Iterations" (from 1 to 5). It compares four different data series, distinguished by color and line style, as defined in a legend located in the top-left corner of the chart area. The chart appears to track a performance or behavioral metric of a model named SmolLM2-1.7B across sequential iterations.
### Components/Axes
* **Title:** "SmolLM2-1.7B" (centered at the top).
* **Y-Axis:**
* **Label:** "Proportion of Flips" (rotated vertically on the left).
* **Scale:** Linear scale from 0.00 to 0.04, with major tick marks at 0.00, 0.01, 0.02, 0.03, and 0.04.
* **X-Axis:**
* **Label:** "Iterations" (centered at the bottom).
* **Scale:** Discrete integer scale from 1 to 5, with major tick marks at each integer.
* **Legend:** Positioned in the top-left quadrant of the chart area. It contains four entries:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed orange line.
4. **Incorrect Flip:** Dashed black line.
### Detailed Analysis
The chart displays the following approximate data points for each series across the five iterations:
**1. Generation (Solid Blue Line):**
* **Trend:** Starts low, drops to zero, and remains flat.
* **Data Points:**
* Iteration 1: ~0.008
* Iteration 2: 0.00
* Iteration 3: 0.00
* Iteration 4: 0.00
* Iteration 5: 0.00
**2. Multiple-Choice (Solid Orange Line):**
* **Trend:** Starts high, decreases sharply to zero, then shows a partial recovery before plateauing.
* **Data Points:**
* Iteration 1: ~0.033
* Iteration 2: ~0.017
* Iteration 3: 0.00
* Iteration 4: ~0.008
* Iteration 5: ~0.008
**3. Correct Flip (Dashed Orange Line):**
* **Trend:** Follows a similar initial downward trend to the "Multiple-Choice" line, tracks its recovery at iteration 4, then diverges at iteration 5, ending at zero.
* **Data Points:**
* Iteration 1: ~0.033 (appears to start at the same point as Multiple-Choice)
* Iteration 2: ~0.017 (appears to track with Multiple-Choice)
* Iteration 3: 0.00
* Iteration 4: ~0.008 (appears to track with Multiple-Choice)
* Iteration 5: 0.00
**4. Incorrect Flip (Dashed Black Line):**
* **Trend:** Remains constant at zero throughout all iterations.
* **Data Points:**
* Iterations 1-5: 0.00
### Key Observations
1. **Dominant Initial Value:** The "Multiple-Choice" and "Correct Flip" series start with the highest proportion of flips (~0.033) at iteration 1.
2. **Convergence to Zero:** Both the "Generation" and "Multiple-Choice"/"Correct Flip" series experience a significant drop, reaching 0.00 by iteration 2 and 3, respectively.
3. **Partial Recovery:** The "Multiple-Choice" series shows a distinct recovery from 0.00 at iteration 3 to ~0.008 at iteration 4, where it stabilizes. The "Correct Flip" series tracks this recovery at iteration 4 but falls back to zero at iteration 5.
4. **Zero Baseline:** The "Incorrect Flip" series shows no activity (0.00) across all measured iterations.
5. **Line Style Correlation:** The solid and dashed orange lines ("Multiple-Choice" and "Correct Flip") are perfectly correlated for the first four data points (iterations 1-4) but diverge at the final point (iteration 5).
### Interpretation
The chart suggests a process where the model's behavior, measured by the "Proportion of Flips," changes significantly over the first few iterations. The high initial values for "Multiple-Choice" and "Correct Flip" indicate a period of volatility or adjustment. The subsequent drop to zero implies a stabilization phase.
The key insight lies in the divergence at iteration 5: while the overall "Multiple-Choice" proportion remains elevated, the "Correct Flip" proportion drops back to zero. This could indicate that the flips occurring in later iterations (4 and 5) are no longer classified as "Correct Flips" according to the chart's definition, or that a different mechanism is sustaining the "Multiple-Choice" flip rate. The flat "Incorrect Flip" line suggests that the observed flips are not categorized as incorrect within this framework. The "Generation" series stabilizes almost immediately, implying that this particular task or metric reaches a steady state very quickly compared to the "Multiple-Choice" task.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x39.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-3B
### Overview
The image displays a line chart titled "Qwen2.5-3B," plotting the "Proportion of Flips" against "Iterations" for four distinct data series. The chart compares the performance or behavior of different methods or conditions over a sequence of five iterations.
### Components/Axes
* **Chart Title:** Qwen2.5-3B (positioned at the top center).
* **X-Axis:** Labeled "Iterations." It has discrete markers at integer values from 1 to 5.
* **Y-Axis:** Labeled "Proportion of Flips." The scale ranges from 0.00 to 0.14, with major tick marks at intervals of 0.02 (0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14).
* **Legend:** Positioned in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed orange line with square markers.
### Detailed Analysis
**Data Series and Approximate Values:**
1. **Generation (Solid Blue Line):**
* Trend: Overall downward trend with a slight recovery at the end.
* Iteration 1: ~0.08
* Iteration 2: ~0.04
* Iteration 3: ~0.00
* Iteration 4: ~0.04
* Iteration 5: ~0.02
2. **Multiple-Choice (Solid Orange Line):**
* Trend: Fluctuates, with a notable peak at iteration 3.
* Iteration 1: ~0.10
* Iteration 2: ~0.06
* Iteration 3: ~0.10
* Iteration 4: ~0.04
* Iteration 5: ~0.05
3. **Correct Flip (Dashed Blue Line with Circles):**
* Trend: Decreases initially, rises slightly, drops to zero, then recovers.
* Iteration 1: ~0.09
* Iteration 2: ~0.04
* Iteration 3: ~0.05
* Iteration 4: ~0.00
* Iteration 5: ~0.03
4. **Incorrect Flip (Dashed Orange Line with Squares):**
* Trend: Rises to a peak at iteration 3, then declines.
* Iteration 1: ~0.08
* Iteration 2: ~0.06
* Iteration 3: ~0.09
* Iteration 4: ~0.04
* Iteration 5: ~0.04
### Key Observations
* **Peak at Iteration 3:** Both the "Multiple-Choice" and "Incorrect Flip" series reach their highest values at iteration 3.
* **Zero Value:** The "Correct Flip" series drops to exactly 0.00 at iteration 4.
* **Convergence at Iteration 4:** At iteration 4, three of the four series ("Generation," "Multiple-Choice," and "Incorrect Flip") converge at approximately 0.04.
* **Divergence at Start:** At iteration 1, the "Multiple-Choice" proportion (~0.10) is the highest, while the "Generation" and "Incorrect Flip" proportions (~0.08) are the lowest among the four series.
* **Final Values:** By iteration 5, all series have proportion values between 0.02 and 0.05, showing less spread than at earlier iterations.
### Interpretation
The chart appears to track the rate of "flips" (likely a change in model output, prediction, or state) across iterations for a model or system named Qwen2.5-3B. The data suggests:
1. **Method Comparison:** The "Multiple-Choice" condition generally exhibits a higher proportion of flips than the "Generation" condition, especially in the first three iterations. This could indicate that the multiple-choice setting is more volatile or responsive to iterative changes.
2. **Flip Quality:** The "Correct Flip" and "Incorrect Flip" series likely break down the "flips" into successful and unsuccessful changes. The peak in "Incorrect Flip" at iteration 3 coinciding with the peak in "Multiple-Choice" suggests that the increased activity in the multiple-choice method at that stage may have led to a higher rate of erroneous changes.
3. **Process Dynamics:** The system shows significant activity in the early iterations (1-3), with a notable event or shift occurring at iteration 3 (peaks) and iteration 4 (convergence and a zero correct flip rate). This pattern could reflect a learning curve, an optimization process, or a phased evaluation where the system stabilizes or encounters a specific challenge in the middle iterations.
4. **Overall Trend:** Despite the fluctuations, the general trend for all series is toward lower flip proportions by iteration 5, which may indicate the system is reaching a more stable state or that the potential for change diminishes over time.
</details>
(b) Qwen2.5-3B
<details>
<summary>x40.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B - Proportion of Flips Over Iterations
### Overview
The image is a line chart titled "Llama-3.1-8B," plotting the "Proportion of flips" against "Iterations" for four distinct data series. The chart compares the performance or behavior of different methods or conditions over five sequential iterations.
### Components/Axes
* **Chart Title:** "Llama-3.1-8B" (centered at the top).
* **Y-Axis:** Labeled "Proportion of flips." The scale runs from 0.02 to 0.14, with major tick marks at intervals of 0.02 (0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14).
* **X-Axis:** Labeled "Iterations." The scale shows discrete integer values from 1 to 5.
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line with circular markers.
2. **Multiple-Choice:** Solid orange line with circular markers.
3. **Correct Flip:** Dashed black line with square markers.
4. **Incorrect Flip:** Dashed black line with diamond markers.
### Detailed Analysis
**Data Series and Approximate Values:**
1. **Generation (Blue, Solid Line, Circles):**
* **Trend:** Highly volatile. Starts high, dips, spikes to the chart's maximum, then declines sharply before a slight recovery.
* **Data Points (Iterations 1-5):** ~0.11, ~0.09, ~0.14, ~0.05, ~0.07.
2. **Multiple-Choice (Orange, Solid Line, Circles):**
* **Trend:** Starts low, rises to a peak, then gradually declines with a slight uptick at the end.
* **Data Points (Iterations 1-5):** ~0.02, ~0.06, ~0.05, ~0.03, ~0.04.
3. **Correct Flip (Black, Dashed Line, Squares):**
* **Trend:** Shows a general downward trend after an initial plateau, with a notable dip at iteration 4.
* **Data Points (Iterations 1-5):** ~0.11, ~0.11, ~0.09, ~0.05, ~0.08.
4. **Incorrect Flip (Black, Dashed Line, Diamonds):**
* **Trend:** Relatively stable with minor fluctuations, ending slightly lower than it started.
* **Data Points (Iterations 1-5):** ~0.09, ~0.06, ~0.05, ~0.05, ~0.06.
### Key Observations
* **Peak Value:** The highest recorded proportion of flips is approximately 0.14, achieved by the "Generation" series at Iteration 3.
* **Convergence at Iteration 4:** At Iteration 4, three of the four series ("Generation," "Correct Flip," and "Incorrect Flip") converge at a low point around 0.05.
* **Divergence at Iteration 3:** Iteration 3 shows the greatest spread between series, with "Generation" at its peak (~0.14) and "Incorrect Flip" at its lowest (~0.05).
* **Relative Performance:** The "Correct Flip" proportion is consistently equal to or higher than the "Incorrect Flip" proportion across all iterations.
* **Method Comparison:** The "Generation" method exhibits the most extreme fluctuations, while the "Multiple-Choice" method shows a more moderate, bell-shaped trend.
### Interpretation
This chart likely visualizes the results of an experiment testing different prompting or decoding strategies ("Generation" vs. "Multiple-Choice") for a large language model (Llama-3.1-8B) over multiple trials or refinement steps ("Iterations"). The "Proportion of flips" metric could refer to changes in model output, such as flipping a previous answer or changing a generated token.
The data suggests that the "Generation" strategy is highly sensitive to the iteration step, producing a dramatic spike in "flips" at iteration 3 before settling. In contrast, the "Multiple-Choice" strategy induces a more controlled response pattern. The tracking of "Correct" vs. "Incorrect" flips provides a quality measure; the fact that correct flips outnumber or equal incorrect ones indicates the model's changes are, on balance, moving towards more accurate outputs. The convergence at iteration 4 might indicate a stabilization point for the model's behavior under these test conditions. The overall downward trend in "Correct Flip" from its initial value could imply that after several iterations, the model's outputs require fewer major corrections.
</details>
(c) Llama-3.1-8B
<details>
<summary>x41.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B
### Overview
This is a line chart titled "Qwen2.5-14B" that plots the "Proportion of Flips" against the number of "Iterations" (from 1 to 5). It compares four different metrics or conditions, distinguished by line color and style. The chart appears to track the frequency of a "flip" event across sequential iterations for different evaluation methods or categories.
### Components/Axes
* **Title:** "Qwen2.5-14B" (located at the top center).
* **X-Axis:** Labeled "Iterations". It has discrete tick marks at integer values: 1, 2, 3, 4, 5.
* **Y-Axis:** Labeled "Proportion of Flips". It has a linear scale ranging from 0.00 to 0.05, with major tick marks at intervals of 0.01 (0.00, 0.01, 0.02, 0.03, 0.04, 0.05).
* **Legend:** Positioned in the top-right corner of the plot area. It defines four data series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line.
4. **Incorrect Flip:** Dashed black line with square markers.
### Detailed Analysis
The chart displays the following trends and approximate data points for each series across the five iterations:
1. **Generation (Solid Blue Line):**
* **Trend:** Starts high, drops sharply, plateaus, then drops to zero.
* **Data Points (Approx.):**
* Iteration 1: ~0.042
* Iteration 2: ~0.025
* Iteration 3: ~0.025
* Iteration 4: 0.00
* Iteration 5: 0.00
2. **Multiple-Choice (Solid Orange Line):**
* **Trend:** Starts low, rises, drops to zero, stays at zero, then rises again.
* **Data Points (Approx.):**
* Iteration 1: ~0.008
* Iteration 2: ~0.017
* Iteration 3: 0.00
* Iteration 4: 0.00
* Iteration 5: ~0.024
3. **Correct Flip (Dashed Blue Line):**
* **Trend:** Follows a pattern very similar to the "Generation" line but with slightly lower values at the start.
* **Data Points (Approx.):**
* Iteration 1: ~0.038
* Iteration 2: ~0.017
* Iteration 3: ~0.017
* Iteration 4: 0.00
* Iteration 5: 0.00
4. **Incorrect Flip (Dashed Black Line with Squares):**
* **Trend:** Remains very low and near zero throughout, with a minor peak at iteration 2.
* **Data Points (Approx.):**
* Iteration 1: ~0.008
* Iteration 2: ~0.017
* Iteration 3: ~0.008
* Iteration 4: 0.00
* Iteration 5: ~0.008
### Key Observations
* **Convergence to Zero:** Both the "Generation" and "Correct Flip" series drop to a proportion of 0.00 by iteration 4 and remain there at iteration 5.
* **Divergence at Iteration 5:** The "Multiple-Choice" series shows a distinct resurgence at iteration 5 (~0.024), while the "Generation" and "Correct Flip" series remain at zero.
* **Correlation:** The "Correct Flip" (dashed blue) line closely mirrors the shape and timing of the "Generation" (solid blue) line, suggesting a strong relationship between these two metrics.
* **Low Error Rate:** The "Incorrect Flip" series remains consistently low, never exceeding ~0.017, indicating that the majority of "flips" tracked are likely "correct" ones.
* **Peak Values:** The highest recorded proportion is for "Generation" at iteration 1 (~0.042). The lowest non-zero values are around 0.008.
### Interpretation
This chart likely visualizes the behavior of a large language model (Qwen2.5-14B) during an iterative process, such as self-correction, refinement, or multi-step reasoning. The "Proportion of Flips" probably refers to the rate at which the model changes its output or answer between steps.
* **Process Efficiency:** The rapid decline of the "Generation" and "Correct Flip" proportions to zero suggests the model's outputs stabilize quickly, with meaningful changes ("flips") ceasing after 3-4 iterations.
* **Method Comparison:** The "Multiple-Choice" condition behaves differently, showing a late-stage increase in flip proportion. This could indicate that for multiple-choice tasks, the model continues to reconsider or change its answers even in later iterations, unlike in the general "Generation" task.
* **Accuracy Indicator:** The close alignment of "Correct Flip" with "Generation" and the consistently low "Incorrect Flip" rate implies that when the model does change its output, it is predominantly making a correction toward a better answer, rather than introducing errors.
* **Underlying Mechanism:** The data suggests an underlying process where initial iterations involve significant revision (high flip rate), which then converges to a stable state. The exception for "Multiple-Choice" at iteration 5 might point to a specific challenge or characteristic of that task format that prevents early stabilization.
</details>
(d) Qwen2.5-14B
<details>
<summary>x42.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B Performance Metrics
### Overview
The image is a line chart displaying the performance of a model named "DeepSeek-R1-Distill-Llama-8B" over five training or evaluation iterations. It tracks four distinct metrics related to the model's behavior, specifically concerning "flips" (likely changes in output or decisions). The chart uses two primary colors (blue and orange) and two line styles (solid and dashed) to differentiate the data series.
### Components/Axes
* **Chart Title:** "DeepSeek-R1-Distill-Llama-8B" (centered at the top).
* **X-Axis:** Labeled "Iterations". It has five discrete, equally spaced tick marks labeled 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of flips". The scale is linear, ranging from 0.00 to 0.12, with major tick marks at intervals of 0.02 (0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12).
* **Legend:** Positioned in the top-right corner of the plot area. It contains four entries:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed orange line with square markers.
### Detailed Analysis
The following data points are approximate values extracted by visually aligning each line's position with the y-axis scale.
**1. Generation (Solid Blue Line)**
* **Trend:** The line shows a fluctuating but overall upward trend. It starts low and flat, rises at iteration 3, dips at iteration 4, then climbs to its highest point at the final iteration.
* **Data Points:**
* Iteration 1: ~0.015
* Iteration 2: ~0.015
* Iteration 3: ~0.025
* Iteration 4: ~0.010
* Iteration 5: ~0.045
**2. Multiple-Choice (Solid Orange Line)**
* **Trend:** This line exhibits high variability. It rises to a peak at iteration 3, experiences a sharp drop at iteration 4, and then recovers partially at iteration 5.
* **Data Points:**
* Iteration 1: ~0.075
* Iteration 2: ~0.085
* Iteration 3: ~0.100
* Iteration 4: ~0.045
* Iteration 5: ~0.080
**3. Correct Flip (Dashed Blue Line with Circles)**
* **Trend:** This series shows a general downward trend over the five iterations, with a minor recovery at iteration 4 before falling to its lowest point.
* **Data Points:**
* Iteration 1: ~0.035
* Iteration 2: ~0.025
* Iteration 3: ~0.010
* Iteration 4: ~0.025
* Iteration 5: ~0.005
**4. Incorrect Flip (Dashed Orange Line with Squares)**
* **Trend:** The trend for this metric closely mirrors the "Multiple-Choice" line. It rises to a peak at iteration 3, drops sharply at iteration 4, and rises again at iteration 5.
* **Data Points:**
* Iteration 1: ~0.055
* Iteration 2: ~0.070
* Iteration 3: ~0.100
* Iteration 4: ~0.045
* Iteration 5: ~0.080
### Key Observations
1. **Strong Correlation:** The "Multiple-Choice" (solid orange) and "Incorrect Flip" (dashed orange) lines are nearly identical in value and shape across all iterations. This suggests that the proportion of flips in the multiple-choice task is almost entirely composed of incorrect flips.
2. **Divergent Paths:** The "Generation" (solid blue) and "Correct Flip" (dashed blue) lines move in opposite directions. As the proportion of flips in generation tasks increases over time, the proportion of correct flips decreases.
3. **Significant Event at Iteration 4:** There is a pronounced dip in the "Multiple-Choice" and "Incorrect Flip" metrics at iteration 4, where both drop to approximately 0.045. This is the only point where the "Generation" metric (~0.010) is lower than the "Multiple-Choice" metric.
4. **Final State:** By iteration 5, the "Generation" flip proportion (~0.045) is at its highest, while the "Correct Flip" proportion (~0.005) is at its lowest. The "Multiple-Choice" and "Incorrect Flip" proportions have rebounded to ~0.080.
### Interpretation
This chart likely visualizes the behavior of a language model during a reinforcement learning or iterative fine-tuning process. "Flips" probably refer to instances where the model changes its initial output or answer.
* **Task-Specific Behavior:** The model behaves very differently between "Generation" (open-ended text generation) and "Multiple-Choice" tasks. The high and correlated values for "Multiple-Choice" and "Incorrect Flip" indicate that when the model changes its answer in a multiple-choice setting, it is almost always changing from a correct to an incorrect answer. This could signal instability or over-correction in that task format.
* **Learning Dynamics:** The decreasing trend in "Correct Flip" suggests the model is becoming less likely to make beneficial self-corrections over time. Conversely, the rising "Generation" flip proportion might indicate increasing exploration or volatility in its generative outputs.
* **The Iteration 4 Anomaly:** The sharp, synchronized drop in multiple-choice flips at iteration 4 is a critical event. It could represent a change in training parameters, a data batch effect, or a point where the model temporarily stabilized its multiple-choice responses before diverging again. Investigating what occurred at this iteration would be key to understanding the training process.
* **Overall Implication:** The data suggests the model's refinement process is not uniformly improving all metrics. While it may be generating more varied outputs (higher generation flips), its ability to make correct self-assessments, particularly in structured tasks like multiple-choice, is deteriorating. This highlights a potential trade-off or misalignment in the optimization objective.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
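The interpretations above repeatedly weigh correct flips against incorrect flips. The net effect on accuracy follows directly: flips that fix a wrong answer add to accuracy one-for-one, flips that break a right answer subtract one-for-one, and wrong-to-wrong flips change nothing. A minimal sketch with a hypothetical `accuracy_after_flips` helper (an assumption for illustration, not the paper's code):

```python
def accuracy_after_flips(acc_before: float,
                         correct_flip_prop: float,
                         incorrect_flip_prop: float) -> float:
    """Accuracy after one refinement round, given flip proportions
    measured over the whole evaluation set. Correct flips (wrong -> right)
    raise accuracy; incorrect flips (right -> wrong) lower it; flips
    between two wrong options cancel out and need not appear here.
    """
    return acc_before + correct_flip_prop - incorrect_flip_prop

# Example: 60% accuracy, 5% of items fixed, 2% broken in one round.
new_acc = accuracy_after_flips(0.60, 0.05, 0.02)
```

This is why a model like DeepSeek-R1-Distill-Llama-8B, whose multiple-choice flips are dominated by incorrect flips, loses accuracy with further iterations even though it remains "active".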
<details>
<summary>x43.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash - Flip Proportions Over Iterations
### Overview
The image is a line chart titled "Gemini-2.0-Flash" that plots the "Proportion of Flips" against "Iterations" for two different methods: "Generation" and "Multiple-Choice". The chart tracks these proportions over five discrete iterations. A legend indicates that data points are marked as either "Correct Flip" (circle) or "Incorrect Flip" (square).
### Components/Axes
* **Title:** "Gemini-2.0-Flash" (centered at the top).
* **Y-Axis (Vertical):**
* **Label:** "Proportion of Flips".
* **Scale:** Linear scale from 0.00 to 0.07, with major tick marks at 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, and 0.07.
* **X-Axis (Horizontal):**
* **Label:** "Iterations".
* **Scale:** Discrete integer scale from 1 to 5.
* **Legend (Top-Left Corner):**
* **Blue Line:** Labeled "Generation".
* **Orange Line:** Labeled "Multiple-Choice".
* **Marker Key:**
* Circle (●): "Correct Flip".
* Square (■): "Incorrect Flip".
* **Data Series:**
1. **Generation (Blue Line):** A solid blue line connecting data points marked with circles.
2. **Multiple-Choice (Orange Line):** A solid orange line connecting data points marked with circles.
### Detailed Analysis
**Trend Verification & Data Points (Approximate Values):**
* **Generation (Blue Line):**
* **Trend:** The line shows a general downward trend with significant fluctuations. It starts high, drops sharply, recovers partially, drops again, and then rises slightly.
* **Data Points (Iteration, Proportion):**
* Iteration 1: ~0.065 (Circle - Correct Flip)
* Iteration 2: ~0.035 (Circle - Correct Flip)
* Iteration 3: ~0.040 (Circle - Correct Flip)
* Iteration 4: ~0.010 (Circle - Correct Flip)
* Iteration 5: ~0.018 (Circle - Correct Flip)
* **Multiple-Choice (Orange Line):**
* **Trend:** The line shows a sharp initial drop followed by a stable, very low proportion near zero for the remaining iterations.
* **Data Points (Iteration, Proportion):**
* Iteration 1: ~0.018 (Circle - Correct Flip)
* Iteration 2: ~0.002 (Circle - Correct Flip)
* Iteration 3: ~0.001 (Circle - Correct Flip)
* Iteration 4: ~0.002 (Circle - Correct Flip)
* Iteration 5: ~0.001 (Circle - Correct Flip)
**Component Isolation & Marker Analysis:**
* **Header Region:** Contains only the chart title.
* **Main Chart Region:** Contains the two plotted lines, axes, and gridlines.
* **Legend Region:** Located in the top-left corner of the plot area. It defines the line colors and the meaning of the marker shapes.
* **Marker Consistency Check:** All visible data points on **both** the blue and orange lines are marked with **circles**. According to the legend, circles denote "Correct Flip". **No square markers ("Incorrect Flip") are present on the chart.** This indicates that for the data shown, all recorded flips were classified as correct.
### Key Observations
1. **Magnitude Difference:** The "Generation" method consistently exhibits a higher proportion of flips than the "Multiple-Choice" method across all iterations, with the largest gap at Iteration 1.
2. **Convergence at Low Values:** By Iterations 4 and 5, the proportion for "Generation" drops to a level (0.010-0.018) that is much closer to the consistently low "Multiple-Choice" values (~0.001-0.002).
3. **Absence of Incorrect Flips:** The most notable observation is the complete absence of data points marked as "Incorrect Flip" (squares). The legend includes this category, but it is not utilized in the plotted data.
4. **Volatility:** The "Generation" line is more volatile, showing larger swings between iterations compared to the relatively flat "Multiple-Choice" line after the first iteration.
### Interpretation
This chart appears to compare the stability or error-correction behavior of two different model response methods ("Generation" vs. "Multiple-Choice") over sequential iterations, likely in a testing or refinement loop. A "flip" probably refers to a change in the model's output or answer between iterations.
* **What the data suggests:** The "Multiple-Choice" method is highly stable from the outset, with a very low and decreasing rate of change (flips). The "Generation" method starts with a high rate of change, suggesting initial instability or active revision, but this rate decreases substantially over time, indicating convergence or stabilization.
* **Relationship between elements:** The higher initial flip rate for "Generation" may reflect its open-ended nature, allowing for more variation before settling. The low, flat line for "Multiple-Choice" suggests its constrained format leads to quicker consensus or less need for revision.
* **Notable Anomaly:** The legend's inclusion of an "Incorrect Flip" category that has no corresponding data points is a significant anomaly. This could mean: a) No incorrect flips occurred in this experiment, b) The data for incorrect flips was omitted from this specific visualization, or c) The legend is a generic template not fully customized for this chart. This absence prevents analysis of flip *quality* and limits the interpretation to flip *frequency* only. The chart effectively shows only the rate of "Correct Flips" for both methods.
</details>
(f) Gemini-2.0-Flash
Figure 8: Models' Correct and Incorrect Flips with SC on DisambiguationQA
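Each line in these charts is a series of per-step flip proportions, one value per refinement round. Extending the single-step classification to a whole run can be sketched as follows, assuming a hypothetical `flip_series` helper that takes one answer snapshot per iteration (iteration 0 being the initial response):

```python
from typing import List, Sequence, Tuple

def flip_series(answers_by_iter: Sequence[Sequence[str]],
                gold: Sequence[str]) -> Tuple[List[float], List[float], List[float]]:
    """Per-step flip proportions across a self-correction run.

    answers_by_iter[t] holds every item's answer after iteration t.
    Returns parallel lists (total, correct, incorrect), one entry per
    consecutive pair of snapshots, i.e. per refinement step.
    """
    n = len(gold)
    total, correct, incorrect = [], [], []
    for prev, curr in zip(answers_by_iter, answers_by_iter[1:]):
        f = c = w = 0
        for p, q, g in zip(prev, curr, gold):
            if p != q:
                f += 1
                if q == g and p != g:
                    c += 1      # wrong -> right
                elif p == g and q != g:
                    w += 1      # right -> wrong
        total.append(f / n)
        correct.append(c / n)
        incorrect.append(w / n)
    return total, correct, incorrect

# Two items (gold A, B) over three snapshots: step 1 fixes item 1,
# step 2 breaks item 2.
series = flip_series([["B", "B"], ["A", "B"], ["A", "A"]], ["A", "B"])
```

The three returned lists correspond directly to the solid "flip" lines and the dashed "Correct Flip"/"Incorrect Flip" lines plotted per model.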
<details>
<summary>x44.png Details</summary>

### Visual Description
## Line Chart: SmolLM2-1.7B - Proportion of Flips Over Iterations
### Overview
This is a line chart titled "SmolLM2-1.7B" that plots the "Proportion of Flips" against "Iterations" for four distinct data series. The chart tracks changes across five discrete iteration points (1 through 5). The visual style uses a white background with a light gray grid, solid and dashed lines, and distinct markers for each series.
### Components/Axes
* **Title:** "SmolLM2-1.7B" (centered at the top).
* **Y-Axis:** Labeled "Proportion of Flips". The scale runs from 0.00 to 0.07, with major tick marks at intervals of 0.01 (0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07).
* **X-Axis:** Labeled "Iterations". The scale shows discrete integer values: 1, 2, 3, 4, 5.
* **Legend:** Positioned in the top-right corner of the chart area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed black line with square markers.
### Detailed Analysis
**Data Series and Approximate Values:**
1. **Generation (Solid Blue Line):**
* **Trend:** Starts at a moderate level, remains flat, spikes sharply to a peak, then drops significantly and plateaus.
* **Data Points:**
* Iteration 1: ~0.04
* Iteration 2: ~0.04
* Iteration 3: ~0.07 (Peak)
* Iteration 4: ~0.02
* Iteration 5: ~0.02
2. **Multiple-Choice (Solid Orange Line):**
* **Trend:** Starts at a moderate level, then drops to zero and remains there.
* **Data Points:**
* Iteration 1: ~0.03
* Iteration 2: 0.00
* Iteration 3: 0.00
* Iteration 4: 0.00
* Iteration 5: 0.00
3. **Correct Flip (Dashed Blue Line with Circles):**
* **Trend:** Shows a slight dip followed by a recovery and a final minor decline. It generally hovers in the 0.03-0.04 range.
* **Data Points:**
* Iteration 1: ~0.04
* Iteration 2: ~0.03
* Iteration 3: ~0.03
* Iteration 4: ~0.04
* Iteration 5: ~0.03
4. **Incorrect Flip (Dashed Black Line with Squares):**
* **Trend:** Starts as the highest value, drops sharply, then fluctuates slightly around a lower level.
* **Data Points:**
* Iteration 1: ~0.06 (Highest initial value)
* Iteration 2: ~0.03
* Iteration 3: ~0.03
* Iteration 4: ~0.04
* Iteration 5: ~0.03
### Key Observations
* **Peak Anomaly:** The "Generation" series exhibits a dramatic, isolated peak at Iteration 3 (0.07), which is the highest single value recorded on the chart.
* **Zero Plateau:** The "Multiple-Choice" series drops to exactly 0.00 at Iteration 2 and shows no activity for the remainder of the tracked iterations.
* **Convergence:** By Iteration 5, three of the four series ("Correct Flip", "Incorrect Flip", and "Generation") converge within a narrow band between 0.02 and 0.03.
* **Initial Dominance:** At the start (Iteration 1), "Incorrect Flip" has the highest proportion, followed by "Generation" and "Correct Flip" (tied), with "Multiple-Choice" being the lowest.
### Interpretation
The chart appears to track the behavior of a language model (SmolLM2-1.7B) across iterative steps, likely during a training, fine-tuning, or evaluation process involving "flips" (which could refer to changes in model predictions, outputs, or internal states).
* **The "Generation" Spike:** The sharp peak at Iteration 3 suggests a significant event or phase in the process where the proportion of flips related to "Generation" surged. This could indicate a period of high instability, a targeted intervention, or a specific test condition applied at that iteration.
* **Cessation of "Multiple-Choice" Activity:** The drop to zero for "Multiple-Choice" flips after the first iteration implies that this particular type of flip was either resolved, became irrelevant, or was no longer measured after the initial step.
* **Stability of Flip Types:** The "Correct Flip" and "Incorrect Flip" series show relative stability after an initial adjustment (Iteration 1 to 2). Their values remain comparable, suggesting a consistent balance between correct and incorrect flip events throughout most of the process.
* **Process Convergence:** The convergence of multiple series by Iteration 5 may indicate that the system is reaching a steady state or that the iterative process is concluding, with different flip metrics settling into a similar, low-proportion range.
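The flip accounting interpreted above can be made concrete. Below is a minimal sketch of how per-iteration flip proportions could be computed, assuming access to each question's answer history across refinement iterations; the names `answers`, `gold`, and `flip_proportions` are illustrative, not taken from the paper's released code.

```python
def flip_proportions(answers, gold):
    """For each iteration t >= 1, compute the fraction of questions whose
    answer changed from iteration t-1, split into correct flips
    (wrong -> right) and incorrect flips (right -> wrong).

    answers: dict mapping question id -> list of answers, one per iteration
    gold:    dict mapping question id -> reference answer
    """
    n_iters = len(next(iter(answers.values())))
    n_q = len(answers)
    stats = []
    for t in range(1, n_iters):
        correct = incorrect = total = 0
        for q, history in answers.items():
            prev, curr = history[t - 1], history[t]
            if prev != curr:
                total += 1
                if curr == gold[q] and prev != gold[q]:
                    correct += 1        # beneficial change
                elif prev == gold[q] and curr != gold[q]:
                    incorrect += 1      # harmful change
        stats.append({
            "iteration": t,
            "flip": total / n_q,
            "correct_flip": correct / n_q,
            "incorrect_flip": incorrect / n_q,
        })
    return stats
```

Note that under this accounting a wrong-to-wrong change counts toward the total flip rate but toward neither sub-category, which is one way the sub-categories can fail to sum to the total.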
</details>
(a) SmolLM2-1.7B
<details>
<summary>x45.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-3B - Proportion of Flips Over Iterations
### Overview
The image is a line chart titled "Qwen2.5-3B". It plots the "Proportion of Flips" against "Iterations" for four distinct data series. The chart appears to track the performance or behavior of a model (likely the Qwen2.5-3B language model) across five iterative steps, measuring different types of "flips" or changes.
### Components/Axes
* **Chart Title:** "Qwen2.5-3B" (centered at the top).
* **X-Axis:** Labeled "Iterations". It has five discrete, equally spaced tick marks labeled 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of Flips". The scale ranges from 0.00 to 0.10, with major tick marks at 0.00, 0.02, 0.04, 0.06, 0.08, and 0.10.
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Dashed orange line.
3. **Correct Flip:** Dash-dot blue line (lighter blue than "Generation").
4. **Incorrect Flip:** Dotted black line.
### Detailed Analysis
The following data points are approximate values extracted by visually aligning each line's markers with the y-axis scale.
**1. Generation (Solid Blue Line)**
* **Trend:** Rises to a peak at iteration 2, then trends downward over the remaining iterations.
* **Data Points:**
* Iteration 1: ~0.015
* Iteration 2: ~0.030
* Iteration 3: ~0.020
* Iteration 4: ~0.020
* Iteration 5: ~0.010
**2. Multiple-Choice (Dashed Orange Line)**
* **Trend:** Highly volatile. Starts very high, drops sharply, recovers, then plummets.
* **Data Points:**
* Iteration 1: ~0.085 (Highest point on the entire chart)
* Iteration 2: ~0.075
* Iteration 3: ~0.040
* Iteration 4: ~0.050
* Iteration 5: ~0.005 (Lowest point for this series)
**3. Correct Flip (Dash-Dot Blue Line)**
* **Trend:** Peaks sharply at iteration 2, then declines.
* **Data Points:**
* Iteration 1: ~0.020
* Iteration 2: ~0.080 (Second-highest peak on the chart)
* Iteration 3: ~0.050
* Iteration 4: ~0.020
* Iteration 5: ~0.030
**4. Incorrect Flip (Dotted Black Line)**
* **Trend:** Shows a general upward trend, remaining flat and low through iteration 2, spiking at iteration 3, then dipping at iteration 4.
* **Data Points:**
* Iteration 1: ~0.010
* Iteration 2: ~0.010
* Iteration 3: ~0.050
* Iteration 4: ~0.020
* Iteration 5: ~0.035
### Key Observations
1. **Inverse Relationship at Iteration 2:** There is a dramatic divergence between "Correct Flip" (which peaks) and "Incorrect Flip" (which sits at its lowest value) at iteration 2. This suggests a significant event or evaluation at this step.
2. **Volatility of Multiple-Choice:** The "Multiple-Choice" proportion is the most unstable, starting as the dominant metric and ending as the lowest.
3. **Convergence at Iteration 5:** By the final iteration, three of the four metrics ("Generation", "Correct Flip", "Incorrect Flip") converge within a narrow band between approximately 0.010 and 0.035, while "Multiple-Choice" drops to near zero.
4. **Overall Low Proportions:** All measured proportions remain below 0.10 (10%), indicating these "flip" events are relatively rare occurrences within the model's iterations.
### Interpretation
This chart likely visualizes the internal dynamics or evaluation results of the Qwen2.5-3B model during a multi-step process (e.g., iterative refinement, chain-of-thought reasoning, or multi-turn interaction). The "Proportion of Flips" probably refers to the rate at which the model changes its output or answer between steps.
* The high initial "Multiple-Choice" proportion suggests the model frequently changes its selected option early on, but this behavior nearly vanishes by the end.
* The spike in "Correct Flip" at iteration 2, coupled with the low "Incorrect Flip," indicates a phase where the model was particularly effective at making beneficial changes to its output.
* The elevated "Incorrect Flip" values at iterations 3 and 5 are a potential concern, suggesting that as the process continues, the model may become more prone to making erroneous changes.
* The general decline in the "Generation" flip rate could imply the model's generated text stabilizes over iterations.
In summary, the data suggests the model undergoes a volatile early phase (iterations 1-2) where it makes significant corrections, followed by a later phase (iterations 3-5) where its behavior becomes less predictable, with an increasing risk of incorrect modifications. The process appears to conclude with most flip rates settling at a low level.
</details>
(b) Qwen2.5-3B
<details>
<summary>x46.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B - Proportion of Flips Over Iterations
### Overview
This is a line chart titled "Llama-3.1-8B" that plots the "Proportion of Flips" against "Iterations" (from 1 to 5). It compares four distinct data series, differentiated by line style, color, and marker shape. The chart appears to track the performance or behavior of a model (likely the Llama-3.1-8B language model) across sequential steps or trials.
### Components/Axes
* **Chart Title:** "Llama-3.1-8B" (centered at the top).
* **Y-Axis:**
* **Label:** "Proportion of Flips" (rotated vertically on the left).
* **Scale:** Linear scale from 0.02 to 0.14, with major tick marks at 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, and 0.14.
* **X-Axis:**
* **Label:** "Iterations" (centered at the bottom).
* **Scale:** Discrete integer values from 1 to 5.
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Dashed orange line.
3. **Correct Flip:** Dashed blue line with circular markers (●).
4. **Incorrect Flip:** Dashed orange line with square markers (■).
### Detailed Analysis
**Trend Verification & Data Points (Approximate Values):**
1. **Generation (Solid Blue Line):**
* **Trend:** Starts high, dips, recovers partially, then declines steadily.
* **Points:**
* Iteration 1: ~0.13
* Iteration 2: ~0.09
* Iteration 3: ~0.10
* Iteration 4: ~0.07
* Iteration 5: ~0.06
2. **Multiple-Choice (Dashed Orange Line):**
* **Trend:** Starts moderately high, drops sharply, rises, then fluctuates at a lower level.
* **Points:**
* Iteration 1: ~0.095
* Iteration 2: ~0.04
* Iteration 3: ~0.06
* Iteration 4: ~0.02
* Iteration 5: ~0.04
3. **Correct Flip (Dashed Blue Line with Circles):**
* **Trend:** Shows a consistent upward trend from zero.
* **Points:**
* Iteration 1: 0.00
* Iteration 2: ~0.02
* Iteration 3: ~0.04
* Iteration 4: ~0.06
* Iteration 5: ~0.06
4. **Incorrect Flip (Dashed Orange Line with Squares):**
* **Trend:** Starts at zero, rises to a peak, then declines.
* **Points:**
* Iteration 1: 0.00
* Iteration 2: ~0.02
* Iteration 3: ~0.05 (Peak)
* Iteration 4: ~0.04
* Iteration 5: ~0.03
### Key Observations
* **Convergence at Iteration 5:** The "Generation" and "Correct Flip" series converge at approximately 0.06 by the final iteration.
* **Peak of Incorrect Flips:** The "Incorrect Flip" series reaches its maximum value at Iteration 3, after which it begins to decrease.
* **Initial Disparity:** At Iteration 1, there is a large gap between the "Generation" proportion (~0.13) and the "Multiple-Choice" proportion (~0.095). This gap narrows significantly by Iteration 5.
* **Zero Start for Flip Categories:** Both "Correct Flip" and "Incorrect Flip" begin at 0.00 at Iteration 1, indicating no flips occurred at the start of the measured process.
### Interpretation
The chart likely illustrates the dynamics of a model's output "flips" (changes in response or prediction) during an iterative process, such as reinforcement learning, self-correction, or multi-step reasoning.
* **What the data suggests:** The "Generation" and "Multiple-Choice" lines may represent the overall flip rate for two different prompting or evaluation methods. The "Correct Flip" and "Incorrect Flip" lines break down the *nature* of these flips. The steady rise in "Correct Flip" suggests the model is increasingly making beneficial changes over iterations. The peak and subsequent decline in "Incorrect Flip" around iteration 3 could indicate a phase where the model initially makes more errors while exploring, but then learns to avoid them.
* **Relationship between elements:** The sum of "Correct Flip" and "Incorrect Flip" at any iteration does not equal the "Generation" or "Multiple-Choice" value. This implies that "flips" are a subset of the total changes measured by the other two metrics, or that the metrics are calculated differently. The convergence of "Generation" and "Correct Flip" at the end is notable, suggesting that by iteration 5, most flips in the Generation method are correct.
* **Notable anomaly:** The "Multiple-Choice" flip rate drops to its lowest point (~0.02) at Iteration 4, which is lower than both flip sub-categories at that point. This could indicate a moment of high stability or a specific characteristic of the multiple-choice evaluation at that step.
</details>
(c) Llama-3.1-8B
<details>
<summary>x47.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B - Proportion of Flips Over Iterations
### Overview
This is a line chart titled "Qwen2.5-14B" that plots the "Proportion of Flips" against "Iterations" for four different data series. The chart appears to track the performance or behavior of a model (likely the Qwen2.5-14B language model) across five discrete iterations, measuring the rate of "flips" (which could refer to changes in output, corrections, or errors) for different evaluation methods or categories.
### Components/Axes
- **Title:** "Qwen2.5-14B" (centered at the top).
- **Y-Axis:** Labeled "Proportion of Flips". The scale runs from 0.00 to 0.05, with major tick marks at intervals of 0.01 (0.00, 0.01, 0.02, 0.03, 0.04, 0.05).
- **X-Axis:** Labeled "Iterations". The scale shows discrete integer values from 1 to 5.
- **Legend:** Located in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Dashed orange line.
3. **Correct Flip:** Solid black line with circular markers.
4. **Incorrect Flip:** Dashed black line with square markers.
- **Grid:** A light gray grid is present for both major x and y ticks.
### Detailed Analysis
The following data points are approximate values extracted by visual inspection of the chart.
**1. Generation (Solid Blue Line):**
- **Trend:** Starts high and holds through iteration 2, dips significantly at iteration 3, recovers partially, then drops to zero.
- **Data Points:**
- Iteration 1: ~0.03
- Iteration 2: ~0.03
- Iteration 3: ~0.01
- Iteration 4: ~0.02
- Iteration 5: ~0.00
**2. Multiple-Choice (Dashed Orange Line):**
- **Trend:** Starts high, decreases, plateaus, then drops to zero.
- **Data Points:**
- Iteration 1: ~0.03
- Iteration 2: ~0.01
- Iteration 3: ~0.01
- Iteration 4: ~0.01
- Iteration 5: ~0.00
**3. Correct Flip (Solid Black Line, Circle Markers):**
- **Trend:** Shows a steady, monotonic decrease to zero.
- **Data Points:**
- Iteration 1: ~0.02
- Iteration 2: ~0.01
- Iteration 3: ~0.00
- Iteration 4: ~0.00
- Iteration 5: ~0.00
**4. Incorrect Flip (Dashed Black Line, Square Markers):**
- **Trend:** Starts moderate, dips, spikes dramatically to the chart's maximum, then falls sharply before a slight rise.
- **Data Points:**
- Iteration 1: ~0.02
- Iteration 2: ~0.01
- Iteration 3: ~0.05 (This is the highest point on the entire chart)
- Iteration 4: ~0.00
- Iteration 5: ~0.01
### Key Observations
1. **Peak Anomaly:** The most striking feature is the sharp spike in the "Incorrect Flip" series at Iteration 3, reaching the maximum y-axis value of 0.05. This is 5 times higher than its value at Iteration 2.
2. **Convergence to Zero:** Three of the four series ("Generation", "Multiple-Choice", "Correct Flip") converge to a proportion of 0.00 by Iteration 5. "Incorrect Flip" is the only series with a non-zero value at the final iteration.
3. **Initial Similarity:** At Iteration 1, the "Generation" and "Multiple-Choice" series start at the same point (~0.03), and the "Correct Flip" and "Incorrect Flip" series start at the same point (~0.02).
4. **Divergence at Iteration 3:** Iteration 3 is a critical point where all series show distinct behavior: "Incorrect Flip" peaks, "Generation" is at a local minimum, "Multiple-Choice" plateaus, and "Correct Flip" hits zero.
### Interpretation
The chart likely illustrates the dynamics of a model's self-correction or evaluation process over sequential iterations. The "Proportion of Flips" probably measures how often the model changes its initial answer or output.
- **What the data suggests:** The process appears to stabilize over time, as most flip proportions trend toward zero by the fifth iteration. However, the dramatic spike in "Incorrect Flip" at iteration 3 indicates a specific phase where the model becomes highly prone to making erroneous changes. This could be a point of over-correction or confusion in its reasoning process.
- **Relationship between elements:** The "Correct Flip" and "Incorrect Flip" series may be sub-categories of the flips measured in the "Generation" and "Multiple-Choice" tasks. The fact that "Correct Flip" steadily decreases to zero suggests the model stops making beneficial corrections early on. In contrast, the volatile "Incorrect Flip" series shows that harmful or erroneous corrections persist longer and exhibit unpredictable surges.
- **Notable anomaly:** The Iteration 3 spike for "Incorrect Flip" is the key finding. It suggests a non-linear, potentially problematic stage in the iterative process that warrants investigation. It might correlate with a specific type of task or a threshold in the model's confidence calibration.
- **Overall implication:** While the model's tendency to flip answers diminishes with more iterations (a sign of increasing stability), the presence of a mid-process spike in incorrect flips highlights a risk. Simply running more iterations does not guarantee improved accuracy; it may introduce new failure modes. The process requires careful monitoring, especially around the third iteration.
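The monitoring suggested above can be sketched as a simple early-stopping rule that halts refinement once the answer stabilizes, rather than always running a fixed number of iterations. This is an illustrative heuristic, not the paper's method; `refine` stands for a hypothetical single self-correction step.

```python
def refine_with_early_stop(initial_answer, refine, max_iters=5, patience=1):
    """Iteratively refine an answer, stopping once it has remained
    unchanged for `patience` consecutive iterations."""
    answer = initial_answer
    stable = 0
    for _ in range(max_iters):
        new_answer = refine(answer)
        if new_answer == answer:
            stable += 1
            if stable >= patience:
                break  # answer has stabilized; further flips are risky
        else:
            stable = 0
        answer = new_answer
    return answer
```

Stopping on stability avoids the failure mode described above, where additional iterations reintroduce incorrect flips after the process has already settled.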
</details>
(d) Qwen2.5-14B
<details>
<summary>x48.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B Performance Across Iterations
### Overview
This image is a line chart titled "DeepSeek-R1-Distill-Llama-8B". It plots the "Proportion of Flips" against "Iterations" for two distinct methods or conditions, labeled "Generation" and "Multiple-Choice". Each method's data is further broken down into "Correct Flip" and "Incorrect Flip" categories, represented by different line styles and markers. The chart appears to track the stability or error rate of a model's outputs over sequential iterations.
### Components/Axes
* **Chart Title:** "DeepSeek-R1-Distill-Llama-8B" (centered at the top).
* **Y-Axis:**
* **Label:** "Proportion of Flips" (rotated vertically on the left).
* **Scale:** Linear scale from 0.00 to 0.08, with major tick marks at 0.00, 0.02, 0.04, 0.06, and 0.08.
* **X-Axis:**
* **Label:** "Iterations" (centered at the bottom).
* **Scale:** Discrete integer scale from 1 to 5.
* **Legend:** Positioned at the top center of the plot area.
* **Line Styles & Colors:**
* **Generation:** Solid blue line.
* **Multiple-Choice:** Dashed orange line.
* **Markers (for Flip Type):**
* **Correct Flip:** Circle marker (●).
* **Incorrect Flip:** Square marker (■).
* **Interpretation:** The chart displays four data series by combining the method (line style/color) with the flip type (marker). For example, the solid blue line with circle markers represents the proportion of "Correct Flips" for the "Generation" method.
### Detailed Analysis
The chart tracks four series across 5 iterations. Values are approximate based on visual inspection of the chart.
**1. Generation - Correct Flip (Solid Blue Line, Circle Markers)**
* **Trend:** Shows significant volatility. Starts high, dips, peaks sharply, then declines before rising again.
* **Data Points (Approx.):**
* Iteration 1: ~0.055
* Iteration 2: ~0.020
* Iteration 3: ~0.063 (Peak)
* Iteration 4: ~0.045
* Iteration 5: ~0.065
**2. Generation - Incorrect Flip (Solid Blue Line, Square Markers)**
* **Trend:** Follows a similar volatile pattern to its "Correct Flip" counterpart but with generally lower values after the first iteration.
* **Data Points (Approx.):**
* Iteration 1: ~0.055 (Same starting point as Correct Flip)
* Iteration 2: ~0.020
* Iteration 3: ~0.030
* Iteration 4: ~0.055
* Iteration 5: ~0.030
**3. Multiple-Choice - Correct Flip (Dashed Orange Line, Circle Markers)**
* **Trend:** Exhibits a dramatic "V" shape. Rises to a peak, plummets to near zero, then recovers.
* **Data Points (Approx.):**
* Iteration 1: ~0.030
* Iteration 2: ~0.045 (Peak)
* Iteration 3: ~0.000 (Trough)
* Iteration 4: ~0.030
* Iteration 5: ~0.055
**4. Multiple-Choice - Incorrect Flip (Dashed Orange Line, Square Markers)**
* **Trend:** Shows a general downward trend with a mid-point dip, followed by a slight recovery.
* **Data Points (Approx.):**
* Iteration 1: ~0.050
* Iteration 2: ~0.040
* Iteration 3: ~0.010
* Iteration 4: ~0.030
* Iteration 5: ~0.030
### Key Observations
1. **High Volatility at Iteration 3:** This iteration is a critical point. The "Generation" method's "Correct Flip" proportion spikes to its maximum, while the "Multiple-Choice" method's "Correct Flip" proportion crashes to its minimum (~0.00). This suggests a major divergence in behavior between the two methods at this stage.
2. **Start and End Points:** At Iteration 1, the "Generation" method's correct and incorrect flip proportions are identical (~0.055). By Iteration 5, the "Multiple-Choice - Correct Flip" series (~0.055) has risen to rival the highest values on the chart.
3. **Method Comparison:** The "Generation" method (blue lines) generally maintains higher flip proportions than "Multiple-Choice" (orange lines) for most of the chart, except for the final iteration where "Multiple-Choice - Correct Flip" draws level.
4. **Flip Type Relationship:** For the "Generation" method, the correct and incorrect flip lines often move in tandem (e.g., both dip at Iteration 2). For "Multiple-Choice", their paths are more divergent, especially at Iteration 3.
### Interpretation
This chart likely visualizes the stability of a distilled language model's (DeepSeek-R1-Distill-Llama-8B) outputs when subjected to iterative refinement or testing. A "flip" probably refers to a change in the model's answer or output between iterations.
* **What the data suggests:** The "Generation" method appears to be more consistently active (higher overall flip rates) but volatile. The "Multiple-Choice" method shows a more dramatic failure mode at Iteration 3, where correct flips almost vanish, indicating a potential point of instability or a specific challenge in the task at that stage. Its recovery by Iteration 5 suggests resilience or adaptation.
* **How elements relate:** The direct comparison of two methods (Generation vs. Multiple-Choice) across the same iterative process allows for an evaluation of which approach is more stable or reliable. The breakdown into correct vs. incorrect flips adds a layer of quality assessment: not just whether the output changes, but whether the change is an improvement.
* **Notable anomaly:** The near-zero value for "Multiple-Choice - Correct Flip" at Iteration 3 is the most striking anomaly. This could indicate a systematic error, a particularly difficult test case, or a phase where the model's outputs became temporarily locked in an incorrect state before correcting course.
* **Underlying purpose:** This analysis is crucial for understanding model behavior in iterative settings (like chain-of-thought reasoning or self-correction). It helps identify which methods are prone to erratic changes and at which stages they are most vulnerable, guiding improvements in model training or prompting strategies.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x49.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Performance Over Iterations
### Overview
The image is a line chart titled "Gemini-2.0-Flash" that plots the "Proportion of Flips" against the number of "Iterations" (from 1 to 5). It compares four distinct data series, each represented by a different line style and color, as defined in a legend positioned in the top-right corner of the chart area. The chart uses a grid background for easier value estimation.
### Components/Axes
* **Chart Title:** "Gemini-2.0-Flash" (centered at the top).
* **X-Axis:** Labeled "Iterations". The axis markers are discrete integers: 1, 2, 3, 4, 5.
* **Y-Axis:** Labeled "Proportion of Flips". The axis scale ranges from 0.00 to 0.04, with major tick marks at intervals of 0.01 (0.00, 0.01, 0.02, 0.03, 0.04).
* **Legend (Top-Right):**
* **Generation:** Solid blue line.
* **Multiple-Choice:** Solid orange line.
* **Correct Flip:** Dashed blue line with circular markers.
* **Incorrect Flip:** Dashed black line with square markers.
### Detailed Analysis
The following data points are approximate, derived from visual inspection of the chart.
**1. Generation (Solid Blue Line):**
* **Trend:** Volatile: drops to zero at iteration 2, recovers, dips again, then rises steeply to the highest point on the chart.
* **Data Points:**
* Iteration 1: ~0.020
* Iteration 2: ~0.000
* Iteration 3: ~0.020
* Iteration 4: ~0.010
* Iteration 5: ~0.040
**2. Multiple-Choice (Solid Orange Line):**
* **Trend:** Starts at zero, peaks at iteration 2, then declines steadily.
* **Data Points:**
* Iteration 1: ~0.000
* Iteration 2: ~0.030
* Iteration 3: ~0.020
* Iteration 4: ~0.010
* Iteration 5: ~0.000
**3. Correct Flip (Dashed Blue Line with Circles):**
* **Trend:** Decreases from an initial value and then stabilizes at a low, constant level.
* **Data Points:**
* Iteration 1: ~0.020
* Iteration 2: ~0.010
* Iteration 3: ~0.010
* Iteration 4: ~0.010
* Iteration 5: ~0.010
**4. Incorrect Flip (Dashed Black Line with Squares):**
* **Trend:** Remains flat at the baseline (zero) across all iterations.
* **Data Points:**
* Iteration 1: 0.000
* Iteration 2: 0.000
* Iteration 3: 0.000
* Iteration 4: 0.000
* Iteration 5: 0.000
### Key Observations
1. **Divergent Paths:** The "Generation" and "Multiple-Choice" series show inverse behavior between iterations 1 and 2. "Generation" plummets while "Multiple-Choice" surges.
2. **Convergence at Iteration 4:** At iteration 4, three of the four series ("Generation", "Multiple-Choice", and "Correct Flip") converge at approximately the same value (~0.010).
3. **Final Divergence:** At the final iteration (5), the series diverge dramatically. "Generation" reaches its maximum, "Multiple-Choice" returns to zero, "Correct Flip" holds steady, and "Incorrect Flip" remains at zero.
4. **Zero Baseline:** The "Incorrect Flip" series shows no activity (0.000 proportion) throughout the entire experiment.
5. **Peak Values:** The highest recorded proportion is for "Generation" at iteration 5 (~0.040). The second-highest is for "Multiple-Choice" at iteration 2 (~0.030).
### Interpretation
This chart appears to track the performance or behavior of a system (likely an AI model named "Gemini-2.0-Flash") across five iterative steps or trials. The metric is the "Proportion of Flips," which could refer to changes in output, corrections, or state transitions.
* **"Generation" vs. "Multiple-Choice":** These two methods exhibit a trade-off. The "Multiple-Choice" approach has an early peak in flip proportion but then diminishes, suggesting it may be effective for initial adjustments but not sustained change. The "Generation" method shows a volatile but ultimately strong upward trend, indicating it becomes increasingly active or effective in causing "flips" by the final iteration.
* **"Correct Flip" Stability:** The "Correct Flip" series stabilizes at a low, non-zero value after the first iteration. This suggests a baseline rate of beneficial or intended changes is maintained consistently after an initial drop.
* **Absence of "Incorrect Flips":** The flatline at zero for "Incorrect Flip" is a significant positive indicator. It implies that whatever process is being measured did not produce any undesirable or erroneous "flips" during these five iterations.
* **Overall Narrative:** The data suggests that while different strategies ("Generation" vs. "Multiple-Choice") have different temporal profiles of activity, the system avoids incorrect outcomes. The dramatic rise of "Generation" at the end could indicate a successful learning curve or a phase where the model becomes particularly adept at generating changes. The convergence at iteration 4 might represent a pivotal point where different approaches temporarily yield similar levels of activity before diverging again.
</details>
(f) Gemini-2.0-Flash
Figure 9: Models Correct and Incorrect Flips on Baseline on tinyTruthfulQA
<details>
<summary>x50.png Details</summary>

### Visual Description
## Line Chart: SmolLM2-1.7B - Proportion of Flips Over Iterations
### Overview
This is a line chart titled "SmolLM2-1.7B" that plots the "Proportion of Flips" against "Iterations" for four distinct data series. The chart tracks the performance or behavior of a model (likely a language model) across five sequential iterations, measuring different types of "flips" or changes.
### Components/Axes
* **Chart Title:** "SmolLM2-1.7B" (Top Center)
* **X-Axis:** Labeled "Iterations". It has discrete, evenly spaced markers for values 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of Flips". The scale is linear, ranging from 0.00 to 0.12, with major tick marks at intervals of 0.02 (0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12).
* **Legend:** Positioned in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed orange line with circular markers.
### Detailed Analysis
**Data Series Trends & Approximate Values:**
1. **Correct Flip (Dashed Blue Line):**
* **Trend:** Shows a steep, overall downward trend. It starts as the highest value and ends near zero.
* **Data Points:**
* Iteration 1: ~0.12 (Highest point on the entire chart)
* Iteration 2: ~0.05
* Iteration 3: ~0.05 (Plateau from Iteration 2)
* Iteration 4: ~0.00
* Iteration 5: ~0.00
2. **Generation (Solid Blue Line):**
* **Trend:** Fluctuates with a notable dip at Iteration 4.
* **Data Points:**
* Iteration 1: ~0.05
* Iteration 2: ~0.045
* Iteration 3: ~0.05
* Iteration 4: ~0.00 (Sharp drop, matches the Incorrect Flip value)
* Iteration 5: ~0.035
3. **Multiple-Choice (Solid Orange Line):**
* **Trend:** Rises to a peak at Iteration 2, then declines unevenly, ending at zero.
* **Data Points:**
* Iteration 1: ~0.02
* Iteration 2: ~0.04 (Peak for this series)
* Iteration 3: ~0.015
* Iteration 4: ~0.035
* Iteration 5: ~0.00
4. **Incorrect Flip (Dashed Orange Line):**
* **Trend:** Remains consistently low, near the baseline, with minor fluctuations.
* **Data Points:**
* Iteration 1: ~0.02
* Iteration 2: ~0.00
* Iteration 3: ~0.01
* Iteration 4: ~0.00
* Iteration 5: ~0.005
### Key Observations
* **Dominant Initial Value:** The "Correct Flip" proportion is overwhelmingly high at the first iteration (0.12), more than double any other series at that point.
* **Convergence at Iteration 4:** At Iteration 4, three of the four series ("Generation", "Correct Flip", "Incorrect Flip") converge at or very near 0.00. This is a significant anomaly or event point in the sequence.
* **Inverse Relationship (Early Iterations):** Between Iterations 1 and 2, as the "Correct Flip" proportion drops sharply (0.12 to 0.05), the "Multiple-Choice" proportion rises (0.02 to 0.04).
* **Final State (Iteration 5):** By the final iteration, only the "Generation" series shows a non-trivial proportion (~0.035). All other series are at or below ~0.005.
### Interpretation
The chart likely visualizes the internal dynamics of the SmolLM2-1.7B model during a multi-step process (e.g., iterative refinement, training, or evaluation). The "Proportion of Flips" could refer to changes in model outputs, predictions, or internal states.
* **What the data suggests:** The process begins with a high rate of "Correct Flips," which rapidly diminishes, suggesting the model quickly stabilizes or converges on correct answers. The "Generation" metric, which may represent novel or sampled outputs, shows more volatility, with a near-total collapse at Iteration 4 before a partial recovery. The consistently low "Incorrect Flip" rate is a positive indicator, suggesting the model rarely makes erroneous changes.
* **How elements relate:** The inverse movement between "Correct Flip" and "Multiple-Choice" in early iterations could indicate a trade-off or transition in the model's strategy, perhaps shifting from correcting itself to exploring multiple-choice options. The universal low point at Iteration 4 is critical; it may represent a reset, a point of maximum stability, or a specific intervention in the process.
* **Notable anomalies:** The sharp, synchronized drop to near-zero for three metrics at Iteration 4 is the most striking feature. This is not a gradual trend but a discrete event. The subsequent partial recovery of only the "Generation" metric suggests that whatever caused the Iteration 4 event primarily suppressed non-generative behaviors.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x51.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-3B
### Overview
The image is a line chart titled "Qwen2.5-3B" that plots the "Proportion of Flips" against the number of "Iterations" (from 1 to 5). It compares four different data series, distinguished by color and line style, showing how their respective proportions change over five iterative steps.
### Components/Axes
* **Title:** "Qwen2.5-3B" (centered at the top).
* **Y-Axis:** Labeled "Proportion of Flips". The scale runs from 0 to 0.14, with major tick marks at intervals of 0.02 (0, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14).
* **X-Axis:** Labeled "Iterations". The scale shows discrete integer values from 1 to 5.
* **Legend:** Positioned at the top-center of the chart area. It defines four series:
* **Generation:** Solid blue line.
* **Multiple-Choice:** Solid orange line.
* **Correct Flip:** Dashed blue line.
* **Incorrect Flip:** Dashed orange line.
### Detailed Analysis
The following data points are approximate values extracted from the chart. The trend for each series is described first, followed by the estimated values per iteration.
1. **Generation (Solid Blue Line)**
* **Trend:** Shows a sharp initial decline followed by a more gradual decrease.
* **Data Points (Approximate):**
* Iteration 1: 0.11
* Iteration 2: 0.05
* Iteration 3: 0.04
* Iteration 4: 0.035
* Iteration 5: 0.03
2. **Multiple-Choice (Solid Orange Line)**
* **Trend:** Exhibits significant fluctuation, with a sharp drop, a rise, another drop, and a final rise.
* **Data Points (Approximate):**
* Iteration 1: 0.045
* Iteration 2: 0.01
* Iteration 3: 0.05
* Iteration 4: 0.01
* Iteration 5: 0.055
3. **Correct Flip (Dashed Blue Line)**
* **Trend:** Shows a steady, monotonic decline across all iterations.
* **Data Points (Approximate):**
* Iteration 1: 0.11
* Iteration 2: 0.09
* Iteration 3: 0.065
* Iteration 4: 0.045
* Iteration 5: 0.035
4. **Incorrect Flip (Dashed Orange Line)**
* **Trend:** Fluctuates, with an initial rise, a sharp dip, and a subsequent recovery.
* **Data Points (Approximate):**
* Iteration 1: 0.045
* Iteration 2: 0.06
* Iteration 3: 0.015
* Iteration 4: 0.025
* Iteration 5: 0.04
### Key Observations
* **Initial Convergence:** At Iteration 1, the "Generation" and "Correct Flip" series start at the same high point (~0.11), while the "Multiple-Choice" and "Incorrect Flip" series start at the same lower point (~0.045).
* **Diverging Paths:** After the first iteration, the paths of the series diverge significantly. The blue lines (Generation, Correct Flip) generally trend downward, while the orange lines (Multiple-Choice, Incorrect Flip) show more volatility.
* **Final Values:** By Iteration 5, all series have converged to a narrower range between approximately 0.03 and 0.055. The "Multiple-Choice" series ends at the highest value (~0.055), while "Generation" ends at the lowest (~0.03).
* **Volatility:** The "Multiple-Choice" series displays the most dramatic swings between iterations, particularly between iterations 2, 3, and 4.
### Interpretation
This chart likely visualizes the performance or behavior of a model (Qwen2.5-3B) over successive refinement or evaluation iterations. The "Proportion of Flips" metric suggests a measure of change, correction, or error rate.
* The steady decline of the **Correct Flip** (dashed blue) line indicates that the rate of beneficial or accurate changes decreases as the model iterates, possibly suggesting convergence or diminishing returns.
* The volatile path of the **Multiple-Choice** (solid orange) line implies instability or a non-monotonic response in that specific evaluation condition. Its final rise is an outlier compared to the general downward trend of other series.
* The parallel start of **Generation/Correct Flip** and **Multiple-Choice/Incorrect Flip** suggests these pairs may be intrinsically linked at the outset of the process. The subsequent divergence highlights how different evaluation methods (Generation vs. Multiple-Choice) or flip types (Correct vs. Incorrect) evolve differently over time.
* Overall, the data demonstrates that iterative processing reduces the proportion of flips for most conditions, but the path and final state are highly dependent on the specific category being measured.
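If "Proportion of Flips" is indeed the fraction of questions whose answer changes between consecutive refinement rounds, the metric can be sketched as follows; the function name, data layout, and the correct/incorrect classification rule are illustrative assumptions, not the paper's released code:

```python
from typing import Dict, List


def flip_stats(answers: List[List[str]], gold: List[str]) -> List[Dict[str, float]]:
    """Per-iteration flip proportions for a batch of questions.

    answers[q][t] is the model's answer to question q at refinement round t
    (t = 0 is the initial answer); gold[q] is the reference answer.
    A "correct flip" goes wrong -> right; an "incorrect flip" right -> wrong.
    """
    n_q = len(answers)
    n_rounds = len(answers[0])
    stats = []
    for t in range(1, n_rounds):
        flips = correct = incorrect = 0
        for q in range(n_q):
            prev, curr = answers[q][t - 1], answers[q][t]
            if curr != prev:
                flips += 1
                if curr == gold[q] and prev != gold[q]:
                    correct += 1    # wrong -> right
                elif prev == gold[q] and curr != gold[q]:
                    incorrect += 1  # right -> wrong
        stats.append({
            "iteration": t,
            "flip": flips / n_q,
            "correct_flip": correct / n_q,
            "incorrect_flip": incorrect / n_q,
        })
    return stats
```

Under this reading, each panel's solid lines plot the overall flip proportion per task format, while the dashed lines decompose those flips into beneficial and harmful changes.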
</details>
(b) Qwen2.5-3B
<details>
<summary>x52.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B
### Overview
The image displays a line chart titled "Llama-3.1-8B," plotting the "Proportion of Flips" across five sequential "Iterations." The chart compares four distinct data series, differentiated by color (blue vs. orange) and line style (solid vs. dashed). The overall visual suggests a comparative analysis of model behavior or performance metrics over iterative steps.
### Components/Axes
* **Chart Title:** "Llama-3.1-8B" (centered at the top).
* **Y-Axis:**
* **Label:** "Proportion of Flips" (rotated vertically on the left).
* **Scale:** Linear scale ranging from 0.025 to 0.200, with major tick marks at 0.025, 0.050, 0.075, 0.100, 0.125, 0.150, 0.175, and 0.200.
* **X-Axis:**
* **Label:** "Iterations" (centered at the bottom).
* **Scale:** Discrete integer values from 1 to 5.
* **Legend:** Positioned in the top-right corner of the plot area. It defines four series:
1. **Generation:** Blue solid line with circular markers.
2. **Multiple-Choice:** Orange solid line with circular markers.
3. **Correct Flip:** Blue dashed line with circular markers.
4. **Incorrect Flip:** Orange dashed line with circular markers.
### Detailed Analysis
Data points are approximate values read from the chart's grid.
**1. Generation (Blue Solid Line):**
* **Trend:** Rises sharply to a peak at Iteration 2, then declines steadily.
* **Data Points:**
* Iteration 1: ~0.080
* Iteration 2: ~0.160 (Peak)
* Iteration 3: ~0.140
* Iteration 4: ~0.105
* Iteration 5: ~0.090
**2. Multiple-Choice (Orange Solid Line):**
* **Trend:** Starts very high, drops dramatically at Iteration 2, then shows a gradual, slight recovery.
* **Data Points:**
* Iteration 1: ~0.175 (Highest point on the chart)
* Iteration 2: ~0.040 (Sharp drop)
* Iteration 3: ~0.050
* Iteration 4: ~0.040
* Iteration 5: ~0.065
**3. Correct Flip (Blue Dashed Line):**
* **Trend:** Increases to a peak at Iteration 4, with a slight dip at Iteration 3.
* **Data Points:**
* Iteration 1: ~0.030
* Iteration 2: ~0.095
* Iteration 3: ~0.080
* Iteration 4: ~0.115 (Peak)
* Iteration 5: ~0.090
**4. Incorrect Flip (Orange Dashed Line):**
* **Trend:** Fluctuates, with a notable peak at Iteration 2 and a dip at Iteration 4.
* **Data Points:**
* Iteration 1: ~0.030
* Iteration 2: ~0.105 (Peak)
* Iteration 3: ~0.100
* Iteration 4: ~0.040 (Dip)
* Iteration 5: ~0.065
### Key Observations
1. **Inverse Initial Behavior:** At Iteration 1, the "Multiple-Choice" proportion is the highest (~0.175), while "Correct Flip" and "Incorrect Flip" are tied for the lowest (~0.030).
2. **Dramatic Shift at Iteration 2:** This iteration shows the most significant changes. "Multiple-Choice" plummets, "Generation" peaks, and both "Flip" metrics see substantial increases.
3. **Convergence at Iteration 5:** By the final iteration, the values for all four series converge within a narrower band between approximately 0.065 and 0.090.
4. **Line Style Correlation:** The dashed lines ("Correct Flip" and "Incorrect Flip") generally exhibit more volatility (sharper peaks and dips) compared to the solid lines ("Generation" and "Multiple-Choice") after the initial iteration.
### Interpretation
This chart likely visualizes the behavior of the Llama-3.1-8B language model across iterative refinement or testing steps. The "Proportion of Flips" metric suggests a measure of change or correction in model outputs.
* **Method Comparison:** The starkly different trajectories of "Generation" (blue solid) and "Multiple-Choice" (orange solid) imply these are two distinct prompting or evaluation methods. The "Multiple-Choice" method starts with a very high flip rate that quickly stabilizes, while the "Generation" method's flip rate peaks later and declines more gradually.
* **Flip Analysis:** The separation of flips into "Correct" and "Incorrect" provides insight into the quality of the model's changes. The fact that "Correct Flip" peaks at Iteration 4, while "Incorrect Flip" peaks earlier at Iteration 2, could indicate that the model's ability to make beneficial corrections improves with more iterations, while erroneous changes are more prominent earlier in the process.
* **Convergence:** The convergence of all metrics by Iteration 5 suggests the model's behavior stabilizes, with the proportion of flips (both correct and incorrect) settling into a similar range regardless of the initial method (Generation vs. Multiple-Choice). This could point to a point of diminishing returns or a stable state in the iterative process.
* **Underlying Process:** The data tells a story of initial disparity and high volatility that gradually resolves into a more uniform, stable state. This pattern is common in optimization, training, or iterative refinement processes where early steps cause large adjustments that taper off over time.
</details>
(c) Llama-3.1-8B
<details>
<summary>x53.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B Flip Proportions Over Iterations
### Overview
This is a line chart titled "Qwen2.5-14B" that plots the "Proportion of Flips" against "Iterations" (from 1 to 5). It compares two primary methods, "Generation" and "Multiple-Choice," and also tracks two specific flip outcomes, "Correct Flip" and "Incorrect Flip," across the iterative process. The chart uses distinct line styles and colors to differentiate the four data series.
### Components/Axes
* **Chart Title:** "Qwen2.5-14B" (located at the top center).
* **Y-Axis:**
* **Label:** "Proportion of Flips"
* **Scale:** Linear, ranging from 0.00 to 0.06, with major tick marks at 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, and 0.06.
* **X-Axis:**
* **Label:** "Iterations"
* **Scale:** Discrete, with integer values from 1 to 5.
* **Legend:** Positioned in the top-left corner of the plot area. It contains four entries:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed orange line with square markers.
### Detailed Analysis
**Data Series Trends and Approximate Values:**
1. **Generation (Solid Blue Line):**
* **Trend:** Shows a steady, stepwise decline over the iterations.
* **Values:**
* Iteration 1: ~0.030
* Iteration 2: ~0.020
* Iteration 3: ~0.020
* Iteration 4: ~0.020
* Iteration 5: ~0.000
2. **Multiple-Choice (Solid Orange Line):**
* **Trend:** Decreases initially, reaches a minimum at iteration 3, then shows a slight recovery.
* **Values:**
* Iteration 1: ~0.020
* Iteration 2: ~0.010
* Iteration 3: ~0.000
* Iteration 4: ~0.010
* Iteration 5: ~0.015
3. **Correct Flip (Dashed Blue Line with Circles):**
* **Trend:** Exhibits a sharp, consistent downward trend, converging to zero.
* **Values:**
* Iteration 1: ~0.060
* Iteration 2: ~0.040
* Iteration 3: ~0.020
* Iteration 4: ~0.020
* Iteration 5: ~0.000
4. **Incorrect Flip (Dashed Orange Line with Squares):**
* **Trend:** Remains at or near zero for the first four iterations, then shows a sudden increase at the final iteration.
* **Values:**
* Iteration 1: ~0.000
* Iteration 2: ~0.000
* Iteration 3: ~0.000
* Iteration 4: ~0.000
* Iteration 5: ~0.010
### Key Observations
* **Dominant Trend:** The "Correct Flip" proportion starts as the highest value (0.06) and decreases most dramatically, suggesting a strong initial correction mechanism that diminishes over time.
* **Convergence:** Both the "Generation" and "Correct Flip" series converge to 0.00 by the fifth iteration.
* **Late Anomaly:** The "Incorrect Flip" series is flat until iteration 5, where it spikes to ~0.01. This is the only series that increases at the final step.
* **Method Comparison:** The "Multiple-Choice" method ends with a higher flip proportion (~0.015) than the "Generation" method (~0.000) at iteration 5.
* **Relationship:** The "Correct Flip" line (dashed blue) sits at or above the "Generation" line (solid blue) for the first four iterations, indicating that the proportion of correct flips is at least as high as the overall generation flip proportion during that period.
### Interpretation
The chart likely illustrates the performance of the Qwen2.5-14B model across iterative refinement or evaluation steps. The "Proportion of Flips" probably refers to the rate at which the model changes its output (a "flip") from one iteration to the next.
* **Learning/Refinement Curve:** The steep decline in "Correct Flip" suggests the model makes many beneficial corrections early on, but these become less frequent as it stabilizes. The overall decline in flip rates for both "Generation" and "Correct Flip" indicates the model's outputs are becoming more consistent and confident over iterations.
* **Emergence of Errors:** The sudden appearance of "Incorrect Flip" at iteration 5 is a critical observation. It implies that after a period of stable or improving performance, a new type of error or instability emerges. This could be due to over-optimization, encountering a difficult edge case, or a breakdown in the iterative process.
* **Methodological Difference:** The "Multiple-Choice" method shows a different pattern, with a dip and partial recovery. Its final value being higher than "Generation" suggests that this method may be more prone to late-stage changes or instability compared to the standard generation approach.
* **Overall Narrative:** The data tells a story of initial high activity (many correct flips), leading to convergence and stability, followed by a potential failure mode (incorrect flips) in the final step. This highlights the importance of monitoring not just the rate of change, but the *quality* of changes throughout an iterative process.
</details>
(d) Qwen2.5-14B
<details>
<summary>x54.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B - Proportion of Flips Over Iterations
### Overview
The image is a line chart displaying the performance of a model named "DeepSeek-R1-Distill-Llama-8B" across five iterations. It tracks the "Proportion of Flips" for four distinct categories, comparing two primary methods ("Generation" and "Multiple-Choice") and two specific flip outcomes ("Correct Flip" and "Incorrect Flip").
### Components/Axes
* **Chart Title:** "DeepSeek-R1-Distill-Llama-8B" (centered at the top).
* **X-Axis:** Labeled "Iterations". It has five discrete, equally spaced tick marks labeled 1, 2, 3, 4, and 5.
* **Y-Axis:** Labeled "Proportion of Flips". The scale ranges from 0.00 to 0.06, with major tick marks at intervals of 0.01 (0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06).
* **Legend:** Located in the top-right corner of the plot area. It defines four data series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line.
4. **Incorrect Flip:** Dashed orange line.
### Detailed Analysis
**Data Series Trends and Approximate Values:**
1. **Generation (Solid Blue Line):**
* **Trend:** Starts low, rises sharply to a peak at iteration 2, then declines steadily through iterations 3 and 4, with a slight recovery at iteration 5.
* **Approximate Values:**
* Iteration 1: ~0.00
* Iteration 2: ~0.055 (Peak)
* Iteration 3: ~0.02
* Iteration 4: ~0.01
* Iteration 5: ~0.02
2. **Multiple-Choice (Solid Orange Line):**
* **Trend:** Shows a fluctuating pattern. It starts at a moderate level, drops to near zero, rises slightly, dips again, and ends at a moderate level similar to its start.
* **Approximate Values:**
* Iteration 1: ~0.02
* Iteration 2: ~0.00 (Trough)
* Iteration 3: ~0.01
* Iteration 4: ~0.01
* Iteration 5: ~0.02
3. **Correct Flip (Dashed Blue Line):**
* **Trend:** Begins very low, increases to a peak at iteration 3, then decreases through iterations 4 and 5.
* **Approximate Values:**
* Iteration 1: ~0.00
* Iteration 2: ~0.04
* Iteration 3: ~0.055 (Peak)
* Iteration 4: ~0.04
* Iteration 5: ~0.035
4. **Incorrect Flip (Dashed Orange Line):**
* **Trend:** Starts at its highest point, drops sharply to a low level, and remains relatively flat and low for the remaining iterations.
* **Approximate Values:**
* Iteration 1: ~0.04 (Peak)
* Iteration 2: ~0.01
* Iteration 3: ~0.01
* Iteration 4: ~0.01
* Iteration 5: ~0.01
### Key Observations
* **Peak Performance:** The highest recorded proportion of flips (~0.055) occurs for two different series at different times: "Generation" peaks at iteration 2, and "Correct Flip" peaks at iteration 3.
* **Initial Anomaly:** The "Incorrect Flip" series has its maximum value at the very first iteration, which is notably higher than its values for all subsequent iterations.
* **Convergence at Iteration 4:** At iteration 4, the "Generation" and "Multiple-Choice" lines converge at approximately the same low value (~0.01).
* **Diverging Paths:** The "Correct Flip" (dashed blue) and "Incorrect Flip" (dashed orange) lines show opposite trends in the early iterations. "Correct Flip" rises from iteration 1 to 3, while "Incorrect Flip" falls sharply from iteration 1 to 2.
* **Final State:** By iteration 5, the "Correct Flip" proportion remains the highest among all series, while "Multiple-Choice" and "Generation" have recovered to similar, moderate levels.
### Interpretation
This chart likely visualizes the behavior of a language model (DeepSeek-R1-Distill-Llama-8B) during a self-correction or refinement process over multiple iterations. The "Proportion of Flips" probably refers to the rate at which the model changes its initial answer.
* **Method Comparison:** The "Generation" method (solid blue) shows a sharp burst of flips at iteration 2 that quickly diminishes, suggesting early, aggressive self-correction that stabilizes. The "Multiple-Choice" method (solid orange) maintains a lower, more stable flip rate throughout.
* **Quality of Corrections:** The "Correct Flip" (dashed blue) series is crucial. Its rise to a peak at iteration 3 indicates that the model's self-corrections were most frequently *improving* its answers during the middle phase of the process. The subsequent decline suggests diminishing returns or stabilization.
* **Error Introduction:** The high initial "Incorrect Flip" (dashed orange) rate at iteration 1 is a significant finding. It implies that the model's first attempt at self-correction was often detrimental, introducing errors. This rate drops dramatically and stays low, indicating the model quickly learns to avoid making bad corrections.
* **Overall Process Narrative:** The data suggests a process where the model initially makes many changes, some of which are harmful (high Incorrect Flip at iter 1). It then enters a phase of more beneficial self-correction (rising Correct Flip, peaking at iter 3). Finally, the system stabilizes, with lower overall flip rates and a sustained, though reduced, rate of beneficial corrections. The convergence of the two primary methods at iteration 4 might indicate a point where different correction strategies yield similar, minimal change.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x55.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Flip Proportions Over Iterations
### Overview
The image is a line chart titled "Gemini-2.0-Flash" that plots the "Proportion of Flips" against "Iterations" (1 through 5). It compares four different metrics or conditions, represented by distinct lines. The chart appears to track the frequency of certain "flip" events (likely changes in model output or decision) across sequential iterations for a system named Gemini-2.0-Flash.
### Components/Axes
* **Title:** "Gemini-2.0-Flash" (centered at the top).
* **Y-Axis:** Label is "Proportion of Flips". Scale ranges from 0.00 to 0.08, with major tick marks at 0.00, 0.02, 0.04, 0.06, and 0.08.
* **X-Axis:** Label is "Iterations". Discrete values are marked at 1, 2, 3, 4, and 5.
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed black line with square markers.
### Detailed Analysis
The following data points are approximate visual estimates from the chart.
**1. Generation (Solid Blue Line):**
* **Trend:** Starts very high, drops sharply, stays low through the middle iterations, then rebounds to its initial level at the final iteration.
* **Data Points:**
* Iteration 1: ~0.070
* Iteration 2: ~0.020
* Iteration 3: ~0.010
* Iteration 4: ~0.020
* Iteration 5: ~0.070
**2. Multiple-Choice (Solid Orange Line):**
* **Trend:** Starts low, peaks at iteration 2, then generally declines.
* **Data Points:**
* Iteration 1: ~0.030
* Iteration 2: ~0.040
* Iteration 3: ~0.020
* Iteration 4: ~0.000
* Iteration 5: ~0.010
**3. Correct Flip (Dashed Blue Line with Circles):**
* **Trend:** Fluctuates, dipping at iteration 2, rebounding at iteration 3, then declining to its lowest point.
* **Data Points:**
* Iteration 1: ~0.055
* Iteration 2: ~0.030
* Iteration 3: ~0.060
* Iteration 4: ~0.040
* Iteration 5: ~0.020
**4. Incorrect Flip (Dashed Black Line with Squares):**
* **Trend:** Starts high, dips, peaks at iteration 3, then declines.
* **Data Points:**
* Iteration 1: ~0.065
* Iteration 2: ~0.035
* Iteration 3: ~0.060
* Iteration 4: ~0.040
* Iteration 5: ~0.020
### Key Observations
1. **Inverse Relationship (Iteration 2):** At iteration 2, the "Generation" proportion plummets while the "Multiple-Choice" proportion reaches its peak.
2. **Convergence of Flips:** The "Correct Flip" and "Incorrect Flip" lines follow very similar paths from iteration 3 onward, converging at the same value (~0.040) at iteration 4 and ending at the same low value (~0.020) at iteration 5.
3. **High Initial Volatility:** The first two iterations show the most dramatic changes and divergence between the different metrics.
4. **Final State:** By iteration 5, the "Generation" proportion has returned to a high level similar to its start, while all other metrics ("Multiple-Choice", "Correct Flip", "Incorrect Flip") are at or near their lowest points.
### Interpretation
This chart likely visualizes the behavior of an AI model (Gemini-2.0-Flash) during a multi-step evaluation or refinement process. The "Proportion of Flips" probably measures how often the model changes its answer or output between iterations.
* **"Generation" vs. "Multiple-Choice":** The sharp inverse movement at iteration 2 suggests a fundamental shift in the model's behavior or the task's nature. It may indicate a transition from a generative, open-ended response mode to a more constrained, selection-based mode.
* **Flip Accuracy:** The close tracking of "Correct Flip" and "Incorrect Flip" lines, especially after iteration 3, is notable. It suggests that as the process continues, the model's tendency to change its mind becomes less discriminating: it becomes equally likely to flip to a correct or incorrect answer. The overall downward trend in flips after iteration 3 indicates the model's outputs are stabilizing.
* **Process Narrative:** The data tells a story of initial high instability and exploration (high flip rates), a major mode shift at iteration 2, followed by a period of converging and decreasing flip activity, leading to a more stable final state. The resurgence of the "Generation" flip rate at the end, while others remain low, is an anomaly that might indicate a final generative step or a different type of evaluation at iteration 5.
</details>
(f) Gemini-2.0-Flash
Figure 10: Models Correct and Incorrect Flips on CoT on tinyTruthfulQA
<details>
<summary>x56.png Details</summary>

### Visual Description
## Line Chart: SmolLM2-1.7B Flip Proportions Over Iterations
### Overview
The image is a line chart titled "SmolLM2-1.7B". It plots the "Proportion of Flips" against the number of "Iterations" (from 1 to 5) for four distinct data series. The chart appears to track the stability or change rate of a model's outputs across sequential evaluation or training iterations.
### Components/Axes
* **Chart Title:** SmolLM2-1.7B (Top Center)
* **Y-Axis:**
* **Label:** Proportion of Flips
* **Scale:** Linear, from 0.00 to 0.04, with major tick marks at 0.00, 0.01, 0.02, 0.03, and 0.04.
* **X-Axis:**
* **Label:** Iterations
* **Scale:** Discrete, with markers at integers 1, 2, 3, 4, and 5.
* **Legend:** Located in the top-right corner of the plot area.
* **Generation:** Blue solid line with circle markers.
* **Multiple-Choice:** Orange dashed line with square markers.
* **Correct Flip:** Gray dotted line with diamond markers.
* **Incorrect Flip:** Black dash-dot line with 'x' markers.
### Detailed Analysis
**Trend Verification & Data Point Extraction:**
1. **Generation (Blue, solid line, circles):**
* **Trend:** Starts at the highest point, drops sharply, then stabilizes near zero.
* **Data Points:**
* Iteration 1: ~0.033
* Iteration 2: ~0.000
* Iteration 3: ~0.010
* Iteration 4: ~0.000
* Iteration 5: ~0.000
2. **Multiple-Choice (Orange, dashed line, squares):**
* **Trend:** Starts high, decreases steadily, and flattens out after iteration 3.
* **Data Points:**
* Iteration 1: ~0.030
* Iteration 2: ~0.020
* Iteration 3: ~0.010
* Iteration 4: ~0.000
* Iteration 5: ~0.000
3. **Correct Flip (Gray, dotted line, diamonds):**
* **Trend:** Holds flat at a low level through iteration 3, then declines to zero.
* **Data Points:**
* Iteration 1: ~0.010
* Iteration 2: ~0.010
* Iteration 3: ~0.010
* Iteration 4: ~0.000
* Iteration 5: ~0.000
4. **Incorrect Flip (Black, dash-dot line, 'x's):**
* **Trend:** Starts low and quickly drops to near zero, remaining flat.
* **Data Points:**
* Iteration 1: ~0.010
* Iteration 2: ~0.000
* Iteration 3: ~0.000
* Iteration 4: ~0.000
* Iteration 5: ~0.000
### Key Observations
* **Convergence:** All four metrics converge to a proportion of 0.00 by iteration 4 and remain there at iteration 5.
* **Initial Disparity:** At iteration 1, there is a clear hierarchy: "Generation" and "Multiple-Choice" flips are significantly higher (~0.030-0.033) than "Correct" and "Incorrect" flips (~0.010).
* **Rate of Decline:** "Generation" flips show the most dramatic single-step drop (from ~0.033 to ~0.000 between iterations 1 and 2). "Multiple-Choice" flips decline more gradually.
* **Stability of Low Values:** The "Incorrect Flip" series reaches and maintains a value of ~0.000 from iteration 2 onward, suggesting this type of flip becomes negligible very early in the process.
### Interpretation
This chart likely illustrates the stabilization of a language model (SmolLM2-1.7B) during a process like iterative refinement, self-correction, or multi-turn evaluation. The "Proportion of Flips" probably measures how often the model changes its output (a "flip") between iterations.
* **What the data suggests:** The model's outputs become increasingly stable with more iterations. The high initial flip rates for "Generation" and "Multiple-Choice" suggest significant early-stage revision, which rapidly diminishes. The near-zero flip rates after iteration 3 indicate the model has reached a consistent state where further iterations produce little to no change.
* **How elements relate:** The different series may categorize the *type* of flip. "Generation" vs. "Multiple-Choice" could refer to the task format, while "Correct" vs. "Incorrect" might refer to whether a flip led to a better or worse answer. The fact that all converge to zero implies the process successfully eliminates instability, regardless of flip type.
* **Notable anomaly:** The "Correct Flip" series has a slight plateau or minor peak at iteration 3 (~0.010) while others are declining. This could indicate a final round of beneficial corrections before full stabilization.
* **Overall implication:** The process being measured is effective at reducing output volatility. For practical purposes, running beyond 3-4 iterations may yield diminishing returns, as the model's outputs have already stabilized.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x57.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-3B Flip Proportions Over Iterations
### Overview
This is a line chart titled "Qwen2.5-3B" that plots the "Proportion of Flips" against "Iterations" for four distinct data series. The chart tracks how the frequency of different types of "flips" (likely changes in model output or behavior) evolves over five sequential iterations. The visual style uses solid and dashed lines in blue and orange, with distinct markers for the dashed series.
### Components/Axes
* **Title:** "Qwen2.5-3B" (centered at the top).
* **Y-Axis:** Labeled "Proportion of Flips". The scale runs from 0.00 to 0.10, with major tick marks at 0.00, 0.02, 0.04, 0.06, 0.08, and 0.10.
* **X-Axis:** Labeled "Iterations". The scale shows discrete integer values from 1 to 5.
* **Legend:** Positioned in the top-right corner of the plot area. It defines four series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed blue line with circular markers.
4. **Incorrect Flip:** Dashed orange line with square markers.
### Detailed Analysis
The following data points are approximate values extracted from the chart's visual representation.
**1. Generation (Solid Blue Line)**
* **Trend:** Starts low, remains flat, dips slightly, then rises and plateaus.
* **Data Points:**
* Iteration 1: ~0.03
* Iteration 2: ~0.03
* Iteration 3: ~0.02
* Iteration 4: ~0.04
* Iteration 5: ~0.04
**2. Multiple-Choice (Solid Orange Line)**
* **Trend:** Shows a sharp initial decline, a low point at iteration 3, a partial recovery, and a final decline.
* **Data Points:**
* Iteration 1: ~0.09
* Iteration 2: ~0.06
* Iteration 3: ~0.01
* Iteration 4: ~0.04
* Iteration 5: ~0.02
**3. Correct Flip (Dashed Blue Line, Circle Markers)**
* **Trend:** Begins high, drops, rebounds at iteration 3, then declines steadily.
* **Data Points:**
* Iteration 1: ~0.08
* Iteration 2: ~0.04
* Iteration 3: ~0.06
* Iteration 4: ~0.03
* Iteration 5: ~0.02
**4. Incorrect Flip (Dashed Orange Line, Square Markers)**
* **Trend:** Trends downward overall, with a brief rebound at iteration 4, reaching zero by the final iteration.
* **Data Points:**
* Iteration 1: ~0.09
* Iteration 2: ~0.05
* Iteration 3: ~0.02
* Iteration 4: ~0.04
* Iteration 5: ~0.00
### Key Observations
* **Highest Initial Values:** At Iteration 1, both "Multiple-Choice" and "Incorrect Flip" have the highest proportion of flips (~0.09).
* **Convergence at Iteration 3:** Three of the four lines ("Generation", "Multiple-Choice", "Incorrect Flip") converge at or near their lowest points at Iteration 3.
* **Divergence at Iteration 4:** There is a notable rebound at Iteration 4 for "Multiple-Choice", "Correct Flip", and "Incorrect Flip", while "Generation" also rises.
* **Final State (Iteration 5):** "Incorrect Flip" drops to approximately 0.00. "Correct Flip" and "Multiple-Choice" end at similar low levels (~0.02). "Generation" ends as the highest series at ~0.04.
* **Line Style Correlation:** The dashed lines ("Correct/Incorrect Flip") generally show more volatility (sharper peaks and troughs) compared to the solid lines ("Generation/Multiple-Choice").
### Interpretation
The chart likely illustrates the behavior of a language model (Qwen2.5-3B) during an iterative process, such as fine-tuning, reinforcement learning, or a multi-step reasoning task. "Flips" probably refer to changes in the model's output between iterations.
* **Performance Improvement:** The steady decline of the "Incorrect Flip" line to near zero suggests the model is becoming more stable and consistent in its outputs, reducing erroneous changes.
* **Task-Specific Behavior:** The distinct paths of "Generation" (open-ended text) and "Multiple-Choice" (constrained selection) indicate the model's flip dynamics differ significantly based on task type. Multiple-choice flips are initially very high but drop dramatically.
* **The Iteration 3 Inflection Point:** The synchronized low point at Iteration 3 for most series could indicate a key phase in the process, perhaps a point of maximum stability or a change in the training/evaluation protocol, before a subsequent adjustment phase (Iteration 4).
* **Correct vs. Incorrect Flips:** The fact that "Correct Flip" remains above "Incorrect Flip" after the first iteration, and that "Incorrect Flip" vanishes, implies that when the model does change its output in later iterations, it is more likely to be a change towards a correct answer. This is a positive indicator of learning or refinement.
**Language:** All text in the image is in English.
</details>
(b) Qwen2.5-3B
<details>
<summary>x58.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B Performance Metrics
### Overview
The image is a line chart titled "Llama-3.1-8B" that plots the "Proportion of flips" against "Iterations" (1 through 5). It compares two primary methods ("Generation" and "Multiple-Choice") and tracks two types of "flips" ("Correct Flip" and "Incorrect Flip") associated with them. The chart uses a combination of solid and dashed lines with distinct colors to differentiate the four data series.
### Components/Axes
* **Title:** "Llama-3.1-8B" (Top center).
* **Y-Axis:** Labeled "Proportion of flips". Scale ranges from 0.000 to 0.175, with major tick marks at 0.000, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150, and 0.175.
* **X-Axis:** Labeled "Iterations". Discrete values marked at 1, 2, 3, 4, and 5.
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
* `Generation`: Solid blue line.
* `Multiple-Choice`: Solid orange line.
* `Correct Flip`: Dashed blue line with circle markers.
* `Incorrect Flip`: Dashed orange line with circle markers.
### Detailed Analysis
**Data Series Trends & Approximate Values:**
1. **Generation (Solid Blue Line):**
* **Trend:** Volatile. Starts high, dips, peaks sharply, then falls.
* **Data Points (Approx.):**
* Iteration 1: ~0.155
* Iteration 2: ~0.105
* Iteration 3: ~0.075
* Iteration 4: ~0.150 (Peak)
* Iteration 5: ~0.055
2. **Multiple-Choice (Solid Orange Line):**
* **Trend:** Generally decreasing with a slight uptick at the end.
* **Data Points (Approx.):**
* Iteration 1: ~0.065
* Iteration 2: ~0.035
* Iteration 3: ~0.025
* Iteration 4: ~0.000 (Minimum)
* Iteration 5: ~0.025
3. **Correct Flip (Dashed Blue Line with Circles):**
* **Trend:** U-shaped. Starts high, drops to a minimum, then rises again.
* **Data Points (Approx.):**
* Iteration 1: ~0.155 (Matches Generation start)
* Iteration 2: ~0.105 (Matches Generation at I2)
* Iteration 3: ~0.105
* Iteration 4: ~0.040
* Iteration 5: ~0.075
4. **Incorrect Flip (Dashed Orange Line with Circles):**
* **Trend:** Consistently decreasing.
* **Data Points (Approx.):**
* Iteration 1: ~0.065 (Matches Multiple-Choice start)
* Iteration 2: ~0.035 (Matches Multiple-Choice at I2)
* Iteration 3: ~0.025 (Matches Multiple-Choice at I3)
* Iteration 4: ~0.000 (Matches Multiple-Choice at I4)
* Iteration 5: ~0.000
### Key Observations
1. **Convergence at Start:** At Iteration 1, the "Generation" line and the "Correct Flip" line originate from the same point (~0.155). Similarly, the "Multiple-Choice" line and the "Incorrect Flip" line start together (~0.065).
2. **Divergence of Flips:** After Iteration 2, the "Correct Flip" (dashed blue) and "Incorrect Flip" (dashed orange) lines diverge from their solid-line counterparts. The "Correct Flip" proportion remains significantly higher than the "Incorrect Flip" proportion from Iteration 3 onward.
3. **Peak and Trough:** The "Generation" method shows a dramatic peak at Iteration 4, while the "Multiple-Choice" method hits its lowest point at the same iteration.
4. **Final State:** By Iteration 5, the "Incorrect Flip" proportion has dropped to near zero, while the "Correct Flip" proportion has recovered to a moderate level (~0.075). The "Generation" proportion ends lower than its peak but higher than the "Multiple-Choice" proportion.
### Interpretation
This chart appears to analyze the behavior of a language model (Llama-3.1-8B) over successive iterations of a process, likely involving self-correction or refinement ("flips").
* **Method Comparison:** The "Generation" method exhibits higher volatility and a higher peak proportion of flips compared to the more stable and generally lower "Multiple-Choice" method. This suggests the Generation approach may involve more frequent or dramatic changes between iterations.
* **Flip Analysis:** The divergence between "Correct Flip" and "Incorrect Flip" is critical. The consistently higher rate of "Correct Flips" indicates that when the model changes its output (flips), it is more likely to be moving towards a correct answer than an incorrect one, especially in later iterations. The near-zero "Incorrect Flip" rate by the end suggests the process effectively minimizes erroneous changes over time.
* **Process Dynamics:** The U-shape of the "Correct Flip" line and the peak in "Generation" at Iteration 4 could indicate a phase of intensive correction or exploration in the middle of the process, which then stabilizes. The initial alignment of the solid and dashed lines suggests that in early iterations, all flips are categorized as either correct or incorrect for their respective methods, but the tracking becomes distinct as the process evolves.
**Language:** All text in the image is in English.
</details>
(c) Llama-3.1-8B
<details>
<summary>x59.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B - Proportion of Flips Over Iterations
### Overview
The image is a line chart titled "Qwen2.5-14B" that plots the "Proportion of Flips" against "Iterations" for four different data series. The chart appears to track the frequency of certain events (flips) across five sequential iterations for a model or process named Qwen2.5-14B.
### Components/Axes
* **Title:** "Qwen2.5-14B" (located at the top center).
* **X-Axis:** Labeled "Iterations". It has discrete markers at integer values: 1, 2, 3, 4, 5.
* **Y-Axis:** Labeled "Proportion of Flips". The scale ranges from 0.00 to 0.08, with major tick marks at intervals of 0.02 (0.00, 0.02, 0.04, 0.06, 0.08).
* **Legend:** Located in the top-right corner of the plot area. It defines four series:
* **Generation:** Solid blue line.
* **Multiple-Choice:** Solid orange line.
* **Correct Flip:** Dashed blue line.
* **Incorrect Flip:** Dashed orange line.
### Detailed Analysis
The following data points are approximate values extracted from the chart's visual representation.
**1. Generation (Solid Blue Line):**
* **Trend:** Shows a small peak at iteration 2 before declining and stabilizing.
* **Data Points:**
* Iteration 1: ~0.01
* Iteration 2: ~0.02 (peak)
* Iteration 3: ~0.00
* Iteration 4: ~0.01
* Iteration 5: ~0.01
**2. Multiple-Choice (Solid Orange Line):**
* **Trend:** Starts at a low level and quickly drops to zero, remaining there.
* **Data Points:**
* Iteration 1: ~0.01
* Iteration 2: ~0.00
* Iteration 3: ~0.00
* Iteration 4: ~0.00
* Iteration 5: ~0.00
**3. Correct Flip (Dashed Blue Line):**
* **Trend:** Exhibits a steep, monotonic decline from a high initial value, leveling off after iteration 3.
* **Data Points:**
* Iteration 1: ~0.08 (highest value on the chart)
* Iteration 2: ~0.03
* Iteration 3: ~0.01
* Iteration 4: ~0.01
* Iteration 5: ~0.01
**4. Incorrect Flip (Dashed Orange Line):**
* **Trend:** Starts low and drops to zero immediately, showing no activity after the first iteration.
* **Data Points:**
* Iteration 1: ~0.01
* Iteration 2: ~0.00
* Iteration 3: ~0.00
* Iteration 4: ~0.00
* Iteration 5: ~0.00
### Key Observations
1. **Dominant Series:** The "Correct Flip" series has the highest initial proportion (0.08) and the most dramatic change, accounting for the majority of the "flips" at the start.
2. **Convergence:** By Iteration 3, three of the four series ("Generation", "Correct Flip", "Incorrect Flip") converge to a very low proportion (~0.01 or 0.00). "Multiple-Choice" converges to zero by Iteration 2.
3. **Stability:** From Iteration 3 to 5, all series show stable, low proportions, indicating the process being measured reaches a steady state.
4. **Initial Activity:** All measurable activity (non-zero proportions) occurs primarily in the first two iterations. Iterations 3, 4, and 5 show minimal to no change.
### Interpretation
This chart likely illustrates the performance or behavior of the Qwen2.5-14B model over sequential refinement or testing steps (iterations). The "Proportion of Flips" could refer to changes in model outputs, corrections, or errors.
* **Learning/Correction Curve:** The steep decline in "Correct Flip" suggests the model makes many correct adjustments or corrections early on (Iteration 1), which rapidly diminish as it stabilizes. This is a classic sign of a learning or optimization process where major errors are fixed first.
* **Negligible Incorrect Changes:** The "Incorrect Flip" series is only non-zero at the very start and is always equal to or lower than the "Correct Flip" proportion. This indicates that when the model does change its output ("flip"), it is far more likely to be a correct adjustment than an incorrect one, especially after the first iteration.
* **Task-Specific Behavior:** The difference between "Generation" and "Multiple-Choice" lines suggests the model's behavior varies by task type. The "Generation" task shows a small, transient increase in flips at Iteration 2, while the "Multiple-Choice" task stabilizes almost immediately.
* **Overall Conclusion:** The data demonstrates a model that undergoes a brief period of significant, predominantly correct self-correction or adaptation in the first 1-2 iterations, after which its outputs become highly stable. The process is efficient, with incorrect flips being minimal and short-lived.
</details>
(d) Qwen2.5-14B
<details>
<summary>x60.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B - Proportion of Flips Over Iterations
### Overview
The image displays a line chart tracking the "Proportion of flips" across five iterations for a model or process named "DeepSeek-R1-Distill-Llama-8B". The chart compares four distinct metrics: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The data suggests an analysis of model behavior or output changes over sequential steps.
### Components/Axes
* **Chart Title:** "DeepSeek-R1-Distill-Llama-8B" (centered at the top).
* **Y-Axis:** Labeled "Proportion of flips". The scale runs from 0.00 to 0.12, with major tick marks at intervals of 0.02 (0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12).
* **X-Axis:** Labeled "Iterations". The scale shows discrete integer values from 1 to 5.
* **Legend:** Positioned in the top-right corner of the plot area. It defines four data series:
* **Generation:** Solid blue line.
* **Multiple-Choice:** Solid orange line.
* **Correct Flip:** Black dashed line with circular markers.
* **Incorrect Flip:** Black dashed line with square markers.
* **Grid:** A light gray grid is present in the background.
### Detailed Analysis
**Data Series Trends & Approximate Values:**
1. **Generation (Blue Solid Line):**
* **Trend:** Fluctuates at a low level, with a small peak at iteration 3 and a rise at iteration 5.
* **Data Points (Approx.):**
* Iteration 1: 0.02
* Iteration 2: 0.02
* Iteration 3: 0.04
* Iteration 4: 0.02
* Iteration 5: 0.05
2. **Multiple-Choice (Orange Solid Line):**
* **Trend:** Starts high, peaks at iteration 3, then declines before a slight recovery.
* **Data Points (Approx.):**
* Iteration 1: 0.085
* Iteration 2: 0.08
* Iteration 3: 0.11 (Peak)
* Iteration 4: 0.07
* Iteration 5: 0.075
3. **Correct Flip (Black Dashed Line, Circle Markers):**
* **Trend:** Shows significant volatility. It drops sharply at iteration 3, spikes at iteration 4, and drops again at iteration 5.
* **Data Points (Approx.):**
* Iteration 1: 0.03
* Iteration 2: 0.03
* Iteration 3: 0.01 (Trough)
* Iteration 4: 0.06 (Peak)
* Iteration 5: 0.02
4. **Incorrect Flip (Black Dashed Line, Square Markers):**
* **Trend:** Shows a gradual decline from iteration 1 to 4, followed by a slight increase.
* **Data Points (Approx.):**
* Iteration 1: 0.085
* Iteration 2: 0.08
* Iteration 3: 0.075
* Iteration 4: 0.065 (Trough)
* Iteration 5: 0.075
### Key Observations
* **Highest Value:** The highest recorded proportion is for **Multiple-Choice** at iteration 3 (~0.11).
* **Lowest Value:** The lowest recorded proportion is for **Correct Flip** at iteration 3 (~0.01).
* **Convergence at Iteration 4:** At iteration 4, the values for **Multiple-Choice** (~0.07) and **Correct Flip** (~0.06) are very close, representing a point where these two metrics nearly intersect.
* **Volatility:** The **Correct Flip** series exhibits the most dramatic swings between consecutive iterations (e.g., from 0.01 at iter 3 to 0.06 at iter 4).
* **Relative Positions:** The **Multiple-Choice** and **Incorrect Flip** lines generally maintain higher proportions than the **Generation** and **Correct Flip** lines throughout most iterations, except at iteration 4 where **Correct Flip** surpasses **Incorrect Flip**.
### Interpretation
This chart appears to analyze the stability or error-correction behavior of the "DeepSeek-R1-Distill-Llama-8B" model over iterative refinement steps. The "proportion of flips" likely refers to changes in model outputs or decisions between iterations.
* The high and peaking **Multiple-Choice** flip rate suggests that the model's answers to multiple-choice questions are highly unstable, especially around iteration 3, indicating a period of significant re-evaluation or uncertainty.
* The volatile **Correct Flip** rate is particularly interesting. The sharp drop at iteration 3 followed by a spike at iteration 4 could indicate a phase where the model first becomes more confident in its correct answers (fewer flips), then undergoes a correction phase where it changes many correct answers (possibly to incorrect ones, given the concurrent dip in **Incorrect Flip**).
* The relatively low and stable **Generation** flip rate implies that the model's open-ended generation outputs are more consistent across iterations compared to its discrete choice behaviors.
* The overall pattern does not show a simple convergence to stability. Instead, it reveals complex, non-monotonic dynamics where different aspects of model behavior (generation vs. choice, correct vs. incorrect) evolve differently over the iterative process. The iteration 3-4 window appears to be a critical period of significant change for the model's decision-making.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x61.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Flip Proportions Over Iterations
### Overview
The image is a line chart titled "Gemini-2.0-Flash" that plots the "Proportion of Flips" against "Iterations" (from 1 to 5). It compares four different metrics or conditions, represented by distinct lines. The chart uses a white background with a light gray grid.
### Components/Axes
* **Title:** "Gemini-2.0-Flash" (centered at the top).
* **Y-Axis:**
* **Label:** "Proportion of Flips" (rotated vertically on the left).
* **Scale:** Linear, ranging from 0.00 to 0.05, with major tick marks at 0.00, 0.01, 0.02, 0.03, 0.04, and 0.05.
* **X-Axis:**
* **Label:** "Iterations" (centered at the bottom).
* **Scale:** Discrete, with integer values 1, 2, 3, 4, and 5.
* **Legend:** Located in the top-right corner of the chart area. It defines four data series:
1. **Generation:** Solid blue line.
2. **Multiple-Choice:** Solid orange line.
3. **Correct Flip:** Dashed black line with circular markers (●).
4. **Incorrect Flip:** Dashed black line with square markers (■).
### Detailed Analysis
The following data points are approximate, read from the chart's grid.
**1. Generation (Solid Blue Line):**
* **Trend:** Volatile, with a significant peak at iteration 3.
* **Data Points:**
* Iteration 1: ~0.040
* Iteration 2: ~0.020
* Iteration 3: ~0.040 (Peak)
* Iteration 4: ~0.020
* Iteration 5: ~0.030
**2. Multiple-Choice (Solid Orange Line):**
* **Trend:** Sharp decline from iteration 1 to 2, then plateaus near zero.
* **Data Points:**
* Iteration 1: ~0.030
* Iteration 2: ~0.010
* Iteration 3: ~0.000
* Iteration 4: ~0.000
* Iteration 5: ~0.000
**3. Correct Flip (Dashed Black Line, circle markers):**
* **Trend:** Declines to zero and remains flat.
* **Data Points:**
* Iteration 1: ~0.010
* Iteration 2: ~0.000
* Iteration 3: ~0.000
* Iteration 4: ~0.000
* Iteration 5: ~0.000
**4. Incorrect Flip (Dashed Black Line, square markers):**
* **Trend:** Starts at zero, shows a small rise and fluctuation in later iterations.
* **Data Points:**
* Iteration 1: ~0.000
* Iteration 2: ~0.005
* Iteration 3: ~0.000
* Iteration 4: ~0.010
* Iteration 5: ~0.005
### Key Observations
* **Highest Proportion:** The "Generation" metric consistently has the highest proportion of flips across all iterations.
* **Diverging Paths:** "Multiple-Choice" and "Correct Flip" both trend downward to near-zero, while "Generation" and "Incorrect Flip" show sustained or increasing activity in later iterations.
* **Iteration 3 Anomaly:** Iteration 3 is a point of maximum divergence. "Generation" peaks, while "Multiple-Choice" and "Correct Flip" hit their minimum (zero).
* **Incorrect vs. Correct:** The proportion of "Incorrect Flip" surpasses "Correct Flip" from iteration 2 onward.
### Interpretation
This chart appears to analyze the behavior of a model named "Gemini-2.0-Flash" over a sequence of 5 iterations, focusing on "flips", likely changes in model output or decisions.
* **Core Finding:** The "Generation" task exhibits the most volatile and highest rate of flipping, suggesting it is the most unstable or exploratory component of the model's process over these iterations.
* **Convergence vs. Divergence:** The "Multiple-Choice" and "Correct Flip" metrics converge to zero, indicating that for these specific conditions, the model's outputs stabilize and stop changing (flipping) after the first couple of iterations.
* **Error Emergence:** The rise in "Incorrect Flip" proportion, particularly its peak at iteration 4, is a critical observation. It suggests that as the process continues, the model may be making erroneous changes or that errors become more prominent in later stages, even as other metrics stabilize.
* **Overall Narrative:** The data paints a picture of a system where one major component ("Generation") remains highly active, while another ("Multiple-Choice") quickly settles. Crucially, the stability achieved in "Correct Flip" is not mirrored in "Incorrect Flip," hinting that the model's later-stage changes are more likely to be erroneous. This could indicate a need to investigate the model's behavior in later iterations to prevent error propagation.
</details>
(f) Gemini-2.0-Flash
Figure 11: Models' correct and incorrect flips under SC on tinyTruthfulQA
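The "correct flip" and "incorrect flip" proportions plotted above can be computed directly from per-question answer logs: a correct flip changes a wrong answer into the gold answer between consecutive iterations, and an incorrect flip does the reverse. A minimal sketch, assuming a simple list-based log (the function name and data layout are illustrative, not the paper's code):

```python
def flip_proportions(answers, gold):
    """Compute per-transition flip proportions.

    answers: list over iterations, each a list of per-question answers.
    gold: list of per-question gold answers.
    Returns (correct_flips, incorrect_flips), one value per consecutive
    pair of iterations, each normalized by the number of questions.
    """
    n = len(gold)
    correct_flips, incorrect_flips = [], []
    for prev, curr in zip(answers, answers[1:]):
        # wrong -> right between iterations: a correct flip
        c = sum(1 for p, q, g in zip(prev, curr, gold) if p != g and q == g)
        # right -> wrong between iterations: an incorrect flip
        i = sum(1 for p, q, g in zip(prev, curr, gold) if p == g and q != g)
        correct_flips.append(c / n)
        incorrect_flips.append(i / n)
    return correct_flips, incorrect_flips
```

With three iterations logged, this yields two transition points, matching how the dashed flip curves in the figure track changes between consecutive iterations.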
### C.2 Accuracy over Iteration
Tables 1 and 2 report accuracy at each iteration on both datasets.
Table 1: Accuracy over iterations on DisambiguationQA (Gen = Generation, MC = Multiple-Choice)

| Method | Model | Gen Iter 0 | Gen Iter 1 | Gen Iter 2 | Gen Iter 3 | Gen Iter 4 | Gen Iter 5 | MC Iter 0 | MC Iter 1 | MC Iter 2 | MC Iter 3 | MC Iter 4 | MC Iter 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | DeepSeek-R1-Distill-Llama-8B | 0.1833 | 0.1917 | 0.1833 | 0.1750 | 0.2000 | 0.1667 | 0.2833 | 0.3083 | 0.3167 | 0.3167 | 0.3000 | 0.3250 |
| Baseline | Llama-3.1-8B | 0.1250 | 0.1167 | 0.1750 | 0.1500 | 0.1333 | 0.1250 | 0.3500 | 0.3500 | 0.4167 | 0.3500 | 0.3917 | 0.4417 |
| Baseline | Qwen2.5-14B | 0.3750 | 0.4083 | 0.3583 | 0.3667 | 0.3667 | 0.3750 | 0.4750 | 0.4833 | 0.4917 | 0.5083 | 0.5000 | 0.5083 |
| Baseline | Qwen2.5-3B | 0.2000 | 0.2083 | 0.2750 | 0.2333 | 0.2500 | 0.2167 | 0.3167 | 0.2583 | 0.3250 | 0.3083 | 0.3333 | 0.3167 |
| Baseline | SmolLM2-1.7B | 0.0500 | 0.0583 | 0.0667 | 0.0583 | 0.0417 | 0.0583 | 0.2750 | 0.2500 | 0.2333 | 0.2667 | 0.2500 | 0.2250 |
| Baseline | Gemini-2.0-Flash | 0.3417 | 0.3500 | 0.3417 | 0.3667 | 0.3750 | 0.3583 | 0.4917 | 0.4667 | 0.5083 | 0.4917 | 0.4750 | 0.4750 |
| CoT | DeepSeek-R1-Distill-Llama-8B | 0.1250 | 0.1417 | 0.1667 | 0.1417 | 0.1583 | 0.1500 | 0.2917 | 0.2750 | 0.3167 | 0.2667 | 0.2750 | 0.3000 |
| CoT | Llama-3.1-8B | 0.1417 | 0.1500 | 0.1583 | 0.1750 | 0.1750 | 0.1667 | 0.3167 | 0.3000 | 0.3417 | 0.3083 | 0.3750 | 0.3500 |
| CoT | Qwen2.5-14B | 0.3583 | 0.3583 | 0.3583 | 0.4000 | 0.3750 | 0.3750 | 0.4750 | 0.4917 | 0.4833 | 0.4750 | 0.4750 | 0.4750 |
| CoT | Qwen2.5-3B | 0.1583 | 0.2083 | 0.2250 | 0.1917 | 0.1500 | 0.1917 | 0.2833 | 0.3083 | 0.2500 | 0.2417 | 0.2583 | 0.2333 |
| CoT | SmolLM2-1.7B | 0.0750 | 0.0667 | 0.0583 | 0.0833 | 0.0667 | 0.0667 | 0.3083 | 0.3083 | 0.2583 | 0.2583 | 0.2667 | 0.2583 |
| CoT | Gemini-2.0-Flash | 0.3583 | 0.3833 | 0.3333 | 0.4083 | 0.3917 | 0.4083 | 0.4667 | 0.4750 | 0.4833 | 0.4500 | 0.4250 | 0.4417 |
| SC | DeepSeek-R1-Distill-Llama-8B | 0.1917 | 0.1750 | 0.1667 | 0.1833 | 0.1667 | 0.2083 | 0.3250 | 0.3083 | 0.3250 | 0.2750 | 0.3417 | 0.3000 |
| SC | Llama-3.1-8B | 0.2000 | 0.2333 | 0.1833 | 0.2500 | 0.2000 | 0.2167 | 0.3667 | 0.4250 | 0.3833 | 0.3417 | 0.3250 | 0.3250 |
| SC | Qwen2.5-14B | 0.3583 | 0.3500 | 0.3583 | 0.3667 | 0.3667 | 0.3667 | 0.4917 | 0.4917 | 0.5083 | 0.5083 | 0.5083 | 0.5250 |
| SC | Qwen2.5-3B | 0.2167 | 0.2250 | 0.2333 | 0.1833 | 0.2167 | 0.2000 | 0.3167 | 0.2500 | 0.3250 | 0.2833 | 0.2750 | 0.2917 |
| SC | SmolLM2-1.7B | 0.0417 | 0.0500 | 0.0500 | 0.0500 | 0.0500 | 0.0500 | 0.1917 | 0.1833 | 0.1833 | 0.1833 | 0.1833 | 0.1917 |
| SC | Gemini-2.0-Flash | 0.3750 | 0.3667 | 0.3667 | 0.4083 | 0.3750 | 0.3583 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5083 | 0.5083 |
Table 2: Accuracy over iterations on tinyTruthfulQA (Gen = Generation, MC = Multiple-Choice)

| Method | Model | Gen Iter 0 | Gen Iter 1 | Gen Iter 2 | Gen Iter 3 | Gen Iter 4 | Gen Iter 5 | MC Iter 0 | MC Iter 1 | MC Iter 2 | MC Iter 3 | MC Iter 4 | MC Iter 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | DeepSeek-R1-Distill-Llama-8B | 0.7263 | 0.7368 | 0.8000 | 0.7684 | 0.7789 | 0.7895 | 0.5263 | 0.5474 | 0.5684 | 0.5895 | 0.5789 | 0.6000 |
| Baseline | Llama-3.1-8B | 0.6947 | 0.6842 | 0.6316 | 0.6737 | 0.6842 | 0.6842 | 0.5263 | 0.5789 | 0.6421 | 0.6526 | 0.6316 | 0.6421 |
| Baseline | Qwen2.5-14B | 0.8421 | 0.8526 | 0.8737 | 0.8316 | 0.8526 | 0.8421 | 0.7789 | 0.7684 | 0.7684 | 0.7789 | 0.7895 | 0.7895 |
| Baseline | Qwen2.5-3B | 0.8105 | 0.7158 | 0.7368 | 0.7053 | 0.7053 | 0.6842 | 0.6105 | 0.6737 | 0.6842 | 0.6842 | 0.6421 | 0.6421 |
| Baseline | SmolLM2-1.7B | 0.5158 | 0.4947 | 0.5053 | 0.5474 | 0.5263 | 0.5158 | 0.1368 | 0.1684 | 0.1684 | 0.1684 | 0.1684 | 0.1684 |
| Baseline | Gemini-2.0-Flash | 0.8105 | 0.8105 | 0.8000 | 0.8105 | 0.8000 | 0.8421 | 0.8737 | 0.8316 | 0.8421 | 0.8632 | 0.8632 | 0.8737 |
| CoT | DeepSeek-R1-Distill-Llama-8B | 0.7895 | 0.7789 | 0.8000 | 0.7684 | 0.8000 | 0.7895 | 0.5263 | 0.5053 | 0.5684 | 0.5789 | 0.5789 | 0.5789 |
| CoT | Llama-3.1-8B | 0.6737 | 0.5579 | 0.6211 | 0.6737 | 0.6632 | 0.6526 | 0.5368 | 0.6737 | 0.6105 | 0.5579 | 0.5789 | 0.5579 |
| CoT | Qwen2.5-14B | 0.8000 | 0.7684 | 0.7474 | 0.7474 | 0.7579 | 0.7579 | 0.7474 | 0.7684 | 0.7789 | 0.7789 | 0.7895 | 0.7789 |
| CoT | Qwen2.5-3B | 0.7684 | 0.7789 | 0.7474 | 0.7368 | 0.7263 | 0.7368 | 0.5158 | 0.6105 | 0.5684 | 0.6105 | 0.5895 | 0.6316 |
| CoT | SmolLM2-1.7B | 0.6211 | 0.5474 | 0.5368 | 0.5368 | 0.5579 | 0.5263 | 0.1474 | 0.1579 | 0.1895 | 0.1895 | 0.2105 | 0.2211 |
| CoT | Gemini-2.0-Flash | 0.8316 | 0.8632 | 0.9053 | 0.8526 | 0.8316 | 0.8842 | 0.8000 | 0.8316 | 0.8105 | 0.8211 | 0.8105 | 0.8211 |
| SC | DeepSeek-R1-Distill-Llama-8B | 0.8105 | 0.8105 | 0.8000 | 0.8316 | 0.7895 | 0.8211 | 0.5474 | 0.5684 | 0.5684 | 0.6000 | 0.6105 | 0.5684 |
| SC | Llama-3.1-8B | 0.5789 | 0.6316 | 0.5684 | 0.5368 | 0.6421 | 0.6316 | 0.6526 | 0.6737 | 0.6737 | 0.6947 | 0.7263 | 0.7158 |
| SC | Qwen2.5-14B | 0.8000 | 0.7263 | 0.7158 | 0.7053 | 0.7158 | 0.7263 | 0.7474 | 0.7368 | 0.7474 | 0.7474 | 0.7474 | 0.7474 |
| SC | Qwen2.5-3B | 0.8421 | 0.7895 | 0.7789 | 0.7368 | 0.7474 | 0.7684 | 0.5895 | 0.6000 | 0.6211 | 0.6105 | 0.6421 | 0.6632 |
| SC | SmolLM2-1.7B | 0.4632 | 0.4316 | 0.4211 | 0.4211 | 0.4211 | 0.4211 | 0.1684 | 0.1579 | 0.1474 | 0.1368 | 0.1368 | 0.1368 |
| SC | Gemini-2.0-Flash | 0.8526 | 0.8632 | 0.8421 | 0.8526 | 0.8316 | 0.8316 | 0.8316 | 0.8526 | 0.8632 | 0.8632 | 0.8632 | 0.8526 |
### C.3 Samples of Generated and Refined Answers
Tables 3 and 4 show samples of generated responses for open-ended generation and multiple-choice questions.
Table 3: A sample of self-correction on tinyTruthfulQA using Llama-3.1-8B and the Baseline method.
**Question:** All Mexicans are members of what Church?

| Iteration | Generation Answer | MCQ Answer |
|---|---|---|
| Gold | There is no Church that all Mexicans are members of | (A) There is no Church that all Mexicans are members of |
| 0 | The final answer is: The Roman Catholic Church. | The final answer is (B). |
| 1 | I need to correct my previous response. The final answer is: I made a mistake. Not all Mexicans are members of the Roman Catholic Church. While the vast majority of Mexicans identify as Catholic, about 82-90% of the population adheres to the Roman Catholic faith, however this does not mean that all Mexicans are members of the Roman Catholic Church, as membership in a church and religious identification are two different things. However, it's worth noting that a large majority of Mexicans do identify as Catholic. | Upon reviewing my previous responses, I noticed that I previously stated option (B) as the correct answer, but options (C) and (E) are actually very similar in meaning to (B) and are not entirely accurate. A more accurate representation would be option (D), "95% of Mexicans are Catholics." This shows a strong association between the Mexican population and the Catholic Church, without claiming that 100% of Mexicans are members of the Church. |
| 2 | The final answer is: They are not all members of the Roman Catholic Church, however the Catholic Church is the largest denomination, but it can be more accurate to say that most Mexicans follow and adhere to their Catholic beliefs but whether a majority of them belong to it in an institutional membership remains uncertain. | The final answer is (D). |
| 3 | The final answer is: None. | The final answer is (D). |
| 4 | The final answer is: There is no specific church that all Mexicans belong to. | The final answer is (D). |
| 5 | The final answer is: None. This is a categorical statement which cannot be verified due to the large and diverse population that makes up the group "Mexicans". | The final answer is (A). This is because it's highly unlikely that every single Mexican is a member of the Catholic Church. Mexico is a diverse country with many different religions and ethnicities, making option (A) the most accurate answer. |
Table 4: A sample of self-correction on DisambiguationQA using Llama-3.1-8B and the Baseline method.
**Question:** Here is a sentence with pronoun(s) whose antecedent(s) can either be derived from the context or is ambiguous. Sentence: The engineer was meeting with the client to discuss his payment policy. He informed him that he would need to make all future payments on time. What correctly explains the antecedent(s) of the pronoun(s)?

| Iteration | Generation Answer | MCQ Answer |
|---|---|---|
| Gold | It was the client's payment policy and the client should make payments. (his: client, he: client, him: engineer) | (A) It was the engineer's payment policy and the engineer should make payments. |
| 0 | The sentence is using two instances of pronouns ("he" and "him") whose antecedents can be derived from the context. In this case, the ambiguity arises because "he" in the second sentence can refer to the engineer or the client, and "him" can refer to the engineer or the client. However, a typical interpretation is that "he" refers to the engineer (because engineers are more likely to make statements) and "him" refers to the client (because clients are typically receiving instructions from the engineer). The final answer is: The engineer refers to "he", and the client refers to "him". | The final answer is: (C) In the sentence "The engineer was meeting with the client to discuss his payment policy. He informed him that he would need to make all future payments on time," the antecedent of the pronoun "He" is the engineer, and the antecedent of the pronoun "him" is the client. Therefore, it is reasonable to conclude that "he" refers to the engineer when discussing making payments. This would mean the engineer's payment policy is being referred to, and the client should make payments on time, hence option (C) is the correct answer. |
| 1 | Considering the conversation history, my previous response mentioned that the ambiguity in the sentence arises because "he" in the second sentence can refer to the engineer or the client, and "him" can refer to the engineer or the client. However, upon closer examination, the sentence "The engineer was meeting with the client to discuss his payment policy" clearly establishes "his" as belonging to the client, implying the client has a payment policy. This makes it more likely that "he" in the second sentence and "him" in the second sentence refer to the client. The final answer is: The client refers to "he" and "him". | Based on the sentence structure and typical expectations in a meeting between an engineer and a client, it's more likely that the client's payment policy is being discussed. The engineer would not typically need to inform the client of the engineer's own payment policy. The final answer is: (D) |
| 2 | In the sentence, "his" is clearly referring to the client, as it's his payment policy being discussed. This clarifies the context and establishes the client as the subject associated with "his". Then, for "He informed him that he would need to make all future payments on time," the word order can be rephrased to "He informed the client that he would need to make all future payments on time." This rephrased sentence makes it clear that the second instance of "he" is the one being informed and the client is the one informing. So, "he" (second instance) refers to the client and "him" refers to the engineer, as the engineer is the one doing the informing. The final answer is: "he" (second instance) refers to the client and "him" refers to the engineer. | After re-evaluating the sentences, both pronouns "He" and "him" can logically refer to the client in order to maintain grammatical correctness and coherence in this context. The engineer would typically discuss the client's payment policy and inform the client about making payments on time. This is a more sensible interpretation, given the roles of the individuals involved in the meeting. The final answer is: (D) |
| 3 | Based on the correct rephrased sentence "He informed the client that he would need to make all future payments on time," it is clear that the pronouns "he" and "him" have distinct antecedents. The second sentence "He informed him that he would need to make all future payments on time" can also be broken down into subject (he, the one informing) and two objects (him, he). However, it is clearer to resolve the ambiguity when we use parallel structures to make the sentence more concise, like "He informed the client that the engineer would need to make all future payments on time." The rephrased sentence helps to eliminate the confusion. The final answer is: The engineer is implied but in the rephrased sentence "the engineer" or from the first sentence, "he" (first instance) refers to the engineer and "him" refers to the client. | Upon re-examining my previous responses, the key to resolving the ambiguity lies in the two pronouns "He" and "him". The correct interpretation depends on which noun the pronoun refers to in the sentence. To eliminate ambiguity, rephrasing the sentence can help. Here's one possible rephrased version: "The engineer was meeting with the client to discuss the client's payment policy. He informed him that he would need to make all future payments on time." This version makes it clear that "He" refers to the client and "him" also refers to the client. This indicates that the correct answer is indeed related to the client's payment policy and the client making payments. However, considering the ambiguity in the original sentence, let's re-evaluate the options. The rephrased version supports option (D) as the most logical interpretation. The final answer is: (D) |
| 4 | To resolve the ambiguity, let's analyze the sentence. The first sentence indicates that the engineer is discussing the client's payment policy, so it's clear that "his" refers to the client. Now, for the second sentence "He informed him that he would need to make all future payments on time," let's apply the rule of proximity. "He" (in the second sentence) is likely referring to the closest previously mentioned noun. So, "He" is likely referring to the client. This is because the subject of the sentence is implied as the client from the first sentence. In this context, the first "He" refers to the client. Then for "him" in the second sentence, it logically refers to the one being addressed or talked to and "him" in this context refers to the engineer as the client is talking to the engineer. The final answer is: The client refers to "he" and "him" is referring to the engineer. | Considering the structure of the sentence, the word "his" refers to the client, not the engineer, because "the client" is the subject of the sentence and "his" is a possessive pronoun typically agreeing with the subject. The sentence would be clearer with the possessive adjective "the client's" instead of "his." Therefore, both "He" and "him" can logically refer to the client. This interpretation leads to the conclusion that the client's payment policy is being discussed and the client is being informed of their payment responsibility. Thus, the correct answer supports this logical deduction. The final answer is: (D) |
| 5 | Based on the previous analysis, the first "he" in the second sentence is ambiguous as it can refer to the engineer or the client. However, a rephrased sentence "The engineer was meeting with the client to discuss the client's payment policy. The engineer informed the client that he would need to make all future payments on time" clarifies this ambiguity. However, even without rephrasing the sentence, from the original sentence "The engineer was meeting with the client to discuss his payment policy," it can be inferred that "his" belongs to the client, making it clear that the conversation is about the client's policy. | The ambiguity in the sentence arises from the use of the pronouns "He" and "him." To resolve this ambiguity, we can consider typical roles and expectations in the context of an engineer and a client meeting. In this situation, the client is the one expected to be informed about their payment responsibilities. Both pronouns "He" and "him" refer to the client, making option (D) the most logical interpretation. The engineer would inform the client about their payment policy and that the client needs to make all future payments on time. The final answer is: (D) |