# Self-Correcting Large Language Models: Generation vs. Multiple Choice
> Work was done prior to joining Amazon.
## Abstract
Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement, often referred to as self-consistency or self-reflection. However, the dynamics of this self-correction mechanism may differ substantially depending on whether the model is tasked with open-ended text generation or with selecting the most appropriate response from multiple predefined options. In this paper, we conduct a systematic investigation of these two paradigms by comparing performance trends and error-correction behaviors across various natural language understanding and reasoning tasks, covering language models of different scales and families. Our experimental results reveal distinct patterns of improvement and failure modes: while open-ended generation often benefits from the flexibility of re-interpretation and compositional refinement, multiple-choice selection can leverage clearer solution boundaries but may be limited by the provided options. This contrast also reflects the dual demands faced by emerging agentic LLM applications: effective agents must not only generate and refine open-ended plans or explanations, but also make reliable discrete choices when operating within constrained action spaces. Our findings therefore highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space, with implications for both knowledge-intensive reasoning and decision-oriented applications of LLMs. Code and experiments are available at https://github.com/rahmanidashti/llm-self-correction
Hossein A. Rahmani$^{\dagger}$, Satyapriya Krishna$^{\ddagger}$ (work done prior to joining Amazon), Xi Wang$^{\nabla}$, Mohammadmehdi Naghiaei$^{\diamondsuit}$, Emine Yilmaz$^{\dagger}$
$^{\dagger}$University College London, $^{\ddagger}$Amazon AGI, $^{\nabla}$University of Sheffield, $^{\diamondsuit}$University of Southern California
{hossein.rahmani.22, emine.yilmaz}@ucl.ac.uk, skrishna@g.harvard.edu, xi.wang@sheffield.ac.uk, naghiaei@usc.edu
## 1 Introduction
Recent advances in Large Language Models (LLMs) have illustrated that iterative self-correction, where a model re-examines and revises its output under a self-reflection framework, can lead to significant performance gains across a variety of tasks (Madaan et al., 2023; Cook et al., 2024; Shinn et al., 2023; Gou et al., 2024, inter alia). This emergent ability is often attributed to the models' capacity to integrate chain-of-thought reasoning (Kamoi et al., 2024; Chang et al., 2024; Wei et al., 2022), prompting them to refine their own outputs much as a human proofreader or mentor would. In terms of evaluation, existing studies on self-correction have generally focused on free-form text generation (Huang et al., 2023; Madaan et al., 2023; Zelikman et al., 2022; Ma et al., 2025; Kumar et al., 2025; Krishna et al., 2024, inter alia), such as dialogue response, code optimization, and acronym generation. These tasks align naturally with the next-token-prediction objective under which language models are optimized.
However, as LLM applications expand, evaluation restricted to free-form generation offers an incomplete picture. For instance, NVIDIA advocates the deployment of smaller language models in agentic systems for tasks such as API calls and orchestration with external tools, motivated by sustainability and efficiency considerations (Belcak et al., 2025). This highlights the need to examine self-correction beyond open-ended generation. In this study, we categorize natural language modeling tasks into two broad paradigms: free-form text generation and multiple-choice prediction. The former treats modeling as unconstrained sequence generation over the full vocabulary, while the latter frames it as classification over a fixed set of candidate answers. These paradigms are complementary: multiple-choice tasks test precise discrimination under constraints, whereas free-form tasks assess expressive generation, and together they capture the main modes of LLM use in applications such as question answering, reasoning, and open-ended dialogue.
In this paper, we investigate how self-correction unfolds when comparing open-ended generation against multiple-choice question scenarios. We hypothesize that while open-ended generation may benefit from enhanced flexibility and creativity, it also faces a larger search space and the risk of compounding errors. By contrast, multiple-choice selection operates in a constrained space, which can reduce semantic drift yet limit creative corrections. Our study explores how these respective factors interact with iterative refinement, shedding light on whether self-correction aligns more naturally with unconstrained or constrained output spaces.
To address these questions, we conduct comprehensive experiments on two distinct datasets that differ in nature, one focusing on knowledge-intensive question answering and the other on reasoning-oriented problems. We perform iterative inference, giving the model multiple opportunities to reevaluate and revise. By comparing error rates, consistency across iterations, and eventual convergence in each paradigm, we expose nuanced trade-offs in how LLMs adapt to different output constraints under a self-correction regime. Our results provide practical insights for the design and deployment of LLM-based systems, highlighting opportunities to better harness or tailor self-correction behaviors for diverse application settings. Furthermore, we discuss how our findings inform the broader research agenda of aligning emergent capabilities in large-scale models with varied real-world task requirements.
## 2 Related Work
Iterative Reasoning and Self-correction in LLMs.
Large language models first showed an emergent ability to reason step-by-step when prompted with chain-of-thought (CoT) examples (Wei et al., 2022). Shortly after, Wang et al. (2023) demonstrated that sampling several independent reasoning traces and selecting the majority answer, dubbed self-consistency (SC), boosts accuracy on arithmetic and commonsense tasks. Follow-up studies made the correction loop explicit by asking the model to critique its own draft before rewriting it, leading to sizeable gains in factual QA and code generation (Madaan et al., 2023). Variants that call external tools such as Python or knowledge bases during the critique stage further reduce hallucinations in open-ended generation (Chen et al., 2023; Yao et al., 2023; Gou et al., 2024). These works collectively suggest that LLMs can act as both solver and reviewer, but they focus almost exclusively on free-form text outputs.
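The self-consistency scheme described above reduces to majority voting over independently sampled answers. A minimal sketch, in which `sample_fn` and its toy answers are illustrative stand-ins for real stochastic reasoning traces rather than any cited system:

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n_samples=5):
    # Sample n independent reasoning traces and majority-vote their final answers.
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples  # also report the agreement rate

# Toy stand-in sampler that replays a fixed set of sampled answers.
traces = iter(["7", "7", "5", "7", "9"])
ans, agreement = self_consistency(lambda p: next(traces), "What is 3 + 4?")
```

The agreement rate is a useful side product: low agreement across traces often signals a question the model is uncertain about.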
Verification-based Refinement.
Instead of trusting the model's final token distribution, several papers add lightweight verifiers. Cobbe et al. (2021) attach unit tests to code synthesis; Dixit et al. (2023) use factuality checkers for summarization; Pryzant (2023) adopts entailment models for reading comprehension. The common pattern is a two-step pipeline where the LLM proposes an answer, then a cheaper or more precise module scores it. Our work keeps the entire loop inside the language model, isolating the effect of output format itself (generation vs. multiple-choice) from external verification.
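The propose-then-score pattern can be sketched in a few lines; the `propose` and `verify` callables below are hypothetical stand-ins, not any specific system from the cited papers:

```python
def propose_then_verify(propose, verify, x, n=4):
    # Step 1: the LLM proposes n candidate answers.
    candidates = [propose(x) for _ in range(n)]
    # Step 2: an external verifier scores each candidate; keep the best.
    return max(candidates, key=lambda y: verify(x, y))

# Toy stand-ins: a fixed candidate pool and a verifier that prefers longer answers.
pool = iter(["42", "forty-two", "41", "42.0"])
best = propose_then_verify(lambda x: next(pool), lambda x, y: len(y), "q")
```

The key design point is the asymmetry: generation is expensive and noisy, while verification is cheap and (ideally) more precise, so sampling several candidates and filtering is often better than a single greedy answer.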
Answer Selection and Multiple-Choice Prompting.
Tasks with a closed candidate set (e.g., MMLU (Hendrycks et al., 2021), ARC (Clark et al., 2018)) are typically solved by mapping each option to an independent prompt and picking the highest-logit answer (Brown et al., 2020). Several groups have tried to retrofit iterative reasoning onto this template. Zhu et al. (2024) prepend a self-explanation, rescore the options with the explanation as additional context, and report modest but consistent gains. Li et al. (2024) show that calibrating logits with contrastive rationales helps low-parameter models, while Pan et al. (2023) explore ensembling diverse rationales. Yet a systematic comparison between correction dynamics in open versus closed output spaces is missing; our study provides that head-to-head analysis.
Bridging the paradigms.
Contemporary benchmarks increasingly mix free-form and categorical sub-tasks; e.g., TruthfulQA has both short-answer and multiple-choice splits (Lin et al., 2022). Deployment settings such as tutoring agents or search assistants likewise alternate between generating explanations and selecting the best passages. Understanding whether self-correction behaves differently under these two regimes is therefore more than a methodological curiosity: it affects prompt engineering, compute budgeting, and safety guard-rail design. By re-implementing the main correction strategies from the literature under a unified experimental budget, we show that the shape of the output space itself controls how much an LLM can benefit from extra reflection rounds.
## 3 Open-ended Generation vs. Multiple-Choice Answer Selection
Large language models are increasingly expected to handle a wide spectrum of downstream tasks, ranging from unconstrained natural language generation, such as open-domain question answering, to highly structured classification problems, like sentiment analysis. Two of the most commonly encountered settings are (i) open-ended generation, where the model must produce a free-form text response, and (ii) multiple-choice answer selection, where it must select a single correct option from a predefined set of choices. While these two paradigms are often operationalized using the same model architecture and weights, they impose fundamentally different constraints on the output space and influence how self-correction unfolds over successive inference steps. This section formalizes these two paradigms, describes how self-correction mechanisms are instantiated within each, and presents qualitative differences that help explain the empirical patterns observed in Section 5.
Open-Ended Generation.
In the open-ended generation setting, the model is required to produce an output sequence $y^{(0)}=(y^{(0)}_{1},\ldots,y^{(0)}_{T})\in\mathcal{V}^{*}$ , where $\mathcal{V}$ denotes the vocabulary and $T$ is the (variable) sequence length. The generation is conditioned on an input $x$ , which may correspond to a question, prompt, or instruction, such that the model defines a conditional distribution:
$p(y^{(0)}\mid x)=\prod_{t=1}^{T}p(y^{(0)}_{t}\mid y^{(0)}_{<t},x)$
This formulation captures the standard auto-regressive decoding process for open-ended text generation. The generated sequence may consist of a sentence, paragraph, or longer passage, and there are no explicit structural constraints beyond syntactic plausibility and task relevance.
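The factorization above can be checked on a toy example; the per-token conditional probabilities are invented purely for illustration:

```python
import math

# Per-token conditionals p(y_t | y_<t, x) for a toy three-token answer.
step_probs = [0.9, 0.5, 0.8]

# Chain rule: the sequence probability is the product of the conditionals.
seq_prob = math.prod(step_probs)

# In practice one sums log-probabilities instead, to avoid numerical
# underflow on long sequences; exp of the sum recovers the same value.
seq_logprob = sum(math.log(p) for p in step_probs)
```

This also makes the asymmetry with multiple-choice scoring concrete: a free-form answer's score depends on every token, so longer answers tend to have lower total probability even when they are correct.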
Self-correction in this paradigm typically proceeds by prompting the model to critique its initial output, either via explicit instructions ("identify any flaws") or implicit prompting strategies ("think step by step"), followed by a new generation $y^{(1)}$. This iterative process can be repeated multiple times, resulting in a sequence $\{y^{(k)}\}_{k=0}^{K}$, where each revised answer aims to improve upon the previous one. A final answer can be selected using majority voting, log-probability re-ranking, or verifier-based scoring. Because generation is unconstrained, each iteration can introduce new content, restructure previous arguments, or expand omitted details. While this offers flexibility and the potential for substantial improvements, it also opens the door to risks such as semantic drift (Ji et al., 2023a,b), where the answer becomes misaligned with the original question over time, or hallucinations, where fictitious facts are introduced in an attempt to improve fluency or apparent coherence. These failure modes tend to accumulate if the model "over-corrects" by deviating from the initial context (Spataru, 2024).
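A minimal sketch of this critique-and-revise loop; the `generate`, `critique`, and `revise` callables are hypothetical stand-ins for LLM calls, and the toy drafts are invented for illustration:

```python
def self_correct(generate, critique, revise, x, k_max=3):
    # Critique-and-revise loop: y0 -> feedback -> y1 -> ... -> yK.
    history = [generate(x)]
    for _ in range(k_max):
        feedback = critique(x, history[-1])
        if not feedback:          # empty critique: no flaws found, stop early
            break
        history.append(revise(x, history[-1], feedback))
    return history

# Toy stand-ins that repair one flaw per round.
next_draft = {"draft-0": "draft-1", "draft-1": "final"}
hist = self_correct(
    generate=lambda x: "draft-0",
    critique=lambda x, y: "" if y == "final" else "found a flaw",
    revise=lambda x, y, fb: next_draft[y],
    x="some question",
)
```

Returning the full history, rather than only the terminal draft, is what enables the post-hoc selection strategies mentioned above (majority voting or re-ranking over $\{y^{(k)}\}$).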
Multiple-Choice Answer Selection.
By contrast, the multiple-choice setting restricts the output space to a finite set of candidate answers $A=\{a_{1},a_{2},\ldots,a_{M}\}$. For each question $x$, the model computes a logit vector $\ell(x)\in\mathbb{R}^{M}$, from which a softmax distribution is derived, and selects the most probable answer. Self-correction in this paradigm does not involve rewriting text but rather revisiting the initial logits after incorporating additional information. One common strategy is to generate a rationale $r^{(t)}$ for why a particular answer is correct, then concatenate this rationale to the original prompt and recompute the logits to obtain $\ell^{(t+1)}(x,r^{(t)})$ (Huang et al., 2023; Liu et al., 2024). Over successive iterations, this allows the model to refine its beliefs based on its own reasoning. However, since the answer set is fixed, the model cannot explore novel hypotheses or restructure the space of answers; instead, it can only shift probability mass among existing options. This bounded output space makes multiple-choice settings more stable and less prone to semantic drift, but also potentially less effective at recovering from early errors, especially if the correct answer has low initial probability and the generated rationales fail to meaningfully influence the logits.
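The logit-rescoring view can be made concrete with toy numbers; the option logits before and after conditioning on a rationale are invented for illustration, not taken from any experiment:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick(options, logits):
    # argmax over the logit vector l(x): the model's selected answer.
    return options[max(range(len(options)), key=lambda i: logits[i])]

options = ["a1", "a2", "a3"]
initial = [2.0, 1.0, 0.5]     # toy l(x): a1 favoured before any rationale
rescored = [1.5, 2.5, 0.5]    # toy l(x, r): the rationale shifts mass to a2
```

Note that "correction" here is purely a redistribution of probability mass over the fixed set $A$; if every rescoring round leaves the initial argmax in place, the loop cannot escape an early error.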
Qualitative Differences.
The two paradigms, i.e., open-ended generation and multiple-choice selection, exhibit distinct self-correction dynamics due to their differing output constraints. In open-ended generation, performance gains are typically front-loaded, with the most significant improvements occurring in the first few iterations as the model repairs inconsistencies or fills in missing details (Cook et al., 2024; Huang et al., 2023; Gou et al., 2024). However, this flexibility also increases the risk of semantic drift in later rounds (Spataru, 2024): if the model's revisions start to go off-topic or introduce inaccuracies, the session can degrade without external intervention. In contrast, multiple-choice tasks show steadier, more incremental improvements, benefiting from the stability of a fixed answer set. They may suffer, however, from logit inertia when the correct option is initially underweighted: it can be difficult to move the model to a low-probability answer unless a very compelling rationale shifts the balance. Generation tends to be more compute-intensive due to longer outputs per iteration, while multiple-choice achieves better accuracy-to-token efficiency by focusing on short discriminative outputs. Additionally, model scale interacts differently with the two formats: larger models can better mitigate drift in generation through coherent reasoning chains, while smaller models perform more reliably in multiple-choice settings due to the structured nature of the output space and the guidance provided by explicit options.
Understanding these qualitative and quantitative differences between the two paradigms is crucial for designing robust systems that use LLMs in iterative inference settings. Depending on the task requirements, whether correctness, stability, creativity, or inference budget is the primary constraint, one or the other format may be more appropriate, and self-correction strategies should be tailored accordingly.
## 4 Experimental Setup
Problem Statement.
In this study, we aim to evaluate the dynamics of iterative self-correction under open-ended generation and multiple-choice selection across representative tasks. Let $x\in\mathcal{X}$ denote an input instance (e.g., a question) with ground-truth answer $y^{\star}$. An LLM parameterised by $\theta$ produces an initial response $y^{(0)}$ whose format depends on the task paradigm. For open-ended generation, the model outputs a sequence $y^{(0)}\in\mathcal{V}^{\ast}$ with $p_{\theta}\big(y^{(0)}\mid x\big)=\prod_{t=1}^{T}p_{\theta}\big(y^{(0)}_{t}\mid y^{(0)}_{<t},x\big)$. In contrast, for multiple-choice selection, the model selects $y^{(0)}\in A=\{a_{1},\dots,a_{M}\}$ from logits $\ell(x)\in\mathbb{R}^{M}$, i.e., $y^{(0)}=\arg\max_{a_{i}\in A}\ell_{i}(x)$, with $\sigma_{i}^{(0)}(x)=\frac{e^{\ell_{i}(x)}}{\sum_{j=1}^{M}e^{\ell_{j}(x)}}$. Applying iterative self-correction, given history $\mathcal{H}^{(k-1)}=(x,y^{(0)},\dots,y^{(k-1)})$, the model produces a revision $y^{(k)}\sim p_{\theta}\big(\cdot\mid\mathcal{H}^{(k-1)}\big)$ for $k=1,\dots,K$.
We study the sequence $\mathcal{Y}(x)=\{y^{(k)}\}_{k=0}^{K}$ and aim to maximize task accuracy of the terminal output $y^{(K)}$ over $x\sim\mathcal{D}$ . We seek to observe how performance evolves with successive self-correction iterations and how error correction or degradation manifests in each paradigm. To that end, we set up experiments on two distinct question-answering benchmarks and examine multiple LLMs under various prompting strategies.
<details>
<summary>x1.png Details</summary>

Line chart of accuracy (y-axis, 0.0 to 1.0) versus self-correction iteration (x-axis, 0 to 5) for "Generation" (blue) and "Multiple-choice" (orange), each with a shaded variability band. Generation rises modestly from roughly 0.24 at iteration 0 to roughly 0.36 at iteration 5, while Multiple-choice starts higher at roughly 0.52, declines to roughly 0.44 by iteration 3, and recovers slightly to roughly 0.46. The wide, overlapping bands indicate considerable run-to-run variability.
</details>
(a) Baseline
<details>
<summary>x2.png Details</summary>

Line chart of accuracy versus iteration (0 to 5) for "Generation" (blue) and "Multiple-choice" (orange) with shaded variability bands. Generation improves from roughly 0.15 to roughly 0.35 by iteration 2 and then plateaus; Multiple-choice dips from roughly 0.52 to roughly 0.45 at iteration 1, recovers to roughly 0.52, and remains above Generation throughout. The Generation band is noticeably wider, especially at early iterations.
</details>
(b) CoT
<details>
<summary>x3.png Details</summary>

Line chart of accuracy versus iteration (0 to 5) for "Generation" (blue) and "Multiple-choice" (orange) with shaded variability bands. Generation rises from roughly 0.25 to a peak of roughly 0.35 at iteration 2, then drifts down to roughly 0.30 by iteration 5; Multiple-choice declines steadily from roughly 0.42 to roughly 0.35. The two bands overlap substantially across all iterations.
</details>
(c) SC
<details>
<summary>x4.png Details</summary>

Line chart of accuracy versus iteration (0 to 5) for "Generation" (blue) and "Multiple-choice" (orange) with shaded variability bands. Generation climbs steadily from roughly 0.78 to roughly 0.88 by iteration 4 and levels off; Multiple-choice improves more slowly, from roughly 0.64 to roughly 0.68. Generation stays above Multiple-choice at every iteration, with gains tapering after iteration 3.
</details>
(d) Baseline
[x5.png: accuracy (%) vs. iteration (0 to 5); Generation rises from about 0.78 to 0.87 and Multiple-choice from about 0.63 to 0.70, with Generation higher throughout and both curves plateauing after the early iterations.]
(e) CoT
[x6.png: accuracy (%) vs. iteration (0 to 5); Generation improves from about 0.79 to 0.81 after one iteration and then plateaus near 0.80, while Multiple-choice rises from about 0.64 to 0.67 and plateaus; Generation is higher throughout.]
(f) SC
Figure 1: Average cumulative accuracy on generation and multiple-choice tasks. (Top) On the DisambiguationQA dataset, models perform better on the multiple-choice task as we iteratively self-correct their responses, while (bottom) on the tinyTruthfulQA dataset, models perform better on the generation task.
Research Questions.
Our study is guided by the following three research questions:
- RQ1: How do self-correction dynamics differ between open-ended and multiple-choice tasks?
- RQ2: How do model scale and prompting strategy influence self-correction across the two paradigms?
- RQ3: How does iterative self-correction affect correctness, stability, and semantic drift, and what mechanisms explain these effects?
Datasets.
We evaluate on two benchmarks, DisambiguationQA and tinyTruthfulQA, that each provide parallel formulations for both multiple-choice questions and open-ended generation. This allows us to study self-correction dynamics under consistent task content but different output constraints.
- DisambiguationQA Kazemi et al. (2025) is typically phrased in multiple-choice form, where each question presents a pronoun or reference with referential ambiguity and provides four candidate referents. However, the same questions can also be cast into an open-ended format by asking models to generate the referent rather than choose among options. Thus, DisambiguationQA instantiates a scenario where the answer space is tightly constrained but also amenable to open-ended generation in a parallel setup.
- tinyTruthfulQA Polo et al. (2024) is a challenging subset of the TruthfulQA benchmark Lin et al. (2022) focused on short-form factual queries that tend to provoke false or misleading answers from LLMs. While TruthfulQA is usually evaluated via free-form generation, where models must produce a truthful answer, a multiple-choice variant has also been developed, offering for each question a small set of candidate answers drawn from the same reference answer pool. Therefore, tinyTruthfulQA inherits this dual-format nature, where the same questions support both open-ended and multiple-choice instantiations. This dataset exemplifies scenarios requiring knowledge retrieval and precision in generation.
By evaluating both tasks, we cover one case where the ground-truth answer lies within a closed set of options and one case where the answer must be generated. We can therefore compare how iterative self-correction dynamics differ when the model's output is tightly constrained versus freely generated.
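Since both datasets admit parallel formulations, each item can be rendered in either format. The sketch below illustrates the idea; the field names and example question are assumptions, not the datasets' actual schema:

```python
# Illustrative sketch: rendering one dataset item in parallel multiple-choice
# and open-ended formats. Field names and the example question are assumptions,
# not the datasets' actual schema.

def to_multiple_choice(question: str, options: list[str]) -> str:
    """Render the question with lettered candidate answers."""
    letters = "ABCD"
    lines = [question] + [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

def to_open_ended(question: str) -> str:
    """Drop the options so the model must generate the answer itself."""
    return question

item = {
    "question": "Sentence: The chef told the waiter that he was late. Who was late?",
    "options": ["The chef", "The waiter", "Both", "Ambiguous"],
}
mc_prompt = to_multiple_choice(item["question"], item["options"])
open_prompt = to_open_ended(item["question"])
```

The only difference between the two formats is whether the candidate answers appear in the prompt, which keeps task content constant while varying the output constraint.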
Models.
We evaluate the dynamics of iterative self-correction under unconstrained generation and multiple-choice selection using six pre-trained language models spanning small to large parameter scales. We evaluate SmolLM2-1.7B Allal et al. (2025), Qwen2.5-3B Qwen et al. (2025), Llama-3.1-8B Grattafiori et al. (2024), Qwen2.5-14B Qwen et al. (2025), DeepSeek-R1-Distill-Llama-8B Guo et al. (2025), and Gemini-2.0-Flash Comanici et al. (2025). These models represent diverse families and scales, from distilled smaller models to state-of-the-art large models. For each model and dataset, we compare the three aforementioned prompting strategies: a direct Baseline prompt, zero-shot chain-of-thought (CoT) prompting Kojima et al. (2022), and our iterative SC procedure, which reviews and refines the model's own previous response for up to five rounds. We run all models via Hugging Face, except Gemini-2.0-Flash, which is accessed through its API.
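The iterative SC procedure can be sketched as a simple loop. Here `generate` stands in for any model call (a Hugging Face pipeline or an API client), and the prompt wording is paraphrased rather than our exact template:

```python
# Minimal sketch of the iterative SC loop (up to five refinement rounds).
# `generate` stands in for any model call; the prompt wording is paraphrased,
# not the exact template.

def self_correct(question: str, generate, max_rounds: int = 5) -> list[str]:
    """Return the iteration-0 answer followed by each refined answer."""
    history = [generate(question)]                    # iteration 0: first attempt
    for _ in range(max_rounds):
        attempts = "\n".join(f"Previous attempt: {a}" for a in history)
        prompt = (
            f"{question}\n{attempts}\n"
            "Review your previous attempts and refine your answer if needed."
        )
        history.append(generate(prompt))              # one self-correction round
    return history

# Toy model that revises its answer once it sees a prior attempt.
answers = self_correct("2+2?", lambda p: "5" if "Previous" not in p else "4")
```

Keeping the full answer history in the prompt is what lets later rounds either consolidate a correct answer or, as we observe, drift away from it.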
Prompts.
In our experiments, we use simplified prompts to minimize the impact of prompt design on performance across tasks, keeping the focus on the self-correction mechanism Huang et al. (2023). Specifically, we apply a basic prompt for the Baseline method and adopt zero-shot Chain-of-Thought (CoT) prompting Kojima et al. (2022) for both the CoT and Self-Consistency (SC) approaches. The initial prompts are used for the first attempt (iteration 0) under each strategy. They differ only in whether the model is encouraged to produce an explicit chain of reasoning before the final answer. For iterations beyond the first, we prepend instructions to review the prior attempts. In both cases, the model is reminded of its earlier answers (which are included in the conversation context) and encouraged to refine them. The CoT variant additionally maintains the directive to use a step-by-step reasoning process during revision. Our full prompts can be found in Appendix A.2.
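The initial and revision prompt families can be sketched as templates; the wording below is paraphrased, and the exact prompts appear in Appendix A.2:

```python
# Paraphrased templates for the Baseline and CoT prompt variants and their
# revision counterparts; exact wording is given in Appendix A.2.

COT_DIRECTIVE = "Let's think step by step."  # zero-shot CoT cue (Kojima et al., 2022)

def initial_prompt(question: str, use_cot: bool) -> str:
    cot = f"{COT_DIRECTIVE}\n" if use_cot else ""
    return f"{question}\n{cot}The final answer is:"

def revision_prompt(question: str, prior: list[str], use_cot: bool) -> str:
    attempts = "\n".join(f"Attempt {i}: {a}" for i, a in enumerate(prior))
    cot = f"{COT_DIRECTIVE}\n" if use_cot else ""
    return (
        "Review your earlier answers below and refine them.\n"
        f"{question}\n{attempts}\n{cot}The final answer is:"
    )

prompt = revision_prompt("Who was late?", ["The chef"], use_cot=True)
baseline = initial_prompt("Who was late?", use_cot=False)
```

The Baseline and CoT variants share the same scaffold and differ only in the step-by-step directive, which isolates the effect of explicit reasoning from the rest of the prompt design.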
Final Answer Extraction.
For all of our problems, we append the suffix "The final answer is:" to the prompt to encourage the model to produce its final answer in a format we can easily extract. More details are provided in Appendix A.1.
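A minimal extraction sketch is shown below; the exact extraction rules are in Appendix A.1, and this regex is illustrative:

```python
import re

# Sketch of final-answer extraction after the "The final answer is:" suffix.
# The exact extraction rules are in Appendix A.1; this regex is illustrative.

def extract_final_answer(text: str):
    """Return the text following the answer cue, or None if it is absent."""
    m = re.search(r"The final answer is:\s*(.+)", text, flags=re.IGNORECASE)
    return m.group(1).strip().rstrip(".") if m else None

answer = extract_final_answer("Reasoning... The final answer is: (B) The waiter.")
```

Anchoring extraction to a fixed cue avoids brittle parsing of free-form reasoning text, at the cost of discarding responses that ignore the suffix.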
## 5 Results
[x7.png: average correct flips vs. iteration (1 to 5); both methods decline, Generation from about 0.073 to 0.025 and Multiple-choice from about 0.078 to 0.027, with the gap narrowing over iterations.]
(a) Baseline
[x8.png: average correct flips vs. iteration (1 to 5); both methods decline, Generation from about 0.077 to 0.024 and Multiple-choice from about 0.082 to 0.028.]
(b) CoT
[x9.png: average correct flips vs. iteration (1 to 5); Generation falls from about 0.063 to 0.028 and Multiple-choice from about 0.052 to 0.027, with both fluctuating after iteration 2.]
(c) SC
[x10.png: average incorrect flips vs. iteration (1 to 5); Generation falls from about 0.063 to a minimum of 0.028 before rising to 0.033, while Multiple-choice falls from about 0.085 to a minimum of 0.025 before rising to 0.042.]
(d) Baseline
[x11.png: average incorrect flips vs. iteration (1 to 5); Generation declines from about 0.062 to 0.035, while Multiple-choice declines from about 0.082 to a minimum of 0.025 before rising to 0.038.]
(e) CoT
[x12.png: average incorrect flips vs. iteration (1 to 5); Generation declines from about 0.063 to 0.023 and Multiple-choice from about 0.053 to 0.025, with Generation lower throughout.]
(f) SC
Figure 2: Average Correct and Incorrect Flips on DisambiguationQA
We now analyze the results in relation to our three research questions.
Improvement Patterns Across Iterations (RQ1).
To address RQ1, we first examine the aggregate performance reported in Figure 1, which compares accuracy across correction iterations for the generation and multiple-choice formats. The generation paradigm improves rapidly in the first one or two iterations, showing that early revisions are effective at fixing obvious errors or adding missing information. After these early gains, however, performance often plateaus or declines, as additional revisions increase the risk of semantic drift and introduce new mistakes. In contrast, the multiple-choice paradigm improves more gradually and steadily: accuracy rises incrementally with each round of self-correction, reflecting cautious re-weighting among fixed options. Yet this format struggles to recover from poor initial predictions: if the model's first choice is wrong, subsequent iterations rarely flip it to the correct option, an effect of logit inertia.
Figures 2 and 3 present the "flip" dynamics of self-correction on the two datasets, broken down into correct flips (a previously wrong answer corrected) and incorrect flips (a previously correct answer changed to a wrong one) over successive iterations. On DisambiguationQA (Figure 2), multiple-choice self-correction yields very few flips overall: correct answers are stably retained, but wrong initial guesses are seldom corrected. Generation, by contrast, produces more frequent flips: many are beneficial in early iterations (correcting ambiguous references), but they become increasingly harmful in later ones, as correct answers are sometimes replaced with incorrect ones once the model starts to over-correct or drift. On tinyTruthfulQA (Figure 3), the contrast is sharper: generation produces a high number of flips, with many early correct flips (replacing misconceptions with truths) but also a rising number of incorrect flips in later rounds, reflecting semantic drift. Multiple-choice again remains stable, with minimal incorrect flips but limited ability to recover from an early mistake.
Taken together, these results show that open-ended generation offers adaptability and rapid early gains but suffers from instability in later iterations, whereas multiple-choice offers stability and incremental improvement but is hampered by inertia: if the model does not select the correct option on the first attempt, it rarely switches to it later. This confirms that self-correction effectiveness depends strongly on task format: open-ended generation can exploit flexibility to correct errors but risks drift, while multiple-choice provides reliable retention of correct answers at the expense of recoverability. This fundamental difference in dynamics directly answers RQ1: self-correction behaves very differently in open-ended versus fixed-option settings, with each paradigm exhibiting its own pattern of improvement and failure modes.
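The flip statistics discussed above can be computed directly from per-iteration correctness records. The following is an illustrative sketch (the input layout and field names are our own, not taken from the released code):

```python
def flip_rates(correct):
    """Average correct/incorrect flip rates per self-correction iteration.

    `correct[q][t]` is True iff question q is answered correctly after
    iteration t (t = 0 is the initial answer). A correct flip at iteration t
    means wrong at t-1 and right at t; an incorrect flip is the reverse.
    """
    n = len(correct)
    rates = []
    for t in range(1, len(correct[0])):
        correct_flips = sum(1 for c in correct if not c[t - 1] and c[t])
        incorrect_flips = sum(1 for c in correct if c[t - 1] and not c[t])
        rates.append({"iteration": t,
                      "correct_flip_rate": correct_flips / n,
                      "incorrect_flip_rate": incorrect_flips / n})
    return rates
```

Averaging these per-iteration rates over questions yields curves of the kind plotted in Figures 2 and 3.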
[Line chart: average correct flips per iteration, Generation vs. Multiple-choice, with variance bands.]
(a) Baseline
[Line chart: average correct flips per iteration, Generation vs. Multiple-choice, with variance bands.]
(b) CoT
[Line chart: average correct flips per iteration, Generation vs. Multiple-choice, with variance bands.]
(c) SC
[Line chart: average incorrect flips per iteration, Generation vs. Multiple-choice, with variance bands.]
(d) Baseline
[Line chart: average incorrect flips per iteration, Generation vs. Multiple-choice, with variance bands.]
(e) CoT
[Line chart: average incorrect flips per iteration, Generation vs. Multiple-choice, with variance bands.]
(f) SC
Figure 3: Average Correct and Incorrect Flips on tinyTruthfulQA
Effects of Model Scale and Prompting Strategy (RQ2).
[Six line charts: accuracy (%) per iteration for each model on DisambiguationQA and tinyTruthfulQA under Baseline, CoT, and Self-Consistency prompting, for both generation and multiple-choice formats.]
Figure 4: Accuracy per iteration per model on generation and multiple-choice.
Here, we investigate how a model's size and the prompting strategy influence self-correction, and whether these effects differ between the two output paradigms. Figure 4 provides a detailed view of accuracy per iteration for various models under different prompting methods. A clear finding is that task difficulty moderates these effects. On the challenging DisambiguationQA benchmark, accuracy is low for all models: even the largest (e.g., Gemini-2.0-Flash, Qwen2.5-14B) plateau around 50% in multiple-choice and below 20% in generation, while smaller models perform far worse. In contrast, on the easier tinyTruthfulQA, generative accuracy ranges from 60-90% and multiple-choice from 50-80%, with even small models performing well. Thus, model scale yields clear benefits on harder tasks, but differences narrow considerably on simpler ones.
The prompting strategy has a modest but noticeable effect, more so on the difficult task. On DisambiguationQA, using an explicit CoT prompt or an SC approach yields slight accuracy improvements over the Baseline direct prompting. For example, prompting the model to "think step by step" or to consider multiple reasoning paths sometimes helps it disambiguate the question better, nudging accuracy up by a few percentage points. These gains, while not dramatic, suggest that reasoning-oriented prompts can aid the model on ambiguous, challenging questions. In contrast, on tinyTruthfulQA, all three prompting strategies lead to very similar performance. The accuracy curves for different prompts on this task are nearly overlapping (Figure 4), indicating that when a question is relatively straightforward or the model already knows the domain (e.g., common truths vs. misconceptions), an elaborate prompt does not provide much benefit. In summary, prompting variations have a task-dependent impact: they can be slightly beneficial for resolving difficult queries (DisambiguationQA) but are mostly redundant for simpler factual questions (tinyTruthfulQA). This aligns with findings in the literature (Sprague et al., 2025).
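As a point of reference, the SC strategy amounts to sampling several chain-of-thought completions and majority-voting their final answers. A minimal sketch, where `ask_model` and `extract_answer` are hypothetical placeholders for the sampling call and answer parsing:

```python
from collections import Counter

def self_consistency(question, ask_model, extract_answer, n_samples=5):
    """Sample several CoT completions and majority-vote the final answers.

    `ask_model` and `extract_answer` are hypothetical placeholders: the
    former should sample a completion (with nonzero temperature), the
    latter should parse the final answer out of the reasoning text.
    """
    cot_prompt = f"{question}\nLet's think step by step."
    answers = [extract_answer(ask_model(cot_prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority answer
```

With greedy (deterministic) decoding all samples coincide and SC reduces to the Baseline, which is consistent with the small, task-dependent gains observed above.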
Model scale shows a similarly nuanced effect. Larger models generally outperform smaller ones, especially on DisambiguationQA, where 14B+ models clearly surpass 1–3B models. On tinyTruthfulQA, however, the performance gap narrows, with small models often approaching large-model accuracy. In some cases, scaling produces diminishing returns, indicating that size matters more for difficult tasks but offers limited advantage once a task is already within reach.
Notably, repeated iterations of self-correction do not consistently boost accuracy for either paradigm, regardless of model size or prompt strategy. Across our experiments, most performance curves over iterations (spanning iterations 0 through 5) are relatively flat after the initial step. As highlighted by Figure 4, it is rare to see a clear upward trajectory beyond the first one or two iterations; instead, accuracy often oscillates with minor gains or losses. For example, a model might correct a mistake at iteration 1, only to introduce a different mistake at iteration 3, ending up with accuracy similar to where it started. This plateauing behavior implies that giving the model many chances to self-correct yields diminishing returns. Neither larger scale nor advanced prompting fundamentally changes this outcome: their benefits tend to manifest in the first attempt or two, but they do not drive continual improvement with more iterations. In some cases, we even observed slight performance degradation with too many iterations (echoing the drift issues from RQ1). In summary, the impact of model scale and prompting strategy on self-correction is real but nuanced: larger models and CoT-style prompts can improve initial accuracy, especially on hard tasks, but these factors are task-dependent and ultimately insufficient to guarantee ongoing improvements through iterative self-correction alone. Multiple-choice and generation formats alike see their gains saturate early, and improvements from scaling or better prompting taper off without addressing the core limitations of each paradigm. We also found that the multiple-choice paradigm often reaped slightly more benefit from increased model size and reasoning prompts than the generation paradigm did (especially on DisambiguationQA), reinforcing the idea that constrained decision tasks can more readily capitalize on those enhancements.
Still, neither paradigm achieves a dramatically upward performance trend with iteration, which is a key insight for understanding the boundaries of current self-correction capabilities.
Trade-offs Between Adaptability and Stability (RQ3).
RQ3 examines how iterative self-correction influences correctness, stability, and semantic drift across unconstrained and constrained outputs. In the generation setting, flexibility allows models to revise and often improve answers in the first one or two iterations, but this same flexibility leads to semantic drift in later rounds. As Figures 2 and 3 show, along with the detailed per-model evaluation plots in Appendix C.1, generation produces many flips: early ones are often correct (e.g., resolving an ambiguity or correcting a misconception), but over time, incorrect flips dominate as the model over-edits or drifts away from the question. This suggests that while generation supports adaptability, it lacks effective internal checks to prevent harmful revisions. By contrast, in the multiple-choice setting, the output space is restricted to fixed options, which prevents drift altogether. Correct answers remain locked in across iterations, reflecting high stability. However, this comes with logit inertia: wrong initial answers persist, with very few corrective flips observed in Figures 2 and 3. The mechanism here is that once a wrong option is selected, the model rarely shifts its ranking enough to choose the correct one later, even when revisiting its reasoning.
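These flip dynamics can be quantified by counting transitions over per-iteration correctness labels; a minimal sketch (function names are ours, not from the released code):

```python
from collections import Counter

def count_flips(correctness):
    """Count answer flips across self-correction iterations.

    correctness: per-iteration correctness labels for one example,
    e.g. [False, True, True] for iterations 0..2. Returns a Counter
    with keys 'corrective' (wrong -> right) and 'harmful' (right -> wrong).
    """
    flips = Counter()
    for prev, curr in zip(correctness, correctness[1:]):
        if not prev and curr:
            flips["corrective"] += 1
        elif prev and not curr:
            flips["harmful"] += 1
    return flips

def aggregate_flips(trajectories):
    """Aggregate flip counts over a dataset (one trajectory per example)."""
    total = Counter()
    for traj in trajectories:
        total.update(count_flips(traj))
    return total
```

Comparing the two counters per paradigm reproduces the qualitative pattern above: generation accumulates both kinds of flips, multiple-choice very few of either.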
These patterns reveal a fundamental adaptability–stability trade-off. Generation is exploratory and can recover from initial mistakes, but risks undermining correctness as iterations accumulate. Multiple-choice ensures consistency once correct, but limits opportunities to fix errors. For system design, this implies that neither paradigm is universally optimal. Applications requiring stable outputs, such as safety-critical domains, benefit from constrained correction, though additional mechanisms may be needed to overcome inertia (e.g., external verification or re-ranking). Conversely, tasks where capturing every possible correction is crucial may favor open-ended revision, provided that safeguards against drift are implemented. Promising directions include hybrid strategies that combine paradigms, using generation to explore candidate answers followed by constrained verification to anchor correctness, and dynamic stopping rules that halt iteration once improvements saturate or harmful drift is detected. Addressing these trade-offs directly, by mitigating semantic drift in generation and reducing inertia in multiple-choice, will be key to making iterative self-correction a reliable capability of LLM systems.
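A dynamic stopping rule of the kind suggested here can be sketched as follows; the thresholds and the drift-score definition are illustrative assumptions, not values from our experiments:

```python
def should_stop(accuracy_history, drift_history,
                min_gain=0.005, max_drift=0.3):
    """Halt iterative self-correction once gains saturate or drift grows.

    accuracy_history: validation accuracy after each iteration so far.
    drift_history: per-iteration drift scores in [0, 1], e.g.
    1 - similarity between the current answer and the iteration-0
    answer (how drift is measured is left open here).
    """
    if drift_history and drift_history[-1] > max_drift:
        return True  # harmful drift detected: stop revising
    if len(accuracy_history) >= 2:
        gain = accuracy_history[-1] - accuracy_history[-2]
        if gain < min_gain:
            return True  # improvements have saturated
    return False
```

Since most gains in our experiments arrive in the first one or two iterations, even this crude rule would halt most runs early and avoid the late-round degradation.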
## 6 Conclusion
This study compared iterative self-correction in large language models across open-ended generation and multiple-choice question answering. Results show that the structure of the output space fundamentally shapes correction dynamics. Generation achieves rapid early gains by correcting errors in the first few iterations, but suffers from semantic drift as revisions accumulate, resulting in increasing rates of incorrect flips. Multiple-choice responses remain highly stable and avoid drift, but exhibit logit inertia: wrong initial answers are rarely overturned, and improvements are incremental at best. Model scale and prompting strategy modulate performance but do not alter these core patterns. Larger models and reasoning-oriented prompts (CoT, SC) yield slight improvements, especially on the harder DisambiguationQA task, but their effects are modest and task-dependent. Across both paradigms, accuracy generally plateaus after the first one or two iterations, showing that repeated self-correction brings limited benefit.
These findings highlight an inherent adaptability–stability trade-off. Open-ended generation enables recovery from errors but risks instability, while multiple-choice ensures reliability but limits correction. Future work should explore hybrid strategies, such as using generation for exploration and constrained formats for verification, as well as dynamic stopping criteria to prevent late drift. Addressing drift and inertia directly will be essential for building reliable self-correcting LLM systems.
## Limitations
This study focuses on benchmarks that provide parallel formulations for both open-ended generation and multiple-choice questions. While this setup enables a controlled analysis of self-correction across task formats, it also limits the number of datasets available for evaluation, as few benchmarks support both types of tasks. Moreover, our experiments are conducted using currently available models of moderate scale. Recent larger models, which may exhibit different self-correction dynamics and reasoning behaviors, are not included in our analysis. Future work could extend our study to such models to provide a more comprehensive understanding of scaling effects.
## Ethical Considerations
We have carefully verified that the software, model checkpoints and existing datasets utilised in this work are permitted for access, distribution and, where relevant, modification. Our use and purpose comply with those terms.
## Acknowledgments
This research is supported by the Engineering and Physical Sciences Research Council [EP/S021566/1] and the EPSRC Fellowship titled "Task Based Information Retrieval" [EP/P024289/1].
## References
- A2i (2025) A2i. 2025. TruthfulQA Truth Judge. Accessed: 2025.
- Allal et al. (2025) Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martin Blazquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Agustín Piqueres Lajarín, Hynek Kydlíček, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan Son NGUYEN, Ben Burtenshaw, Clémentine Fourrier, Haojun Zhao, Hugo Larcher, Mathieu Morlon, Cyril Zakka, and 3 others. 2025. SmolLM2: When smol goes big – data-centric training of a fully open small language model. In Second Conference on Language Modeling.
- Belcak et al. (2025) Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. 2025. Small language models are the future of agentic AI. arXiv preprint arXiv:2506.02153.
- Brown and et al. (2020) Tom B Brown and et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Chang et al. (2024) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, and 1 others. 2024. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
- Chen et al. (2023) Mark Y Chen, Chia-Wei Liu, Xuezhi Wang, Quoc V Le, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Program-aided language models: Language models as programs. arXiv preprint arXiv:2303.11366.
- Clark and et al. (2018) Peter Clark and et al. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Anish Madaan, and et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Comanici et al. (2025) Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- Cook et al. (2024) Jonathan Cook, Tim Rocktäschel, Jakob Nicolaus Foerster, Dennis Aumiller, and Alex Wang. 2024. TICKing all the boxes: Generated checklists improve LLM evaluation and generation. In Language Gamification - NeurIPS 2024 Workshop.
- Dixit et al. (2023) Tanay Dixit, Fei Wang, Muhao Chen, and et al. 2023. Improving factuality of abstractive summarization without sacrificing summary quality. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 902–913.
- Gou et al. (2024) Zhibin Gou, Zhihong Shao, Yeyun Gong, yelong shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2024. CRITIC: Large language models can self-correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations.
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, and et al. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR).
- Huang et al. (2023) Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2023. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
- Ji et al. (2023a) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023a. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
- Ji et al. (2023b) Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023b. Towards mitigating LLM hallucination via self-reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843.
- Kamoi et al. (2024) Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, and Rui Zhang. 2024. When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics, 12:1417–1440.
- Kazemi et al. (2025) Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K Jain, Virginia Aglietti, Disha Jindal, Peter Chen, and 1 others. 2025. Big-bench extra hard. arXiv preprint arXiv:2502.19187.
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.
- Krishna et al. (2024) Satyapriya Krishna, Chirag Agarwal, and Himabindu Lakkaraju. 2024. Understanding the effects of iterative prompting on truthfulness. In Proceedings of the 41st International Conference on Machine Learning, ICMLā24. JMLR.org.
- Kumar et al. (2025) Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. 2025. Training language models to self-correct via reinforcement learning. In The Thirteenth International Conference on Learning Representations.
- Li and et al. (2024) Wei Li and et al. 2024. Logitlens: Calibrating reasoning in language models with internal consistency. NeurIPS 2024.
- Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), pages 2129–2144.
- Liu et al. (2024) Dancheng Liu, Amir Nassereldine, Ziming Yang, Chenhui Xu, Yuting Hu, Jiajie Li, Utkarsh Kumar, Changjae Lee, Ruiyang Qin, Yiyu Shi, and 1 others. 2024. Large language models have intrinsic self-correction ability. arXiv preprint arXiv:2406.15673.
- Ma et al. (2025) Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, and Jia Li. 2025. S²R: Teaching LLMs to self-verify and self-correct via reinforcement learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22632–22654.
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 others. 2023. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594.
- Pan and et al. (2023) Xinyu Pan and et al. 2023. Multiple rationales for multiple-choice question answering. arXiv preprint arXiv:2305.03495.
- Polo et al. (2024) Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. 2024. tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992.
- Pryzant (2023) Ryan Pryzant. 2023. Automatic prompt optimization with "gradient descent" for language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 494–507.
- Qwen et al. (2025) Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634ā8652.
- Spataru (2024) Ava Spataru. 2024. Know when to stop: A study of semantic drift in text generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3656–3671.
- Sprague et al. (2025) Zayne Rea Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. 2025. To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning. In The Thirteenth International Conference on Learning Representations.
- Suzgun and Kalai (2024) Mirac Suzgun and Adam Tauman Kalai. 2024. Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954.
- Suzgun et al. (2025) Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. 2025. Dynamic cheatsheet: Test-time learning with adaptive memory. arXiv preprint arXiv:2504.07952.
- Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on Neural Information Processing Systems.
- Zelikman et al. (2022) Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and 1 others. 2023. Judging LLM-as-a-judge with MT-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623.
- Zhu and et al. (2024) Xue Zhu and et al. 2024. Mcrepair: Enhancing multiple-choice reasoning with self-explanation and rescoring. arXiv preprint arXiv:2405.18711.
## Appendix A Details on Experimental Setup
### A.1 Details on Final Answer Extraction
For all of our problems, we appended a short phrase to the question text to guide the model to give the final answer in a clear format: "provide your final answer after the 'The final answer is: '." To extract the answer, we split the model output on this phrase and take what comes after it. Since models sometimes vary the phrase slightly, we also check for the following variations until one is found: 'The answer is: ', 'The answer is ', 'The final answer is: ', 'The final answer is '. Once we obtain the final answer, we clean it up with a few simple steps:
1. If the answer is wrapped in LaTeX commands such as \boxed, \text, or \texttt, or in ** markers, we remove those and keep only the text inside.
2. For multiple-choice questions, if the model adds extra text after the final answer (for example, after a newline \n), we split on \n and keep only the first part. We then lowercase both the final answer and the label, and check correctness with the following rules:
- If the final answer and label are identical, we consider the final answer correct.
- If they only differ by quotes or brackets around the answer, we consider it to be correct.
- For multiple-choice questions, the label is in the format (<LETTER>). If the model only gives the letter (like A instead of (A)), we still count it as correct.
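A simplified re-implementation of this extraction and checking pipeline (function and marker names are ours, not the exact released code) might look like:

```python
import re

# Marker variants checked in order; the colon forms come first so that
# "The final answer is " does not leave a dangling ": " behind.
MARKERS = [
    "The final answer is: ", "The final answer is ",
    "The answer is: ", "The answer is ",
]

def extract_final_answer(output):
    """Split the model output on the first marker variant found."""
    for marker in MARKERS:
        if marker in output:
            answer = output.split(marker, 1)[1]
            break
    else:
        return None
    # Keep only the first line for multiple-choice outputs.
    answer = answer.split("\n", 1)[0].strip()
    # Strip LaTeX wrappers like \boxed{...} and markdown ** markers.
    answer = re.sub(r"\\(?:boxed|text|texttt)\{([^}]*)\}", r"\1", answer)
    return answer.strip("*").strip()

def is_correct(answer, label):
    """Lenient comparison mirroring the three rules above."""
    a, l = answer.lower().strip(), label.lower().strip()
    if a == l:
        return True
    if a.strip("\"'()[]") == l.strip("\"'()[]"):
        return True
    # Accept a bare letter for a multiple-choice label formatted as (A).
    m = re.fullmatch(r"\((\w)\)", l)
    return bool(m and a == m.group(1))
```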
### A.2 Prompts
#### A.2.1 Start Prompts
Baseline
Question: {question}. Provide your final answer after the 'The final answer is: '.
Chain-of-Thought (CoT)
Question: {question}. Think step by step, and provide your final answer after the 'The final answer is: '.
#### A.2.2 Iterative (Self-Correction) Prompts
Baseline
Question: {question}. Review your previous responses, and provide your final answer after the 'The final answer is: '.
Chain-of-Thought (CoT)
Question: {question}. Review your previous responses, think step by step and provide your final answer after the 'The final answer is: '.
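Putting the start and iterative Baseline prompts together, the self-correction loop can be sketched as follows; `query_model` is a placeholder for any chat-style LLM call, not an API from our codebase:

```python
def self_correction_loop(question, query_model, iterations=5):
    """Run iteration 0 with the start prompt, then ask the model to
    review its previous responses for `iterations` further rounds.

    query_model(messages) -> str stands in for any chat LLM call;
    `history` grows so each round sees all prior answers.
    """
    suffix = " Provide your final answer after the 'The final answer is: '."
    history = [{"role": "user", "content": f"Question: {question}.{suffix}"}]
    answers = []
    for _ in range(iterations + 1):
        reply = query_model(history)
        answers.append(reply)
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content":
            f"Question: {question}. Review your previous responses, "
            "and provide your final answer after the 'The final answer is: '."})
    return answers
```

The returned list holds one answer per iteration (0 through `iterations`), which is what the flip and accuracy analyses in the main text operate on.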
## Appendix B Evaluation Protocol
Given the differences between task formats, we adopt distinct evaluation strategies tailored to the characteristics of each setting: open-ended generation and multiple-choice questions. For multiple-choice questions, we use Soft Match (SM) Suzgun and Kalai (2024); Suzgun et al. (2025), a lenient metric that considers an answer correct if the ground-truth label appears in the model's output, disregarding minor formatting variations such as punctuation or whitespace.
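Soft Match can be approximated as a containment check after light normalization; the sketch below is our reading of the metric, not the reference implementation:

```python
import string

def soft_match(output, label):
    """Lenient multiple-choice scoring: the answer counts as correct
    if the ground-truth label appears in the model output once
    punctuation and extra whitespace are stripped from both sides."""
    def norm(s):
        s = s.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(s.split())
    o, l = norm(output), norm(label)
    # Single-token labels (e.g. option letters) must match a whole
    # token, so that label "(A)" does not match inside "answer".
    if " " not in l:
        return l in o.split()
    return l in o
```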
For open-ended generation, we employ the LLM-as-a-Judge Zheng et al. (2023) approach to assess the correctness of the generated answers relative to the ground-truth responses for each dataset. Specifically, for tinyTruthfulQA we use the fine-tuned truth-judge model introduced by A2i (2025) (https://github.com/yizhongw/truthfulqa_reeval). For DisambiguationQA, we prompt a large model, GPT-4o, with the question, the model-generated answer, and the reference answer, asking it to determine whether the generated answer is correct. The exact prompt used for DisambiguationQA evaluation is shown below:
Evaluation Prompt
You are an expert in answer correctness evaluation. Given a question, its reference answer, and a generated answer, please evaluate the correctness of the generated answer based on the question and the reference answer. Here are the question, reference answer, and generated answer: - Question: {question} - Reference Answer: {gold answer} - Generated Answer: {generated answer} Please assess the correctness of the generated answer by considering the question and comparing it against the reference answer. Return 'yes' if the generated answer is completely correct; otherwise, return 'no'. The final answer must only be 'yes' or 'no', corresponding to the correctness of the generated answer.
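The judge's reply then needs to be mapped to a binary verdict; a tolerant parser sketch (`parse_judge_verdict` is our name, not from the paper's code):

```python
def parse_judge_verdict(judge_output):
    """Map the judge's free-form reply to a boolean. The prompt asks
    for a bare 'yes' or 'no', but we tolerate casing, surrounding
    text, and trailing punctuation."""
    text = judge_output.strip().lower().rstrip(".!\"'")
    if text.endswith("yes"):
        return True
    if text.endswith("no"):
        return False
    # Fall back to whichever verdict appears last in the reply.
    yes, no = text.rfind("yes"), text.rfind("no")
    if yes == no == -1:
        raise ValueError("no yes/no verdict found in judge output")
    return yes > no
```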
## Appendix C Additional Experiments and Results
<details>
<summary>x20.png Details</summary>

Bar chart (Baseline prompting): generation (blue) vs. multiple-choice (orange) accuracy per model, y-axis accuracy from 0.0 to 0.6. Models shown: DeepSeek-R1-Distill-Llama-8B, Llama-3.1-8B, Qwen2.5-14B, Qwen2.5-3B, SmolLM2-1.7B, and Gemini-2.0-Flash. Multiple-choice accuracy exceeds generation accuracy for every model; Qwen2.5-14B attains the highest generation accuracy (~0.46), Gemini-2.0-Flash the highest multiple-choice accuracy (~0.57), and SmolLM2-1.7B the lowest generation accuracy (~0.08).
</details>
(a) Baseline
<details>
<summary>x21.png Details</summary>

Bar chart (CoT prompting): generation (blue) vs. multiple-choice (orange) accuracy per model, y-axis accuracy from 0.0 to 0.6, for DeepSeek-R1-Distill-Llama-8B, Llama-3.1-8B, Qwen2.5-14B, Qwen2.5-3B, SmolLM2-1.7B, and Gemini-2.0-Flash. Multiple-choice accuracy is higher than generation accuracy for every model; Gemini-2.0-Flash has the highest generation accuracy (~0.48), SmolLM2-1.7B the lowest (~0.09), and multiple-choice accuracy clusters between ~0.34 and ~0.54.
</details>
(b) CoT
<details>
<summary>x22.png Details</summary>

Bar chart (SC prompting): generation (blue) vs. multiple-choice (orange) accuracy per model, y-axis accuracy from 0.0 to 0.5, for the same six models. Multiple-choice again exceeds generation throughout; Qwen2.5-14B reaches the highest multiple-choice accuracy (~0.52), while SmolLM2-1.7B trails far behind on both tasks (~0.05 generation, ~0.23 multiple-choice).
</details>
(c) SC
<details>
<summary>x23.png Details</summary>

### Visual Description
## Bar Chart: Model Accuracy Comparison
### Overview
This bar chart compares the accuracy of several language models on two different tasks: "Generation" and "Multiple-choice". Accuracy is reported as a proportion ranging from 0.0 to 1.0. The chart displays the accuracy for each model and task using adjacent bars.
### Components/Axes
* **X-axis:** Model Names - DeepSeek-R1, Llama-3.1-8B, Qwen-2.5-14B, Qwen-2.5-3B, SmolLM2-1.7B, Gemini-2.0-Flash. Below "DeepSeek-R1", a secondary label "Distill-Llama-8B" appears, the wrapped second line of the full model name DeepSeek-R1-Distill-Llama-8B.
* **Y-axis:** Accuracy (%) - Scale ranges from 0.0 to 1.0, with increments of 0.2.
* **Legend:**
* Blue: Generation
* Orange: Multiple-choice
* **Chart Title:** Not explicitly present.
### Detailed Analysis
The chart consists of six sets of paired bars, one for each model. The blue bar represents the "Generation" accuracy, and the orange bar represents the "Multiple-choice" accuracy.
* **DeepSeek-R1:** Generation accuracy is approximately 0.72. Multiple-choice accuracy is approximately 0.64.
* **Llama-3.1-8B:** Generation accuracy is approximately 0.84. Multiple-choice accuracy is approximately 0.72.
* **Qwen-2.5-14B:** Generation accuracy is approximately 0.88. Multiple-choice accuracy is approximately 0.82.
* **Qwen-2.5-3B:** Generation accuracy is approximately 0.77. Multiple-choice accuracy is approximately 0.70.
* **SmolLM2-1.7B:** Generation accuracy is approximately 0.72. Multiple-choice accuracy is approximately 0.16.
* **Gemini-2.0-Flash:** Generation accuracy is approximately 0.86. Multiple-choice accuracy is approximately 0.78.
The Generation bars generally trend upwards, with Qwen-2.5-14B showing the highest accuracy. The Multiple-choice bars show more variability.
### Key Observations
* Qwen-2.5-14B consistently demonstrates the highest accuracy in both Generation and Multiple-choice tasks.
* SmolLM2-1.7B exhibits a significant disparity between Generation and Multiple-choice accuracy, with very low performance on the Multiple-choice task.
* The Generation task generally yields higher accuracy scores compared to the Multiple-choice task across all models.
* The secondary label under DeepSeek-R1 reads "Distill-Llama-8B", the wrapped second line of the full model name DeepSeek-R1-Distill-Llama-8B.
### Interpretation
The data suggests that the Qwen-2.5-14B model is the most accurate among those tested, performing well on both Generation and Multiple-choice tasks. The large gap for SmolLM2-1.7B on the Multiple-choice task could indicate a weakness in selecting the correct answer from a given set of options, even though its generation performance is comparable to other models. The consistently higher accuracy scores for the Generation task across all models might indicate that these models are generally better at producing free-form answers than at evaluating pre-defined options. The label "Distill-Llama-8B" under DeepSeek-R1 is simply the wrapped second line of the full model name, DeepSeek-R1-Distill-Llama-8B. The chart provides a comparative performance overview of these language models, which can be valuable for selecting the most appropriate model for specific natural language processing applications.
</details>
(d) Baseline
<details>
<summary>x24.png Details</summary>

### Visual Description
## Bar Chart: Model Accuracy Comparison
### Overview
This image presents a bar chart comparing the accuracy of several language models on two different task types: "Generation" and "Multiple-choice". Accuracy is reported as a proportion ranging from 0.0 to 1.0. The chart displays the accuracy scores for each model and task type using adjacent bars.
### Components/Axes
* **X-axis:** Model Names - DeepSeek-R1, Llama-3.1-8B, Qwen-2.5-14B, Qwen-2.5-3B, SmolLM2-1.7B, Gemini-2.0-Flash. Below "DeepSeek-R1" is a secondary label, "Distill-Llama-8B", the wrapped second line of the full model name DeepSeek-R1-Distill-Llama-8B.
* **Y-axis:** Accuracy (%) - Scale ranges from 0.0 to 1.0, with increments of 0.2.
* **Legend:** Located at the bottom-center of the chart.
* Blue: Generation
* Orange: Multiple-choice
### Detailed Analysis
The chart consists of six sets of paired bars, one for each model. The blue bars represent the "Generation" accuracy, and the orange bars represent the "Multiple-choice" accuracy.
* **DeepSeek-R1 (Distill-Llama-8B):**
* Generation: Approximately 0.64 (±0.02)
* Multiple-choice: Approximately 0.60 (±0.02)
* **Llama-3.1-8B:**
* Generation: Approximately 0.83 (±0.02)
* Multiple-choice: Approximately 0.72 (±0.02)
* **Qwen-2.5-14B:**
* Generation: Approximately 0.86 (±0.02)
* Multiple-choice: Approximately 0.78 (±0.02)
* **Qwen-2.5-3B:**
* Generation: Approximately 0.90 (±0.02)
* Multiple-choice: Approximately 0.68 (±0.02)
* **SmolLM2-1.7B:**
* Generation: Approximately 0.68 (±0.02)
* Multiple-choice: Approximately 0.20 (±0.02)
* **Gemini-2.0-Flash:**
* Generation: Approximately 0.92 (±0.02)
* Multiple-choice: Approximately 0.84 (±0.02)
The "Generation" bars generally trend upwards from left to right, with the exception of DeepSeek-R1 and SmolLM2-1.7B. The "Multiple-choice" bars show more variability.
### Key Observations
* Gemini-2.0-Flash exhibits the highest accuracy for both "Generation" (approximately 0.92) and "Multiple-choice" (approximately 0.84).
* SmolLM2-1.7B performs poorly on the "Multiple-choice" task, with an accuracy of only approximately 0.20.
* Qwen-2.5-3B has the highest Generation accuracy, at approximately 0.90.
* The "Generation" accuracy is consistently higher than the "Multiple-choice" accuracy for most models.
### Interpretation
The chart demonstrates a clear difference in performance between the models on the two task types. Gemini-2.0-Flash outperforms the other models on both, suggesting it is the most capable model in this comparison. The gap between "Generation" and "Multiple-choice" accuracy suggests that these models may be better suited to generative tasks than to selecting from pre-defined options. The low "Multiple-choice" performance of SmolLM2-1.7B could indicate a weakness in its ability to understand and reason about the given options. The secondary label "Distill-Llama-8B" under "DeepSeek-R1" is the wrapped second line of the full model name, DeepSeek-R1-Distill-Llama-8B. The broadly increasing Generation accuracy from left to right suggests a correlation between model scale and performance on this task.
</details>
(e) CoT
<details>
<summary>x25.png Details</summary>

### Visual Description
## Bar Chart: Accuracy Comparison of Language Models
### Overview
This bar chart compares the accuracy of several language models on two different tasks: "Generation" and "Multiple-choice". Accuracy is reported as a proportion ranging from 0 to 1. The chart displays the accuracy for each model and task using adjacent bars.
### Components/Axes
* **X-axis:** Represents the language models being compared: DeepSeek-R1, Llama-3.1-8B, Qwen2.5-14B, Qwen2.5-3B, SmolLM2-1.7B, Gemini-2.0-Flash. Below DeepSeek-R1 is the text "Distill-Llama-8B", the wrapped second line of the full model name DeepSeek-R1-Distill-Llama-8B.
* **Y-axis:** Represents the accuracy, labeled as "Accuracy (%)". The scale ranges from 0.0 to 0.9, with increments of 0.2.
* **Legend:** Located at the bottom-right of the chart.
* **Blue:** Represents "Generation" accuracy.
* **Orange:** Represents "Multiple-choice" accuracy.
### Detailed Analysis
The chart consists of six sets of paired bars, one for each language model.
* **DeepSeek-R1:** Generation accuracy is approximately 0.86. Multiple-choice accuracy is approximately 0.72.
* **Llama-3.1-8B:** Generation accuracy is approximately 0.74. Multiple-choice accuracy is approximately 0.73.
* **Qwen2.5-14B:** Generation accuracy is approximately 0.81. Multiple-choice accuracy is approximately 0.76.
* **Qwen2.5-3B:** Generation accuracy is approximately 0.89. Multiple-choice accuracy is approximately 0.69.
* **SmolLM2-1.7B:** Generation accuracy is approximately 0.46. Multiple-choice accuracy is approximately 0.22.
* **Gemini-2.0-Flash:** Generation accuracy is approximately 0.90. Multiple-choice accuracy is approximately 0.82.
The "Generation" bars (blue) generally trend higher than the "Multiple-choice" bars (orange) for most models.
### Key Observations
* Gemini-2.0-Flash exhibits the highest accuracy for both Generation (approximately 0.90) and Multiple-choice (approximately 0.82).
* SmolLM2-1.7B shows the lowest accuracy for both tasks, with a Generation accuracy of approximately 0.46 and a Multiple-choice accuracy of approximately 0.22.
* Qwen2.5-3B has a notably high Generation accuracy (approximately 0.89) compared to its Multiple-choice accuracy (approximately 0.69).
* Llama-3.1-8B has nearly identical accuracy for both tasks, around 0.73-0.74.
### Interpretation
The data suggests that performance varies significantly with both the task format and the specific model. The "Generation" task appears to be generally easier for these models than the "Multiple-choice" task, as evidenced by the consistently higher Generation accuracy scores. Gemini-2.0-Flash stands out as the most accurate model across both tasks, while SmolLM2-1.7B lags behind. The gap between Generation and Multiple-choice accuracy for Qwen2.5-3B could indicate a strength in open-ended text creation versus constrained selection. The text "Distill-Llama-8B" below DeepSeek-R1 is the wrapped second line of the full model name, DeepSeek-R1-Distill-Llama-8B.
</details>
(f) SC
Figure 5: Cumulative accuracy (after final self-correction iteration) using different models on (top) DisambiguationQA and (bottom) tinyTruthfulQA. The results indicate that models perform completely differently on self-correction of generation and multiple-choice questions, depending on the dataset.
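Concretely, the cumulative accuracy reported in these panels is the accuracy of the answer a model retains after its final self-correction iteration. The following is a minimal sketch of how it could be computed from per-question answer logs; the data layout and function name are illustrative assumptions, not the released code.

```python
# Illustrative sketch: cumulative accuracy after the final self-correction
# iteration. The per-question record format here is an assumption made for
# illustration, not the paper's actual data structure.

def cumulative_accuracy(records):
    """records: list of dicts with keys 'answers' (one answer per
    self-correction iteration, in order) and 'gold' (reference answer).
    Returns the accuracy of the answer kept after the final iteration."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["answers"][-1] == r["gold"])
    return correct / len(records)

# Toy example: three questions, three self-correction iterations each.
generation_log = [
    {"answers": ["B", "A", "A"], "gold": "A"},  # corrected at iteration 2
    {"answers": ["C", "C", "C"], "gold": "C"},  # stable and correct
    {"answers": ["A", "A", "D"], "gold": "A"},  # flipped to wrong at the end
]
print(cumulative_accuracy(generation_log))  # 2 of 3 final answers are correct
```

The same function applied separately to generation-format and multiple-choice-format logs yields the paired bars shown in each panel.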
### C.1 Results on Correct and Incorrect Flips
Figures 6-11 show the correct and incorrect flips on different datasets and models.
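A correct flip turns a previously wrong answer into the right one between consecutive self-correction iterations, and an incorrect flip does the reverse. A minimal sketch of how such per-iteration flip proportions could be computed, assuming an illustrative per-question answer log (not the released code):

```python
# Illustrative sketch: per-iteration correct/incorrect flip proportions.
# A flip compares iteration t's answer with iteration t-1's answer; the
# list-of-dicts layout is an assumption made for illustration.

def flip_proportions(records, num_iters):
    """For each iteration t >= 1, return the fraction of questions whose
    answer flipped from wrong to right (correct flip) and from right to
    wrong (incorrect flip) relative to iteration t-1."""
    n = len(records)
    correct_flips, incorrect_flips = [], []
    for t in range(1, num_iters):
        c = sum(1 for r in records
                if r["answers"][t - 1] != r["gold"] and r["answers"][t] == r["gold"])
        i = sum(1 for r in records
                if r["answers"][t - 1] == r["gold"] and r["answers"][t] != r["gold"])
        correct_flips.append(c / n)
        incorrect_flips.append(i / n)
    return correct_flips, incorrect_flips

records = [
    {"answers": ["B", "A", "A"], "gold": "A"},  # wrong -> right at t=1
    {"answers": ["C", "C", "D"], "gold": "C"},  # right -> wrong at t=2
    {"answers": ["A", "A", "A"], "gold": "A"},  # never flips
]
cf, icf = flip_proportions(records, 3)
print(cf)   # correct-flip proportion at each transition
print(icf)  # incorrect-flip proportion at each transition
```

Plotting `cf` and `icf` against the iteration index produces the dashed curves in the figures below.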
<details>
<summary>x26.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations for SmolLM2-1.7B
### Overview
This line chart depicts the proportion of flips observed across different iterations for the SmolLM2-1.7B model, comparing the "Generation" and "Multiple-Choice" formats together with the "Correct Flip" and "Incorrect Flip" breakdowns. The x-axis represents the iteration number (1 to 5), and the y-axis represents the proportion of flips, ranging from 0 to 0.10.
### Components/Axes
* **Title:** SmolLM2-1.7B (positioned at the top-center)
* **X-axis Label:** Iterations (positioned at the bottom-center)
* Axis Markers: 1, 2, 3, 4, 5
* **Y-axis Label:** Proportion of Flips (positioned at the left-center)
* Axis Scale: 0.00, 0.02, 0.04, 0.06, 0.08, 0.10
* **Legend:** Located at the top-right corner.
* Generation (Blue Solid Line)
* Multiple-Choice (Orange Solid Line)
* Correct Flip (Black Dashed Line)
* Incorrect Flip (Blue Dashed Line)
### Detailed Analysis
* **Generation (Blue Solid Line):** The line starts at approximately 0.01 at iteration 1, rises to a peak of around 0.022 at iteration 2, then declines to approximately 0.015 at iteration 3, fluctuates around 0.018 at iteration 4, and ends at approximately 0.021 at iteration 5. The trend is generally flat with minor fluctuations.
* **Multiple-Choice (Orange Solid Line):** This line begins at approximately 0.083 at iteration 1, drops sharply to around 0.038 at iteration 2, decreases further to approximately 0.032 at iteration 3, rises to around 0.04 at iteration 4, and ends at approximately 0.035 at iteration 5. The trend is a steep initial decline followed by a relatively stable plateau.
* **Correct Flip (Black Dashed Line):** The line starts at approximately 0.003 at iteration 1, rises to around 0.018 at iteration 2, declines to approximately 0.012 at iteration 3, decreases to around 0.008 at iteration 4, and ends at approximately 0.001 at iteration 5. The trend is initially upward, then a consistent decline.
* **Incorrect Flip (Blue Dashed Line):** The line begins at approximately 0.001 at iteration 1, rises to around 0.002 at iteration 2, declines to approximately 0.001 at iteration 3, rises to around 0.003 at iteration 4, and ends at approximately 0.003 at iteration 5. The trend is relatively flat with minor fluctuations.
### Key Observations
* The "Multiple-Choice" method starts with the highest proportion of flips, significantly higher than the other methods.
* The proportion of flips for "Multiple-Choice" decreases dramatically in the first two iterations.
* "Correct Flip" shows an initial increase in flips, followed by a consistent decrease.
* "Generation" and "Incorrect Flip" remain relatively stable throughout the iterations, with low proportions of flips.
* The "Incorrect Flip" line is consistently near zero.
### Interpretation
The data suggests that under the multiple-choice format the model initially changes its answers at a high rate but quickly stabilizes as self-correction iterations progress. The generation format shows a more consistent, albeit low, rate of answer changes. The initial rise and subsequent decline of correct flips indicates that beneficial revisions are concentrated in the early iterations, while the consistently low incorrect-flip rate suggests the model rarely overturns an already correct answer.
Overall, SmolLM2-1.7B stabilizes quickly in the multiple-choice setting, as evidenced by the rapid decline in the proportion of flips, whereas the other curves remain flat at low rates. The difference in flip rates between the two formats suggests that the structure of the output space shapes the model's self-correction dynamics.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x27.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (Qwen2.5-3B)
### Overview
This line chart depicts the proportion of flips across different iterations for the "Generation" and "Multiple-Choice" formats, together with the "Correct Flip" and "Incorrect Flip" breakdowns. The chart evaluates the Qwen2.5-3B model over five iterations. The y-axis represents the "Proportion of Flips," ranging from 0.02 to 0.14, while the x-axis represents "Iterations," numbered from 1 to 5.
### Components/Axes
* **Title:** Qwen2.5-3B (top-center)
* **X-axis Label:** Iterations (bottom-center)
* **Y-axis Label:** Proportion of Flips (left-center)
* **Legend:** Located at the top-right of the chart.
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Black Circle-Line)
* Incorrect Flip (Black Dashed-Line)
* **X-axis Markers:** 1, 2, 3, 4, 5
* **Y-axis Markers:** 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14
### Detailed Analysis
Each line is analyzed individually below, noting trends and approximate data points.
* **Generation (Solid Blue Line):** This line starts at approximately 0.09 at Iteration 1, remains relatively stable around 0.10 at Iteration 2, drops to approximately 0.03 at Iteration 3, rises to around 0.05 at Iteration 4, and then decreases to approximately 0.04 at Iteration 5.
* **Multiple-Choice (Solid Orange Line):** This line begins at approximately 0.10 at Iteration 1, peaks at around 0.13 at Iteration 2, declines to approximately 0.06 at Iteration 3, increases to around 0.07 at Iteration 4, and then decreases to approximately 0.06 at Iteration 5.
* **Correct Flip (Black Circle-Line):** This line starts at approximately 0.09 at Iteration 1, drops to around 0.04 at Iteration 2, rises to approximately 0.07 at Iteration 3, decreases to around 0.05 at Iteration 4, and then declines to approximately 0.04 at Iteration 5.
* **Incorrect Flip (Black Dashed-Line):** This line begins at approximately 0.04 at Iteration 1, drops to around 0.03 at Iteration 2, rises to approximately 0.06 at Iteration 3, decreases to around 0.05 at Iteration 4, and then declines to approximately 0.03 at Iteration 5.
### Key Observations
* The "Multiple-Choice" method consistently exhibits the highest proportion of flips, particularly at Iteration 2.
* The "Incorrect Flip" line generally shows the lowest proportion of flips, although "Generation" dips below it at Iteration 3.
* The "Generation" and "Correct Flip" methods show similar trends, with fluctuations across iterations.
* All lines generally show a decreasing trend from Iteration 2 to Iteration 5, suggesting a convergence or stabilization of the model's behavior.
### Interpretation
The chart suggests that the multiple-choice format induces the most answer changes (flips) across self-correction iterations, while incorrect flips remain the least frequent overall, meaning the model rarely overturns a correct answer. The general decline from Iteration 2 onward indicates that Qwen2.5-3B's responses stabilize after the initial rounds of self-correction: the model continues to revise its answers, but at a decreasing rate. The higher flip rate under multiple-choice suggests that the constrained output space invites more answer switching than open-ended generation.
</details>
(b) Qwen2.5-3B
<details>
<summary>x28.png Details</summary>

### Visual Description
\n
## Line Chart: Proportion of Flips vs. Iterations (Llama-3.1-8B)
### Overview
This line chart depicts the proportion of flips across different iterations for the "Generation" and "Multiple-Choice" formats, together with the "Correct Flip" and "Incorrect Flip" breakdowns. The chart shows how these proportions change as self-correction iterates. The title "Llama-3.1-8B" indicates the model being evaluated.
### Components/Axes
* **X-axis:** Iterations (labeled 1 to 5).
* **Y-axis:** Proportion of Flips (scale from approximately 0.04 to 0.14).
* **Legend:** Located in the top-right corner.
* Generation (Blue solid line)
* Multiple-Choice (Orange dashed line)
* Correct Flip (Black solid line with circle markers)
* Incorrect Flip (Dark Blue dashed line with diamond markers)
### Detailed Analysis
The chart displays four data series, one for each legend entry.
* **Generation (Blue solid line):** The line starts at approximately 0.11 at iteration 1, decreases to around 0.07 at iteration 2, rises to approximately 0.10 at iteration 3, drops to a minimum of around 0.05 at iteration 4, and then increases slightly to approximately 0.07 at iteration 5.
* **Multiple-Choice (Orange dashed line):** This line begins at approximately 0.10 at iteration 1, increases sharply to a peak of around 0.14 at iteration 3, decreases to approximately 0.11 at iteration 4, and then declines to around 0.08 at iteration 5.
* **Correct Flip (Black solid line with circle markers):** Starts at approximately 0.11 at iteration 1, decreases to around 0.08 at iteration 2, remains relatively stable at around 0.07-0.08 for iterations 3 and 4, and then drops to approximately 0.06 at iteration 5.
* **Incorrect Flip (Dark Blue dashed line with diamond markers):** Begins at approximately 0.07 at iteration 1, increases to around 0.09 at iteration 2, decreases to approximately 0.07 at iteration 3, falls to around 0.06 at iteration 4, and drops to a minimum of around 0.04 at iteration 5.
Here's a table summarizing the approximate values:
| Iteration | Generation | Multiple-Choice | Correct Flip | Incorrect Flip |
|---|---|---|---|---|
| 1 | 0.11 | 0.10 | 0.11 | 0.07 |
| 2 | 0.07 | 0.08 | 0.08 | 0.09 |
| 3 | 0.10 | 0.14 | 0.07 | 0.07 |
| 4 | 0.05 | 0.11 | 0.08 | 0.06 |
| 5 | 0.07 | 0.08 | 0.06 | 0.04 |
### Key Observations
* The Multiple-Choice method exhibits the highest proportion of flips, peaking at iteration 3.
* The Generation method shows a generally decreasing trend, with a slight increase at the final iteration.
* The Incorrect Flip method has the lowest proportion of flips overall, and decreases significantly at iteration 5.
* The Correct Flip method remains relatively stable throughout the iterations.
### Interpretation
The data suggests that the multiple-choice format produces the most answer changes (flips) for Llama-3.1-8B during self-correction, peaking at iteration 3. The early decrease in the Generation and Correct Flip proportions could indicate that the model's responses begin to converge after the first revision. The sharp decline in incorrect flips at iteration 5 might suggest that the model becomes less prone to overturning correct answers as iterations progress. The fluctuations across all series could be due to the inherent stochasticity of the model's outputs. The consistently higher flip rate under multiple-choice suggests greater answer instability when the model must choose among fixed options.
</details>
(c) Llama-3.1-8B
<details>
<summary>x29.png Details</summary>

### Visual Description
\n
## Line Chart: Proportion of Flips vs. Iterations (Qwen2.5-14B)
### Overview
This line chart depicts the proportion of flips observed across different iterations for the "Generation" and "Multiple-Choice" formats, together with the "Correct Flip" and "Incorrect Flip" breakdowns. The chart illustrates how the frequency of flips changes as self-correction iterates. The title "Qwen2.5-14B" indicates the model being evaluated.
### Components/Axes
* **X-axis:** Iterations (labeled 1 to 5).
* **Y-axis:** Proportion of Flips (scale from 0.00 to 0.10, increments of 0.02).
* **Legend:** Located in the top-right corner.
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black dashed line)
* Incorrect Flip (Black dotted line)
* **Gridlines:** Present to aid in reading values.
### Detailed Analysis
Each line is analyzed individually below, noting trends and approximate data points.
* **Generation (Blue):** This line starts at approximately 0.085 at Iteration 1, sharply decreases to around 0.06 at Iteration 2, continues to decrease to approximately 0.02 at Iteration 3, slightly increases to around 0.03 at Iteration 4, and then decreases again to approximately 0.025 at Iteration 5. The overall trend is decreasing, but with a slight fluctuation around Iteration 4.
* **Multiple-Choice (Orange):** This line begins at approximately 0.05 at Iteration 1, drops to around 0.02 at Iteration 2, continues to decrease to approximately 0.01 at Iteration 3, dips to around 0.005 at Iteration 4, and then slightly increases to approximately 0.01 at Iteration 5. This line shows a consistent downward trend.
* **Correct Flip (Black dashed):** This line starts at approximately 0.075 at Iteration 1, decreases to around 0.04 at Iteration 2, continues to decrease to approximately 0.02 at Iteration 3, remains relatively stable at around 0.02 at Iteration 4, and then decreases slightly to approximately 0.015 at Iteration 5.
* **Incorrect Flip (Black dotted):** This line begins at approximately 0.04 at Iteration 1, decreases to around 0.02 at Iteration 2, continues to decrease to approximately 0.01 at Iteration 3, remains relatively stable at around 0.01 at Iteration 4, and then decreases slightly to approximately 0.005 at Iteration 5.
### Key Observations
* All four lines demonstrate a decreasing trend in the proportion of flips as the number of iterations increases.
* The "Generation" method consistently exhibits the highest proportion of flips throughout all iterations.
* The "Incorrect Flip" method consistently exhibits the lowest proportion of flips throughout all iterations.
* The "Multiple-Choice" method shows the most consistent and rapid decline in flips.
* The "Correct Flip" and "Incorrect Flip" lines converge towards the end of the iterations.
### Interpretation
The data shows that as self-correction iterates, the frequency of flips (changes in the model's answer between consecutive iterations) decreases across the board. The Generation format, while starting with the highest flip rate, still shows a steady reduction over iterations. The rapid decline for Multiple-Choice suggests that this format converges to a stable answer quickly. The convergence of the Correct Flip and Incorrect Flip curves in the later iterations indicates that the few remaining answer changes are roughly as likely to hurt as to help, a diminishing return for additional self-correction rounds. Overall, the iterative process stabilizes Qwen2.5-14B's outputs, with most of the productive revisions occurring in the early iterations.
</details>
(d) Qwen2.5-14B
<details>
<summary>x30.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B Performance
### Overview
This line chart depicts the proportion of flips across different iterations for a model named DeepSeek-R1-Distill-Llama-8B. The chart compares the performance of "Generation" and "Multiple-Choice" methods against "Correct Flip" and "Incorrect Flip" outcomes. The x-axis represents iterations (1 to 5), and the y-axis represents the proportion of flips, ranging from 0.00 to 0.08.
### Components/Axes
* **Title:** DeepSeek-R1-Distill-Llama-8B
* **X-axis Label:** Iterations (with markers at 1, 2, 3, 4, and 5)
* **Y-axis Label:** Proportion of Flips (with markers at 0.00, 0.02, 0.04, 0.06, and 0.08)
* **Legend:**
* Generation (Blue Solid Line)
* Multiple-Choice (Orange Solid Line)
* Correct Flip (Black Dashed Line)
* Incorrect Flip (Blue Dashed Line)
### Detailed Analysis
* **Generation (Blue Solid Line):** This line starts at approximately 0.034 at iteration 1, decreases to around 0.018 at iteration 2, fluctuates around 0.024 at iteration 3, drops sharply to nearly 0.00 at iteration 4, and then rises to approximately 0.022 at iteration 5.
* **Multiple-Choice (Orange Solid Line):** This line begins at approximately 0.062 at iteration 1, decreases to around 0.056 at iteration 2, decreases to approximately 0.048 at iteration 3, rises to approximately 0.060 at iteration 4, and then decreases to approximately 0.050 at iteration 5.
* **Correct Flip (Black Dashed Line):** This line starts at approximately 0.026 at iteration 1, remains relatively stable around 0.022-0.024 from iterations 2 to 3, increases to approximately 0.030 at iteration 4, and then decreases to approximately 0.020 at iteration 5.
* **Incorrect Flip (Blue Dashed Line):** This line begins at approximately 0.022 at iteration 1, remains relatively stable around 0.022-0.024 from iterations 2 to 3, increases to approximately 0.030 at iteration 4, and then decreases to approximately 0.020 at iteration 5.
### Key Observations
* The "Multiple-Choice" method consistently exhibits a higher proportion of flips compared to the "Generation" method throughout all iterations.
* The "Generation" method shows a significant drop in the proportion of flips at iteration 4, followed by a slight recovery at iteration 5.
* The "Correct Flip" and "Incorrect Flip" lines track each other closely, suggesting a roughly equal distribution of correct and incorrect flips.
### Interpretation
The data suggests that the "Multiple-Choice" format is more prone to flips (changes in the model's answer between iterations) than the "Generation" format for the DeepSeek-R1-Distill-Llama-8B model. The sharp decrease in flips for the "Generation" format at iteration 4 could indicate a stabilization or convergence of the model's output during that iteration. The similar trends of "Correct Flip" and "Incorrect Flip" suggest that beneficial and harmful revisions occur at roughly the same rate, without a systematic bias toward one or the other. The overall trends indicate that the model's behavior evolves over the five iterations, with some formats exhibiting more stability than others. The chart provides insights into the dynamics of the model's self-correction process and the relative robustness of the two answer formats.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x31.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Performance
### Overview
This image presents a line chart illustrating the "Proportion of Flips" across five "Iterations" for the "Generation" and "Multiple-Choice" formats, together with the "Correct Flip" and "Incorrect Flip" breakdowns. The chart tracks how often the model changes its answer during iterative self-correction.
### Components/Axes
* **Title:** Gemini-2.0-Flash (top-center)
* **X-axis:** Iterations (labeled 1 to 5, bottom-center)
* **Y-axis:** Proportion of Flips (labeled from 0.00 to 0.07, left-center)
* **Legend:** Located in the top-right corner.
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black dashed line)
* Incorrect Flip (Black dashed-dotted line)
* **Gridlines:** Horizontal and vertical gridlines are present to aid in reading values.
### Detailed Analysis
Each line is analyzed individually below, noting trends and approximate data points.
* **Generation (Blue Solid Line):** This line initially slopes downward from Iteration 1 to Iteration 4, then increases slightly in Iteration 5.
* Iteration 1: Approximately 0.042
* Iteration 2: Approximately 0.038
* Iteration 3: Approximately 0.040
* Iteration 4: Approximately 0.028
* Iteration 5: Approximately 0.036
* **Multiple-Choice (Orange Solid Line):** This line exhibits a strong downward trend, decreasing significantly from Iteration 1 to Iteration 5.
* Iteration 1: Approximately 0.062
* Iteration 2: Approximately 0.048
* Iteration 3: Approximately 0.010
* Iteration 4: Approximately 0.010
* Iteration 5: Approximately 0.002
* **Correct Flip (Black Dashed Line):** This line shows a relatively stable pattern, with slight fluctuations.
* Iteration 1: Approximately 0.034
* Iteration 2: Approximately 0.032
* Iteration 3: Approximately 0.032
* Iteration 4: Approximately 0.024
* Iteration 5: Approximately 0.032
* **Incorrect Flip (Black Dashed-Dotted Line):** This line generally decreases, with a slight increase in Iteration 5.
* Iteration 1: Approximately 0.016
* Iteration 2: Approximately 0.014
* Iteration 3: Approximately 0.012
* Iteration 4: Approximately 0.012
* Iteration 5: Approximately 0.018
### Key Observations
* The "Multiple-Choice" method shows the most significant decrease in the "Proportion of Flips" over the iterations, suggesting rapid convergence or stabilization.
* The "Generation" method exhibits a more gradual decrease, with a slight increase in the final iteration.
* "Correct Flip" and "Incorrect Flip" remain relatively stable throughout the iterations.
* The "Incorrect Flip" proportion is consistently lower than the "Correct Flip" proportion.
### Interpretation
The "Proportion of Flips" represents the fraction of questions on which the model changes its answer at each self-correction iteration. The rapid decrease for "Multiple-Choice" indicates that this format quickly settles on stable answers, potentially because the constrained option set leaves less room for revision. The more gradual change for "Generation" suggests that open-ended answers continue to be revised across more iterations. The relatively stable "Correct Flip" and "Incorrect Flip" proportions suggest that the remaining revisions help and hurt at a roughly constant rate, and this balance does not change dramatically over the iterations. The slight increase in "Incorrect Flip" at Iteration 5 could indicate diminishing, or even negative, returns from additional iterations. Overall, most of the stabilization for Gemini-2.0-Flash happens in the first few iterations.
</details>
(f) Gemini-2.0-Flash
Figure 6: Models Correct and Incorrect Flips on Baseline on DisambiguationQA
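The flip quantities plotted in Figures 6 and 7 can be computed directly from per-iteration answer logs. A minimal sketch (the data layout, `flip_stats`, and the toy inputs are illustrative assumptions, not the paper's released code):

```python
def flip_stats(answers, gold):
    """Flip proportions between consecutive self-correction iterations.

    answers: per-iteration answer lists; answers[t][i] is the model's answer
             to question i at iteration t (hypothetical layout).
    gold:    reference answers, one per question.
    Returns one dict per transition t -> t+1 with the proportion of questions
    whose answer changed at all, changed wrong->right (correct flip), or
    changed right->wrong (incorrect flip).
    """
    n = len(gold)
    stats = []
    for prev, curr in zip(answers, answers[1:]):
        stats.append({
            "flip": sum(p != c for p, c in zip(prev, curr)) / n,
            "correct_flip": sum(p != g and c == g
                                for p, c, g in zip(prev, curr, gold)) / n,
            "incorrect_flip": sum(p == g and c != g
                                  for p, c, g in zip(prev, curr, gold)) / n,
        })
    return stats

# Toy example: 4 questions, 2 iterations.
answers = [["A", "B", "C", "D"],   # initial answers
           ["A", "C", "C", "A"]]   # answers after one refinement round
gold = ["A", "C", "C", "C"]
print(flip_stats(answers, gold))
# -> [{'flip': 0.5, 'correct_flip': 0.25, 'incorrect_flip': 0.0}]
```

Note that a flip that replaces one wrong answer with another wrong answer counts toward "flip" but toward neither the correct nor the incorrect category, which is why the two dashed curves need not sum to the solid ones.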
<details>
<summary>x32.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations for SmolLM2-1.7B
### Overview
This line chart depicts the proportion of flips observed across different iterations for the SmolLM2-1.7B model. The chart compares the "Generation" and "Multiple-Choice" settings and separately tracks "Correct" and "Incorrect" flips. The x-axis represents the iteration number (1 to 5), and the y-axis represents the proportion of flips, ranging from 0.00 to 0.07.
### Components/Axes
* **Title:** SmolLM2-1.7B
* **X-axis Label:** Iterations (with markers at 1, 2, 3, 4, 5)
* **Y-axis Label:** Proportion of Flips (with markers at 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07)
* **Legend:**
* Generation (Blue Solid Line)
* Multiple-Choice (Orange Solid Line)
* Correct Flip (Black Dashed-Dotted Line)
* Incorrect Flip (Black Dashed Line)
### Detailed Analysis
The chart displays four data series, each representing a different condition.
* **Generation (Blue Solid Line):** This line starts at approximately 0.016 at iteration 1, decreases to approximately 0.011 at iteration 2, rises to approximately 0.032 at iteration 3, drops sharply to approximately 0.002 at iteration 4, and then increases slightly to approximately 0.006 at iteration 5.
* **Multiple-Choice (Orange Solid Line):** This line begins at approximately 0.063 at iteration 1, decreases to approximately 0.028 at iteration 2, increases to approximately 0.042 at iteration 3, decreases to approximately 0.031 at iteration 4, and finally drops to approximately 0.004 at iteration 5.
* **Correct Flip (Black Dashed-Dotted Line):** This line starts at approximately 0.023 at iteration 1, decreases to approximately 0.010 at iteration 2, decreases further to approximately 0.008 at iteration 3, increases slightly to approximately 0.012 at iteration 4, and then decreases to approximately 0.009 at iteration 5.
* **Incorrect Flip (Black Dashed Line):** This line begins at approximately 0.021 at iteration 1, decreases to approximately 0.011 at iteration 2, decreases to approximately 0.007 at iteration 3, increases to approximately 0.011 at iteration 4, and then decreases to approximately 0.007 at iteration 5.
### Key Observations
* The "Multiple-Choice" method consistently exhibits a higher proportion of flips compared to the "Generation" method, especially in the initial iterations.
* Both "Multiple-Choice" and "Generation" show a decreasing trend in the proportion of flips as the number of iterations increases, suggesting improvement in model stability or convergence.
* The "Correct Flip" line remains relatively stable throughout the iterations, while the "Incorrect Flip" line also shows a decreasing trend.
* The most significant drop in both "Generation" and "Multiple-Choice" occurs between iterations 3 and 4.
### Interpretation
The data suggests that the SmolLM2-1.7B model's answers stabilize (as indicated by a decreasing proportion of flips) over the iterations. The "Multiple-Choice" setting initially shows a higher flip rate than "Generation," suggesting that the model revises its option choices more readily at first. The decrease in both "Correct" and "Incorrect" flips over iterations suggests that the model becomes more committed to its answers. The sharp drop between iterations 3 and 4 could mark the point where most answers have converged. The closeness of the "Correct" and "Incorrect" flip lines indicates that revisions are roughly as likely to hurt as to help, so the net benefit of further iterations is limited. The chart thus illustrates the self-correction dynamics of the model under the two output formats ("Generation" vs. "Multiple-Choice").
</details>
(a) SmolLM2-1.7B
<details>
<summary>x33.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (Qwen2.5-3B)
### Overview
This image presents a line chart illustrating the proportion of flips across different iterations for various methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart appears to be evaluating the performance of a model (Qwen2.5-3B) over five iterations.
### Components/Axes
* **Title:** Qwen2.5-3B (positioned at the top-center)
* **X-axis:** Iterations (labeled 1 to 5, with equal spacing)
* **Y-axis:** Proportion of Flips (scale ranging from 0.00 to 0.10, with increments of 0.02)
* **Legend:** Located at the top-right corner, containing the following labels and corresponding line styles/colors:
* Generation (solid blue line)
* Multiple-Choice (solid orange line)
* Correct Flip (solid black line with circular markers)
* Incorrect Flip (dashed black line with square markers)
### Detailed Analysis
The chart displays four data series, each representing a different method.
* **Generation (Blue Line):** The line starts at approximately 0.065 at Iteration 1, rises to a peak of approximately 0.08 at Iteration 2, then sharply declines to approximately 0.02 at Iteration 3. It continues to decrease to approximately 0.01 at Iteration 4, and finally rises slightly to approximately 0.03 at Iteration 5.
* **Multiple-Choice (Orange Line):** The line begins at approximately 0.08 at Iteration 1, drops to approximately 0.04 at Iteration 2, remains relatively stable at approximately 0.04 to 0.05 through Iteration 3, increases to approximately 0.065 at Iteration 4, and then decreases to approximately 0.06 at Iteration 5.
* **Correct Flip (Black Line with Circles):** The line starts at approximately 0.055 at Iteration 1, rises to approximately 0.07 at Iteration 2, decreases to approximately 0.055 at Iteration 3, increases to approximately 0.06 at Iteration 4, and then decreases to approximately 0.04 at Iteration 5.
* **Incorrect Flip (Dashed Black Line with Squares):** The line begins at approximately 0.05 at Iteration 1, rises to approximately 0.06 at Iteration 2, decreases to approximately 0.05 at Iteration 3, increases to approximately 0.055 at Iteration 4, and then increases to approximately 0.06 at Iteration 5.
### Key Observations
* The "Generation" method exhibits the most significant fluctuation, with a substantial drop in the proportion of flips between Iteration 2 and Iteration 3.
* The "Multiple-Choice" method shows a relatively stable trend, with a slight increase towards the end of the iterations.
* "Correct Flip" and "Incorrect Flip" methods show relatively stable trends, with minor fluctuations throughout the iterations.
* The "Generation" method starts with a relatively high proportion of flips, but this decreases significantly over the iterations.
### Interpretation
The chart suggests that the "Generation" setting initially produces a higher rate of flips, indicating greater variability in the open-ended outputs, and that this rate decreases substantially over the iterations, consistent with the answers converging. The "Multiple-Choice" setting maintains a more consistent proportion of flips, suggesting that the model keeps reconsidering its option choice even in later iterations. The "Correct Flip" and "Incorrect Flip" curves stay close to each other throughout, so revisions help and hurt at roughly comparable rates.
The data implies that Qwen2.5-3B adjusts its answers substantially in the early iterations, particularly under "Generation." The differences between the settings likely reflect the different output spaces: free-form text can be revised in many ways, whereas a multiple-choice answer can only switch between a few options.
</details>
(b) Qwen2.5-3B
<details>
<summary>x34.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B Performance Over Iterations
### Overview
This line chart depicts the proportion of flips for different methods (Generation, Multiple-Choice, Correct Flip, Incorrect Flip) across five iterations. The chart aims to illustrate how these methods perform in terms of flipping outcomes as the number of iterations increases. The title "Llama-3.1-8B" suggests this data relates to a model with that specific configuration.
### Components/Axes
* **X-axis:** Iterations (labeled 1 to 5).
* **Y-axis:** Proportion of Flips (scale from 0.04 to 0.18).
* **Legend:** Located in the top-right corner, defining the lines:
* Generation (Blue solid line)
* Multiple-Choice (Orange dashed line)
* Correct Flip (Black solid line with circle markers)
* Incorrect Flip (Black dashed line with plus markers)
* **Title:** "Llama-3.1-8B" positioned at the top-center.
* **Gridlines:** Horizontal gridlines are present to aid in reading values.
### Detailed Analysis
Here's a breakdown of each line's trend and approximate data points:
* **Generation (Blue solid line):** The line generally slopes downward.
* Iteration 1: Approximately 0.095
* Iteration 2: Approximately 0.09
* Iteration 3: Approximately 0.075
* Iteration 4: Approximately 0.08
* Iteration 5: Approximately 0.05
* **Multiple-Choice (Orange dashed line):** This line rises slightly through Iteration 3 and then declines.
* Iteration 1: Approximately 0.14
* Iteration 2: Approximately 0.145
* Iteration 3: Approximately 0.15
* Iteration 4: Approximately 0.13
* Iteration 5: Approximately 0.10
* **Correct Flip (Black solid line with circle markers):** This line shows a slight downward trend with some variation.
* Iteration 1: Approximately 0.09
* Iteration 2: Approximately 0.085
* Iteration 3: Approximately 0.08
* Iteration 4: Approximately 0.08
* Iteration 5: Approximately 0.06
* **Incorrect Flip (Black dashed line with plus markers):** This line also shows a downward trend, becoming more pronounced towards the end.
* Iteration 1: Approximately 0.085
* Iteration 2: Approximately 0.08
* Iteration 3: Approximately 0.07
* Iteration 4: Approximately 0.07
* Iteration 5: Approximately 0.04
### Key Observations
* The "Multiple-Choice" method consistently has the highest proportion of flips throughout all iterations.
* Both "Generation" and "Incorrect Flip" show a clear decreasing trend in the proportion of flips as iterations increase.
* "Correct Flip" remains relatively stable, with a slight decrease over time.
* The "Generation" and "Incorrect Flip" lines converge towards the end of the iterations.
### Interpretation
The data suggests that as the iterations progress, "Generation" flips and "Incorrect Flips" both become less frequent, indicating that the model's answers are stabilizing and that fewer revisions introduce new errors. The consistently high proportion of flips under "Multiple-Choice" suggests that this setting is more prone to answer changes, with the model repeatedly switching between options. The relatively stable "Correct Flip" rate indicates a steady share of beneficial revisions that does not grow with iterations. The convergence of the "Generation" and "Incorrect Flip" lines at Iteration 5 shows that both quantities reach similarly low levels by the final round. Overall, Llama-3.1-8B appears to grow more stable over the observed iterations.
</details>
(c) Llama-3.1-8B
<details>
<summary>x35.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (Qwen2.5-14B)
### Overview
This line chart depicts the proportion of flips observed across different iterations for two methods: Generation and Multiple-Choice, and further categorized into Correct Flip and Incorrect Flip. The chart aims to illustrate how the frequency of flips changes as the process iterates. The title "Qwen2.5-14B" suggests this data relates to a model or experiment using that specific configuration.
### Components/Axes
* **X-axis:** Iterations (labeled 1 to 5).
* **Y-axis:** Proportion of Flips (scale from 0.00 to 0.08).
* **Legend:** Located in the top-right corner.
* **Generation:** Solid Blue Line
* **Multiple-Choice:** Solid Orange Line
* **Correct Flip:** Black Dashed Line
* **Incorrect Flip:** Black Solid Line
### Detailed Analysis
The chart displays four distinct lines representing the proportion of flips for each category.
* **Generation (Blue Line):** Starts at approximately 0.07, sharply decreases to around 0.04 at iteration 2, drops to approximately 0.00 at iteration 3, rises to around 0.02 at iteration 4, and finally increases to approximately 0.03 at iteration 5.
* **Multiple-Choice (Orange Line):** Begins at approximately 0.06, steadily declines to around 0.02 at iteration 2, continues to decrease to approximately 0.01 at iteration 4, and then slightly increases to around 0.02 at iteration 5.
* **Correct Flip (Black Dashed Line):** Starts at approximately 0.04, decreases to around 0.02 at iteration 2, remains relatively stable at around 0.02 until iteration 4, and then increases to approximately 0.03 at iteration 5.
* **Incorrect Flip (Black Solid Line):** Starts at approximately 0.04, decreases to around 0.02 at iteration 2, drops to approximately 0.00 at iteration 3, rises to around 0.02 at iteration 4, and finally increases to approximately 0.03 at iteration 5.
Here's a more detailed breakdown of approximate values at each iteration:
| Iteration | Generation | Multiple-Choice | Correct Flip | Incorrect Flip |
|---|---|---|---|---|
| 1 | 0.07 | 0.06 | 0.04 | 0.04 |
| 2 | 0.04 | 0.02 | 0.02 | 0.02 |
| 3 | 0.00 | 0.01 | 0.02 | 0.00 |
| 4 | 0.02 | 0.01 | 0.02 | 0.02 |
| 5 | 0.03 | 0.02 | 0.03 | 0.03 |
### Key Observations
* The "Generation" method exhibits a significant drop in the proportion of flips around iteration 3, reaching a minimum.
* The "Multiple-Choice" method shows a more gradual and consistent decline in the proportion of flips across all iterations.
* Both "Correct Flip" and "Incorrect Flip" lines remain relatively low throughout the iterations.
* The "Generation" and "Incorrect Flip" lines show a similar trend, decreasing sharply at iteration 3 and then increasing slightly in the final iterations.
### Interpretation
The data suggests that the "Generation" setting undergoes a more substantial change in flip behavior than "Multiple-Choice." The sharp drop in "Generation" flips at iteration 3 could indicate that the answers briefly stabilize before the model resumes revising them in later iterations. The "Multiple-Choice" setting shows a more consistent, gradual decline. Because the "Correct Flip" and "Incorrect Flip" lines track each other closely, revisions appear to be roughly as likely to introduce errors as to fix them, which limits the net gain from iterating. Overall, Qwen2.5-14B reduces its flip frequency over the iterations, with the "Generation" setting showing the more dramatic change.
</details>
(d) Qwen2.5-14B
<details>
<summary>x36.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B Performance
### Overview
This image presents a line chart illustrating the "Proportion of Flips" across five iterations for different methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart appears to be evaluating the performance of the DeepSeek-R1-Distill-Llama-8B model.
### Components/Axes
* **Title:** DeepSeek-R1-Distill-Llama-8B
* **X-axis:** Iterations (labeled 1 to 5)
* **Y-axis:** Proportion of Flips (scale from approximately 0.01 to 0.06)
* **Legend:**
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black solid line with circle markers)
* Incorrect Flip (Dark Blue dashed line with square markers)
### Detailed Analysis
The chart displays the proportion of flips for each method across the five iterations.
* **Generation (Blue):** The line starts at approximately 0.042 at iteration 1, dips to around 0.038 at iteration 2, rises to approximately 0.044 at iteration 3, decreases to 0.032 at iteration 4, and ends at approximately 0.034 at iteration 5. The trend is generally fluctuating around 0.04.
* **Multiple-Choice (Orange):** The line begins at approximately 0.043 at iteration 1, drops sharply to around 0.009 at iteration 2, peaks at approximately 0.056 at iteration 3, falls to approximately 0.022 at iteration 4, and rises to approximately 0.052 at iteration 5. This line exhibits the most significant fluctuations.
* **Correct Flip (Black):** The line starts at approximately 0.027 at iteration 1, decreases to approximately 0.021 at iteration 2, rises to approximately 0.033 at iteration 3, decreases to approximately 0.028 at iteration 4, and ends at approximately 0.031 at iteration 5. The trend is relatively stable, with a slight upward movement.
* **Incorrect Flip (Dark Blue):** The line begins at approximately 0.024 at iteration 1, decreases to approximately 0.019 at iteration 2, rises to approximately 0.041 at iteration 3, decreases to approximately 0.025 at iteration 4, and ends at approximately 0.028 at iteration 5. This line also shows fluctuations, but less pronounced than Multiple-Choice.
### Key Observations
* The Multiple-Choice method exhibits the largest variation in the proportion of flips, with a significant drop at iteration 2 and a peak at iteration 3.
* The Generation and Incorrect Flip methods show similar trends, fluctuating around a similar level.
* The Correct Flip method remains relatively stable throughout the iterations.
* The proportion of flips for all methods appears to be relatively low, generally below 0.06.
### Interpretation
The chart suggests that the Multiple-Choice method is the most sensitive to changes across iterations, as indicated by its large fluctuations. This could imply that the model's performance on Multiple-Choice tasks is more variable or that the method is more susceptible to the specific changes implemented in each iteration. The stability of the Correct Flip method might indicate that the model consistently identifies correct flips, or that the task is relatively easy. The similar trends of Generation and Incorrect Flip suggest a correlation between these two methods, potentially indicating that errors in generation lead to incorrect flips. The overall low proportion of flips suggests that the model is generally performing well, but there is still room for improvement, particularly in the Multiple-Choice method. The chart provides insights into the model's behavior under different conditions and can be used to identify areas for further optimization.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x37.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Performance
### Overview
This image presents a line chart illustrating the "Proportion of Flips" across five "Iterations" for different evaluation methods related to the Gemini-2.0-Flash model. The evaluation methods are "Generation", "Multiple-Choice", "Correct Flip", and "Incorrect Flip". The chart aims to show how the proportion of flips changes with each iteration for each method.
### Components/Axes
* **Title:** Gemini-2.0-Flash (positioned at the top-center)
* **X-axis:** Iterations (labeled at the bottom, ranging from 1 to 5)
* **Y-axis:** Proportion of Flips (labeled on the left, ranging from 0.00 to 0.10, with increments of 0.02)
* **Legend:** Located at the top-right corner, containing the following labels and corresponding colors:
* Generation (Blue) - Solid Line
* Multiple-Choice (Orange) - Solid Line
* Correct Flip (Black) - Dashed-Dot Line
* Incorrect Flip (Blue Dashed) - Dashed Line
### Detailed Analysis
Let's analyze each data series individually:
* **Generation (Blue Solid Line):** This line exhibits a fluctuating trend. It starts at approximately 0.065, decreases to a minimum of approximately 0.01 at iteration 3, and then increases to approximately 0.055 at iteration 5.
* Iteration 1: ~0.065
* Iteration 2: ~0.03
* Iteration 3: ~0.01
* Iteration 4: ~0.05
* Iteration 5: ~0.055
* **Multiple-Choice (Orange Solid Line):** This line generally decreases over the iterations. It begins at approximately 0.06, reaches a minimum of approximately 0.015 at iteration 3, and then slightly increases to approximately 0.025 at iteration 5.
* Iteration 1: ~0.06
* Iteration 2: ~0.02
* Iteration 3: ~0.015
* Iteration 4: ~0.01
* Iteration 5: ~0.025
* **Correct Flip (Black Dashed-Dot Line):** This line starts at approximately 0.08, decreases to approximately 0.04 at iteration 2, then fluctuates around 0.04-0.05 before decreasing to approximately 0.03 at iteration 5.
* Iteration 1: ~0.08
* Iteration 2: ~0.04
* Iteration 3: ~0.05
* Iteration 4: ~0.04
* Iteration 5: ~0.03
* **Incorrect Flip (Blue Dashed Line):** This line starts at approximately 0.075, decreases to approximately 0.03 at iteration 2, increases to approximately 0.04 at iteration 4, and then decreases to approximately 0.025 at iteration 5.
* Iteration 1: ~0.075
* Iteration 2: ~0.03
* Iteration 3: ~0.035
* Iteration 4: ~0.04
* Iteration 5: ~0.025
### Key Observations
* The "Generation" and "Multiple-Choice" methods show a decreasing trend in the proportion of flips over the iterations, suggesting potential improvement or convergence.
* The "Correct Flip" and "Incorrect Flip" methods exhibit more stable behavior, with fluctuations around a consistent level.
* The "Generation" method has the most significant drop in the proportion of flips between iterations 1 and 3.
* The "Multiple-Choice" method generally has a lower proportion of flips than the "Correct Flip" and "Incorrect Flip" curves.
### Interpretation
The chart suggests that the "flips" plotted here are changes in the model's answers between consecutive refinement iterations, so a lower proportion of flips indicates greater stability in Gemini-2.0-Flash's responses. Both the "Generation" and "Multiple-Choice" curves fall well below their initial values, although "Generation" rebounds in the final two iterations. The "Correct Flip" and "Incorrect Flip" proportions track each other closely, indicating that beneficial and harmful revisions occur at similar rates; the gap between them reflects the net accuracy change per iteration. The fluctuations in all lines could be due to the stochastic nature of decoding or to variability across the evaluation examples.
</details>
(f) Gemini-2.0-Flash
Figure 7: Models Correct and Incorrect Flips on CoT on DisambiguationQA
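A useful identity when reading these panels: the per-iteration accuracy change equals the correct-flip proportion minus the incorrect-flip proportion, since unchanged answers and wrong-to-wrong flips leave accuracy untouched. A small check on toy data (all names and values here are illustrative):

```python
def accuracy(ans, gold):
    """Fraction of answers matching the reference."""
    return sum(a == g for a, g in zip(ans, gold)) / len(gold)

prev = ["A", "B", "C", "D"]  # answers before a refinement round
curr = ["A", "C", "C", "A"]  # answers after it
gold = ["A", "C", "C", "C"]
n = len(gold)

correct_flip = sum(p != g and c == g for p, c, g in zip(prev, curr, gold)) / n
incorrect_flip = sum(p == g and c != g for p, c, g in zip(prev, curr, gold)) / n

# Accuracy gain of the round equals the gap between the two flip curves.
assert accuracy(curr, gold) - accuracy(prev, gold) == correct_flip - incorrect_flip
```

This is why panels where the dashed "Correct Flip" curve sits above "Incorrect Flip" correspond to net improvement from iterating, and panels where the two overlap correspond to churn without gain.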
<details>
<summary>x38.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (SmolLM2-1.7B)
### Overview
This line chart depicts the proportion of flips (likely referring to changes in model predictions) over five iterations for different evaluation methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart is titled "SmolLM2-1.7B", suggesting this data pertains to a model with that name and size.
### Components/Axes
* **X-axis:** Iterations (labeled 1 to 5).
* **Y-axis:** Proportion of Flips (scale ranges from 0.00 to 0.04).
* **Legend:** Located in the top-right corner.
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Solid Black Line with Circle Markers)
* Incorrect Flip (Dashed Black Line with Diamond Markers)
* **Title:** SmolLM2-1.7B (positioned at the top-center)
* **Gridlines:** Present to aid in reading values.
### Detailed Analysis
Let's analyze each line series:
* **Generation (Solid Blue Line):** Starts at approximately 0.010, decreases sharply to approximately 0.002 at iteration 2, and then remains near 0.000 for iterations 3, 4, and 5.
* **Multiple-Choice (Solid Orange Line):** Begins at approximately 0.034, decreases steadily to approximately 0.002 at iteration 3, then increases to approximately 0.009 at iteration 4, and remains at approximately 0.009 at iteration 5.
* **Correct Flip (Solid Black Line with Circle Markers):** Starts at approximately 0.001, remains near 0.000 for iterations 2, 3, 4, and 5.
* **Incorrect Flip (Dashed Black Line with Diamond Markers):** Starts at approximately 0.001, remains near 0.000 for iterations 2, 3, 4, and 5.
Here's a breakdown of approximate values at each iteration:
| Iteration | Generation | Multiple-Choice | Correct Flip | Incorrect Flip |
|---|---|---|---|---|
| 1 | 0.010 | 0.034 | 0.001 | 0.001 |
| 2 | 0.002 | 0.022 | 0.000 | 0.000 |
| 3 | 0.000 | 0.002 | 0.000 | 0.000 |
| 4 | 0.000 | 0.009 | 0.000 | 0.000 |
| 5 | 0.000 | 0.009 | 0.000 | 0.000 |
### Key Observations
* The "Generation" method shows a rapid decrease in the proportion of flips within the first two iterations, stabilizing at a very low level.
* The "Multiple-Choice" method also decreases, but more gradually, and shows a slight increase in the proportion of flips at iterations 4 and 5.
* Both "Correct Flip" and "Incorrect Flip" methods start at a very low proportion of flips and remain consistently near zero throughout all iterations.
* The initial proportion of flips for "Multiple-Choice" is significantly higher than for other methods.
### Interpretation
The data suggests that the SmolLM2-1.7B model's answers converge quickly under the "Generation" setting, becoming stable after the first two iterations. The "Multiple-Choice" setting converges more slowly, with some residual flipping even after several iterations. The consistently near-zero "Correct Flip" and "Incorrect Flip" proportions indicate that almost none of the flips change correctness in either direction; most revisions swap one answer for another of the same correctness.
The higher initial flip proportion for "Multiple-Choice" could indicate that the model is initially more uncertain when choosing among explicit options but settles on a choice over time. The slight rise at iterations 4 and 5 might reflect renewed switching on a handful of harder examples.
Overall, the flip proportions shrink over the five iterations, showing that the model's answers stabilize; the contrast between the two settings illustrates how the output format shapes self-correction behavior.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x39.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (Qwen2.5-3B)
### Overview
This line chart depicts the proportion of flips across different iterations for various series: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart evaluates Qwen2.5-3B over five self-correction iterations.
### Components/Axes
* **Title:** Qwen2.5-3B (top-center)
* **X-axis:** Iterations (labeled 1 to 5, evenly spaced)
* **Y-axis:** Proportion of Flips (scale from 0.00 to 0.14, evenly spaced)
* **Legend:** Located in the top-right corner.
* Generation (solid blue line)
* Multiple-Choice (solid orange line)
* Correct Flip (solid black line with circle markers)
* Incorrect Flip (dashed blue line with diamond markers)
### Detailed Analysis
Let's analyze each data series:
* **Generation (Blue Solid Line):** The line declines steeply from Iteration 1 to Iteration 4, then rises slightly at Iteration 5.
* Iteration 1: Approximately 0.09
* Iteration 2: Approximately 0.035
* Iteration 3: Approximately 0.01
* Iteration 4: Approximately 0.002
* Iteration 5: Approximately 0.02
* **Multiple-Choice (Orange Solid Line):** The line initially increases, peaks at Iteration 2, and then decreases.
* Iteration 1: Approximately 0.06
* Iteration 2: Approximately 0.125
* Iteration 3: Approximately 0.09
* Iteration 4: Approximately 0.03
* Iteration 5: Approximately 0.03
* **Correct Flip (Black Solid Line with Circle Markers):** The line generally decreases, with some fluctuations.
* Iteration 1: Approximately 0.085
* Iteration 2: Approximately 0.06
* Iteration 3: Approximately 0.045
* Iteration 4: Approximately 0.03
* Iteration 5: Approximately 0.025
* **Incorrect Flip (Dashed Blue Line with Diamond Markers):** The line fluctuates with a general downward trend.
* Iteration 1: Approximately 0.045
* Iteration 2: Approximately 0.04
* Iteration 3: Approximately 0.05
* Iteration 4: Approximately 0.015
* Iteration 5: Approximately 0.025
### Key Observations
* The "Generation" method shows a significant decrease in the proportion of flips from Iteration 1 to Iteration 3, suggesting improvement or convergence.
* The "Multiple-Choice" method exhibits a peak at Iteration 2, indicating a potential point of interest or instability.
* "Correct Flip" and "Incorrect Flip" both trend downward, with "Correct Flip" remaining at or above "Incorrect Flip" for most iterations.
* The "Generation" and "Incorrect Flip" lines converge towards the end of the iterations.
### Interpretation
The chart tracks Qwen2.5-3B during iterative self-correction. "Flips" refer to changes in the model's answers between consecutive refinement iterations, and the "Generation" and "Multiple-Choice" curves correspond to the two output formats.
The decreasing trends in "Generation" and "Correct Flip" suggest that the model's answers stabilize over iterations. The peak in "Multiple-Choice" at Iteration 2 might indicate a temporary burst of option switching after the first round of feedback. "Correct Flip" stays at or above "Incorrect Flip" for most iterations, so revisions fix answers at least as often as they break them, though errors are still introduced.
The convergence of the "Generation" and "Incorrect Flip" lines toward the end suggests that the model reaches a point where it rarely changes its answers further. In practice, this kind of plateau can indicate when to stop iterating, since additional rounds yield little net change.
</details>
(b) Qwen2.5-3B
<details>
<summary>x40.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (Llama-3.1-8B)
### Overview
This line chart depicts the proportion of flips across different iterations for various methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart aims to show how these proportions change as the number of iterations increases. The title "Llama-3.1-8B" suggests this data relates to a language model with 8 billion parameters.
### Components/Axes
* **X-axis:** Iterations (labeled 1 to 5).
* **Y-axis:** Proportion of Flips (scale ranging from 0 to 0.14).
* **Data Series:**
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black dashed line)
* Incorrect Flip (Dark Blue dashed line)
* **Legend:** Located in the top-right corner, associating colors with each data series.
### Detailed Analysis
Let's analyze each data series individually, noting trends and approximate values.
* **Generation (Blue solid line):** This line exhibits a fluctuating trend. It starts at approximately 0.12 at Iteration 1, decreases to around 0.09 at Iteration 2, rises sharply to a peak of approximately 0.14 at Iteration 3, then declines to around 0.05 at Iteration 4, and finally increases slightly to about 0.07 at Iteration 5.
* **Multiple-Choice (Orange solid line):** This line stays low relative to the others. It begins at approximately 0.02 at Iteration 1, drops to around 0.01 at Iteration 2, stays near 0.01 at Iteration 3, rises to about 0.03 at Iteration 4, and ends at approximately 0.05 at Iteration 5.
* **Correct Flip (Black dashed line):** This line decreases gradually with fluctuations. It starts at approximately 0.10 at Iteration 1, dips to around 0.09 at Iteration 2, returns to about 0.10 at Iteration 3, decreases to approximately 0.08 at Iteration 4, and ends at about 0.07 at Iteration 5.
* **Incorrect Flip (Dark Blue dashed line):** This line shows a decreasing trend with some fluctuations. It begins at approximately 0.08 at Iteration 1, decreases to around 0.07 at Iteration 2, increases to about 0.09 at Iteration 3, decreases to approximately 0.05 at Iteration 4, and rises to about 0.06 at Iteration 5.
### Key Observations
* The "Generation" method initially has the highest proportion of flips, but this proportion fluctuates significantly across iterations.
* The "Multiple-Choice" method consistently has the lowest proportion of flips throughout all iterations.
* The "Correct Flip" and "Incorrect Flip" methods have proportions that are relatively close to each other, with "Correct Flip" generally being slightly higher.
* Iteration 3 appears to be a turning point for both "Generation" and "Incorrect Flip", where their proportions reach peaks or valleys.
### Interpretation
The data suggests that the "Generation" format is the most sensitive to repeated self-correction, exhibiting the largest fluctuations in the proportion of flips; open-ended answers appear more prone to revision and instability across iterations. The consistently low flip rate for the "Multiple-Choice" format suggests that selecting among fixed options stabilizes quickly and is less prone to revision. The closeness of "Correct Flip" and "Incorrect Flip" points to a trade-off: iterative revision fixes some answers but breaks nearly as many. The fluctuations observed across all series suggest that self-correction is not converging to a stable solution for Llama-3.1-8B on this task, and the changing flip proportions could help decide when further iterations stop paying off.
</details>
(c) Llama-3.1-8B
<details>
<summary>x41.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B - Proportion of Flips vs. Iterations
### Overview
This line chart displays the proportion of flips observed across different iterations for various methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart aims to illustrate how the frequency of flips changes with each iteration for each method. The title "Qwen2.5-14B" suggests this data relates to a model or experiment using that specific configuration.
### Components/Axes
* **X-axis:** Iterations (labeled 1 to 5).
* **Y-axis:** Proportion of Flips (scale from 0.00 to 0.05).
* **Legend:**
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Solid Black Line with Circle Markers)
* Incorrect Flip (Dashed Black Line with Square Markers)
* **Title:** Qwen2.5-14B (positioned at the top-center)
### Detailed Analysis
The chart shows the following trends and approximate data points:
* **Generation (Solid Blue Line):** This line starts at approximately 0.042 at Iteration 1 and decreases to approximately 0.002 at Iteration 5, with a plateau around 0.026 between Iterations 2 and 3.
* **Multiple-Choice (Solid Orange Line):** This line begins at approximately 0.01 at Iteration 1, increases to approximately 0.018 at Iteration 2, then decreases to approximately 0.001 at Iteration 4, and rises again to approximately 0.024 at Iteration 5.
* **Correct Flip (Solid Black Line with Circle Markers):** This line starts at approximately 0.026 at Iteration 1, remains relatively stable at approximately 0.026 between Iterations 1 and 3, then drops to approximately 0.001 at Iteration 4, and ends at approximately 0.006 at Iteration 5.
* **Incorrect Flip (Dashed Black Line with Square Markers):** This line begins at approximately 0.018 at Iteration 1, increases to approximately 0.021 at Iteration 2, decreases to approximately 0.016 at Iteration 3, and then drops to approximately 0.001 at Iteration 4, and ends at approximately 0.008 at Iteration 5.
### Key Observations
* The "Generation" method exhibits the most significant decrease in the proportion of flips across iterations.
* The "Multiple-Choice" method shows an initial increase followed by a decrease and then a final increase.
* Both "Correct Flip" and "Incorrect Flip" methods show a general decreasing trend, but with some fluctuations.
* The "Generation" method starts with the highest proportion of flips, while "Multiple-Choice" starts lowest among the four series.
* The proportion of flips for all methods is very low, remaining below 0.05 throughout.
### Interpretation
The data suggests that answers under the "Generation" format become more stable with successive self-correction iterations, as indicated by the decreasing proportion of flips; the higher initial flip rate may reflect early exploration of the answer space. The "Multiple-Choice" behavior is less monotonic, with a late uptick that may reflect sensitivity to the fixed option set. "Correct Flip" and "Incorrect Flip", which decompose flips by outcome, both trend downward, consistent with the model settling on its answers. The very low flip proportions overall (below 0.05) indicate that Qwen2.5-14B rarely revises its answers on this task; note that a decreasing flip rate by itself signals stability rather than accuracy.
</details>
(d) Qwen2.5-14B
<details>
<summary>x42.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B Performance
### Overview
This line chart displays the proportion of flips across different iterations for a model named DeepSeek-R1-Distill-Llama-8B. The chart compares the performance of "Generation", "Multiple-Choice", "Correct Flip", and "Incorrect Flip" methods over 5 iterations. The y-axis represents the "Proportion of Flips", and the x-axis represents "Iterations".
### Components/Axes
* **Title:** DeepSeek-R1-Distill-Llama-8B
* **X-axis Label:** Iterations (Scale: 1, 2, 3, 4, 5)
* **Y-axis Label:** Proportion of Flips (Scale: 0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12)
* **Legend:**
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black solid line with circle markers)
* Incorrect Flip (Black dashed line with square markers)
### Detailed Analysis
* **Generation (Blue Line):** The line starts at approximately 0.018 at iteration 1, dips to around 0.016 at iteration 2, rises to approximately 0.022 at iteration 3, decreases to about 0.019 at iteration 4, and then increases significantly to approximately 0.042 at iteration 5. This line shows a generally increasing trend, with a notable jump in the final iteration.
* **Multiple-Choice (Orange Line):** This line begins at approximately 0.072 at iteration 1, rises to a peak of around 0.094 at iteration 2, drops to approximately 0.062 at iteration 3, increases to about 0.078 at iteration 4, and then rises again to approximately 0.096 at iteration 5. The trend is generally fluctuating, with peaks at iterations 2 and 5.
* **Correct Flip (Black Line with Circles):** The line starts at approximately 0.035 at iteration 1, decreases to around 0.025 at iteration 2, rises slightly to approximately 0.028 at iteration 3, decreases to about 0.022 at iteration 4, and then drops sharply to approximately 0.002 at iteration 5. This line shows a decreasing trend, with a significant drop in the final iteration.
* **Incorrect Flip (Black Dashed Line with Squares):** The line begins at approximately 0.016 at iteration 1, decreases to around 0.014 at iteration 2, rises to approximately 0.018 at iteration 3, decreases to about 0.016 at iteration 4, and then drops to approximately 0.001 at iteration 5. This line also shows a decreasing trend, with a sharp drop in the final iteration.
### Key Observations
* The "Multiple-Choice" method consistently exhibits a higher proportion of flips compared to the other methods.
* Both "Correct Flip" and "Incorrect Flip" show a significant decrease in the proportion of flips at iteration 5.
* The "Generation" method shows a notable increase in the proportion of flips at iteration 5.
* The "Multiple-Choice" method fluctuates more than the other methods.
### Interpretation
The chart suggests that the DeepSeek-R1-Distill-Llama-8B model's performance varies across different methods and iterations. The consistently higher proportion of flips for "Multiple-Choice" might indicate that this method is more sensitive to changes or perturbations in the input. The decreasing trend in "Correct Flip" and "Incorrect Flip" at iteration 5 could suggest that the model is becoming more stable or confident in its predictions, reducing the need for flips. The increase in "Generation" at iteration 5 could indicate that the model is exploring more diverse or complex outputs. The fluctuations in "Multiple-Choice" could be due to the inherent variability in the multiple-choice task or the model's sensitivity to different question formulations. The data suggests that the model's behavior is evolving over iterations, with different methods exhibiting different patterns of change. The sharp drops in "Correct Flip" and "Incorrect Flip" at iteration 5 warrant further investigation to understand the underlying cause.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x43.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Performance
### Overview
This image presents a line chart illustrating the "Proportion of Flips" across five "Iterations" for different model configurations: "Generation", "Multiple-Choice", "Correct Flip", and "Incorrect Flip". The chart appears to be evaluating the stability or consistency of the Gemini-2.0-Flash model.
### Components/Axes
* **Title:** Gemini-2.0-Flash (positioned at the top-center)
* **X-axis:** Iterations (labeled, ranging from 1 to 5, with tick marks at each integer value)
* **Y-axis:** Proportion of Flips (labeled, ranging from 0.00 to 0.07, with tick marks at 0.01 intervals)
* **Legend:** Located at the top-right corner, containing the following entries:
* Generation (solid blue line)
* Multiple-Choice (solid orange line)
* Correct Flip (dashed black line)
* Incorrect Flip (dashed dark-grey line)
* **Gridlines:** Horizontal and vertical gridlines are present to aid in reading values.
### Detailed Analysis
Let's analyze each data series:
* **Generation (solid blue line):** This line starts at approximately 0.063 at Iteration 1, decreases to around 0.032 at Iteration 2, rises to approximately 0.038 at Iteration 3, drops to about 0.008 at Iteration 4, and then increases to approximately 0.012 at Iteration 5. The overall trend is decreasing, but with some fluctuations.
* **Multiple-Choice (solid orange line):** This line begins at approximately 0.018 at Iteration 1, rapidly decreases to near 0.00 at Iteration 2, remains close to 0.00 for Iterations 3 and 4, and slightly increases to approximately 0.002 at Iteration 5. This line shows a strong decreasing trend.
* **Correct Flip (dashed black line):** This line starts at approximately 0.032 at Iteration 1, decreases to around 0.031 at Iteration 2, rises to approximately 0.038 at Iteration 3, increases to about 0.042 at Iteration 4, and then decreases to approximately 0.035 at Iteration 5. The trend is relatively flat with some minor fluctuations.
* **Incorrect Flip (dashed dark-grey line):** This line begins at approximately 0.035 at Iteration 1, decreases to around 0.031 at Iteration 2, drops to approximately 0.00 at Iteration 3, rises to about 0.037 at Iteration 4, and then decreases to approximately 0.032 at Iteration 5. This line shows a fluctuating trend, with a significant dip at Iteration 3.
### Key Observations
* The "Multiple-Choice" configuration exhibits the most significant decrease in "Proportion of Flips" across iterations, suggesting increased stability or consistency.
* The "Generation" configuration shows a more erratic pattern, with fluctuations in the proportion of flips.
* The "Incorrect Flip" line has a notable dip at Iteration 3, indicating a potential anomaly or change in behavior at that point.
* The "Correct Flip" line remains relatively stable throughout the iterations.
### Interpretation
The chart suggests that the Gemini-2.0-Flash model's answers become more stable (fewer flips) over iterations, particularly under the "Multiple-Choice" format. The "Generation" format, while also decreasing overall, exhibits more variability, potentially indicating that open-ended answers are revised more freely. The dip in "Incorrect Flip" at Iteration 3 is a local anomaly; with no parameter updates occurring between iterations, it most plausibly reflects the particular answers in play at that step. The stability of the "Correct Flip" line suggests the model continues to convert some wrong answers into right ones at a steady rate.
The "Proportion of Flips" metric measures how often the model changes its answer between consecutive self-correction iterations. A lower proportion indicates greater consistency, though not necessarily higher accuracy: a model can be consistently wrong. The chart therefore speaks to the stability of the two output formats under iterative refinement, with the fluctuations in the "Generation" and "Incorrect Flip" series warranting further analysis.
</details>
(f) Gemini-2.0-Flash
Figure 8: Models' Correct and Incorrect Flips under Self-Correction (SC) on DisambiguationQA
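The flip statistics plotted in Figure 8 can be reproduced from per-iteration answer logs. The following is a minimal sketch, not the authors' released code; `flip_proportions` and all other names are illustrative. It counts, per self-correction iteration, the fraction of items whose answer changed, split into correct flips (wrong to right) and incorrect flips (right to wrong):

```python
# Sketch (not the authors' released code) of the flip bookkeeping behind
# Figure 8. All names are illustrative.

def flip_proportions(answers_per_item, gold):
    """Per-iteration flip statistics for iterative self-correction.

    answers_per_item: one list per item, holding the model's answer after
        each iteration (index 0 = initial response).
    gold: the gold answer for each item.
    Returns one dict per iteration t >= 1 with the proportion of items
    whose answer changed from iteration t-1, split into correct flips
    (wrong -> right) and incorrect flips (right -> wrong).
    """
    n_items = len(answers_per_item)
    n_iters = len(answers_per_item[0])
    stats = []
    for t in range(1, n_iters):
        flips = correct = incorrect = 0
        for answers, g in zip(answers_per_item, gold):
            prev, curr = answers[t - 1], answers[t]
            if prev != curr:
                flips += 1
                if prev != g and curr == g:
                    correct += 1          # flip fixed a wrong answer
                elif prev == g and curr != g:
                    incorrect += 1        # flip broke a correct answer
        stats.append({
            "iteration": t,
            "flip": flips / n_items,
            "correct_flip": correct / n_items,
            "incorrect_flip": incorrect / n_items,
        })
    return stats
```

On DisambiguationQA-style data, `answers_per_item[i][t]` would hold the answer for item `i` after iteration `t`; flips that change one wrong answer into another wrong answer count toward the overall flip rate but toward neither outcome bucket.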
<details>
<summary>x44.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations for SmolLM2-1.7B
### Overview
This line chart depicts the proportion of flips observed across different iterations for the SmolLM2-1.7B model. The chart compares the performance of "Generation", "Multiple-Choice", "Correct Flip", and "Incorrect Flip" methods. The x-axis represents the iteration number (1 to 5), and the y-axis represents the proportion of flips, ranging from 0 to 0.07.
### Components/Axes
* **Title:** SmolLM2-1.7B (positioned at the top-center)
* **X-axis Label:** Iterations (positioned at the bottom-center)
* **Y-axis Label:** Proportion of Flips (positioned at the left-center)
* **Legend:** Located at the top-right corner, containing the following labels and corresponding colors:
* Generation (Blue) - Solid Line
* Multiple-Choice (Orange) - Solid Line
* Correct Flip (Black) - Solid Line with Circle Markers
* Incorrect Flip (Purple/Dark Blue) - Dashed Line with Circle Markers
### Detailed Analysis
The chart displays five iterations on the x-axis, labeled 1 through 5. The y-axis shows the proportion of flips, with gridlines at 0.01 intervals.
* **Generation (Blue):** The line starts at approximately 0.065, decreases to around 0.04 at iteration 2, sharply increases to a peak of approximately 0.07 at iteration 3, then declines to around 0.02 at iteration 4, and remains relatively stable at approximately 0.02 at iteration 5.
* **Multiple-Choice (Orange):** The line begins at approximately 0.03, rapidly decreases to near 0 at iteration 2, and remains at approximately 0 for iterations 3, 4, and 5.
* **Correct Flip (Black):** The line starts at approximately 0.04, decreases to around 0.03 at iteration 2, increases to approximately 0.035 at iteration 3, rises to approximately 0.04 at iteration 4, and decreases slightly to approximately 0.035 at iteration 5.
* **Incorrect Flip (Purple/Dark Blue):** The line begins at approximately 0.03, remains around 0.03 at iteration 2, increases to approximately 0.035 at iteration 3, rises to approximately 0.04 at iteration 4, and decreases slightly to approximately 0.03 at iteration 5.
Approximate Data Points:
| Iteration | Generation | Multiple-Choice | Correct Flip | Incorrect Flip |
|---|---|---|---|---|
| 1 | 0.065 | 0.03 | 0.04 | 0.03 |
| 2 | 0.04 | 0.00 | 0.03 | 0.03 |
| 3 | 0.07 | 0.00 | 0.035 | 0.035 |
| 4 | 0.02 | 0.00 | 0.04 | 0.04 |
| 5 | 0.02 | 0.00 | 0.035 | 0.03 |
### Key Observations
* The "Generation" method exhibits the most significant fluctuations in the proportion of flips, peaking at iteration 3.
* The "Multiple-Choice" method shows a rapid and substantial decrease in the proportion of flips, reaching near-zero levels after the first iteration.
* "Correct Flip" and "Incorrect Flip" show relatively stable proportions of flips throughout the iterations, with minor variations.
* The "Multiple-Choice" method appears to converge to a very low flip rate quickly.
### Interpretation
The data suggests that the "Generation" format is more sensitive to repeated self-correction, experiencing larger swings in the proportion of flips than the other series. The rapid decline of the "Multiple-Choice" flip rate indicates that, once an option is chosen, the model almost never revises it. The comparatively stable rates for "Correct Flip" and "Incorrect Flip" suggest that the revisions that do occur are roughly as likely to help as to hurt.
The differences between the two formats likely stem from their output spaces: the fixed option set constrains revision, while free-form generation leaves more room for change. The peak in the "Generation" flip rate at iteration 3 might mark a phase of exploration before the answers settle. Note that the near-zero flip rate under Multiple-Choice indicates stability, not necessarily better performance: SmolLM2-1.7B may simply be committing to its initial choice.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x45.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (Qwen2.5-3B)
### Overview
This line chart displays the proportion of flips across different iterations for various methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart appears to be evaluating the performance of a model (Qwen2.5-3B) over several iterations, likely in a training or evaluation context.
### Components/Axes
* **Title:** Qwen2.5-3B (positioned top-center)
* **X-axis:** Iterations (labeled, ranging from 1 to 5, with tick marks at each integer value)
* **Y-axis:** Proportion of Flips (labeled, ranging from 0.00 to 0.10, with tick marks at 0.02 intervals)
* **Legend:** Located at the top-right of the chart.
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black dashed line)
* Incorrect Flip (Black dotted line)
### Detailed Analysis
The chart contains four distinct lines representing the proportion of flips for each method across the five iterations.
* **Generation (Blue):** The line starts at approximately 0.01 at iteration 1, increases to around 0.03 at iteration 2, decreases to approximately 0.02 at iteration 3, stays around 0.02 at iteration 4, and then rises slightly to approximately 0.025 at iteration 5. The trend is generally flat with minor fluctuations.
* **Multiple-Choice (Orange):** This line begins at approximately 0.02 at iteration 1, rises sharply to around 0.08 at iteration 2, decreases to approximately 0.04 at iteration 3, increases to around 0.055 at iteration 4, and then declines to approximately 0.03 at iteration 5. This line exhibits a more pronounced peak at iteration 2.
* **Correct Flip (Black dashed):** The line starts at approximately 0.08 at iteration 1, decreases to around 0.055 at iteration 3, increases to approximately 0.06 at iteration 4, and then declines to approximately 0.05 at iteration 5. This line shows a general decreasing trend with some fluctuations.
* **Incorrect Flip (Black dotted):** This line begins at approximately 0.08 at iteration 1, decreases to around 0.02 at iteration 5. This line shows a consistent decreasing trend.
### Key Observations
* The "Multiple-Choice" method shows the most significant variation in the proportion of flips, peaking at iteration 2.
* Both "Correct Flip" and "Incorrect Flip" start at the same proportion (approximately 0.08) but diverge over the iterations, with "Incorrect Flip" decreasing more rapidly.
* The "Generation" method exhibits the most stable behavior, with relatively small changes in the proportion of flips across iterations.
* After the first iteration, the "Incorrect Flip" proportion falls below the "Correct Flip" proportion, suggesting that flips increasingly favor correct answers.
### Interpretation
The data suggests that the Qwen2.5-3B model's answer stability changes over iterations, as reflected in the varying flip proportions. The peak in "Multiple-Choice" flips at iteration 2 could indicate a burst of answer revisions, perhaps reflecting increased uncertainty at that step. The steadily decreasing "Incorrect Flip" rate means the model overturns fewer correct answers as iterations proceed, while the relatively flat "Generation" line suggests open-ended answers are revised at a fairly constant rate. The widening gap between "Correct Flip" and "Incorrect Flip" implies that the revisions that do occur are increasingly beneficial, which is a positive sign for self-correction on this task.
</details>
(b) Qwen2.5-3B
<details>
<summary>x46.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B Performance Over Iterations
### Overview
This line chart depicts the performance of the Llama-3.1-8B model across five iterations, measuring the proportion of flips for different evaluation methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart aims to show how the model's performance changes with each iteration.
### Components/Axes
* **Title:** Llama-3.1-8B
* **X-axis:** Iterations (labeled 1, 2, 3, 4, 5)
* **Y-axis:** Proportion of Flips (scale from 0.02 to 0.14)
* **Legend:**
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Black Line with Circle Markers)
* Incorrect Flip (Black Dashed Line)
### Detailed Analysis
The chart displays four distinct lines representing the proportion of flips for each method over the five iterations.
* **Generation (Solid Blue Line):** This line starts at approximately 0.12 at iteration 1, decreases to around 0.08 at iteration 2, rises to approximately 0.10 at iteration 3, then declines to roughly 0.07 at iteration 4, and finally settles around 0.065 at iteration 5. The overall trend is slightly downward.
* **Multiple-Choice (Solid Orange Line):** This line begins at approximately 0.10 at iteration 1, drops sharply to around 0.04 at iteration 2, continues to decrease to approximately 0.02 at iteration 4, and then slightly increases to around 0.04 at iteration 5. This line shows a significant downward trend.
* **Correct Flip (Black Line with Circle Markers):** This line starts at approximately 0.08 at iteration 1, remains relatively stable around 0.08 at iteration 2, increases to approximately 0.09 at iteration 3, decreases to around 0.07 at iteration 4, and then remains around 0.06 at iteration 5.
* **Incorrect Flip (Black Dashed Line):** This line begins at approximately 0.06 at iteration 1, decreases to around 0.05 at iteration 2, increases to approximately 0.06 at iteration 3, rises to around 0.065 at iteration 4, and then remains around 0.06 at iteration 5.
### Key Observations
* The Multiple-Choice method shows the most significant decrease in the proportion of flips over the iterations, suggesting its answers stabilize quickly.
* The Generation method exhibits a more fluctuating pattern, with a slight overall downward trend.
* The Correct Flip and Incorrect Flip lines remain relatively stable throughout the iterations, with minor fluctuations.
* The initial proportion of flips for Generation and Multiple-Choice is higher than for Correct and Incorrect Flip.
### Interpretation
The data suggests that Llama-3.1-8B's multiple-choice answers stabilize as iterations progress, as indicated by the decreasing proportion of flips; the generation-format answers also settle somewhat, though less markedly. The relatively stable "Correct Flip" and "Incorrect Flip" rates suggest that the mix of helpful and harmful revisions does not change much across iterations. The higher initial flip rates for both formats could reflect early uncertainty before the model commits to its answers. Overall, the model settles faster in the multiple-choice setting, while the generation task remains more variable.
</details>
(c) Llama-3.1-8B
<details>
<summary>x47.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B - Proportion of Flips vs. Iterations
### Overview
This line chart displays the proportion of flips observed across five iterations for two different methods: Generation and Multiple-Choice, along with the proportion of correct and incorrect flips. The chart aims to compare the behavior of these methods over successive iterations.
### Components/Axes
* **Title:** Qwen2.5-14B
* **X-axis:** Iterations (labeled 1 to 5)
* **Y-axis:** Proportion of Flips (scale from 0.00 to 0.05)
* **Legend:**
* Generation (Blue Solid Line)
* Multiple-Choice (Orange Solid Line)
* Correct Flip (Black Dashed Line)
* Incorrect Flip (Black Dotted Line)
* **Gridlines:** Present, providing a visual aid for reading values.
### Detailed Analysis
The chart contains four distinct lines representing the proportion of flips for each category.
* **Generation (Blue Solid Line):** This line exhibits a strong upward trend from Iteration 1 to Iteration 3, peaking at approximately 0.052. It then sharply declines to approximately 0.008 at Iteration 5.
* Iteration 1: ~0.015
* Iteration 2: ~0.032
* Iteration 3: ~0.052
* Iteration 4: ~0.022
* Iteration 5: ~0.008
* **Multiple-Choice (Orange Solid Line):** This line shows a decreasing trend from Iteration 1 to Iteration 5.
* Iteration 1: ~0.031
* Iteration 2: ~0.011
* Iteration 3: ~0.009
* Iteration 4: ~0.007
* Iteration 5: ~0.002
* **Correct Flip (Black Dashed Line):** This line fluctuates with a downward trend overall.
* Iteration 1: ~0.021
* Iteration 2: ~0.031
* Iteration 3: ~0.011
* Iteration 4: ~0.022
* Iteration 5: ~0.004
* **Incorrect Flip (Black Dotted Line):** This line generally decreases over the iterations.
* Iteration 1: ~0.004
* Iteration 2: ~0.003
* Iteration 3: ~0.001
* Iteration 4: ~0.001
* Iteration 5: ~0.0
### Key Observations
* The "Generation" method shows a significant increase in the proportion of flips up to Iteration 3, followed by a dramatic decrease.
* The "Multiple-Choice" method consistently exhibits a lower proportion of flips and a steady decline over the iterations.
* The "Correct Flip" line fluctuates across iterations and ends near zero at Iteration 5.
* The "Incorrect Flip" line is consistently low and decreasing.
### Interpretation
The data suggests that the "Generation" method initially experiences a period of increased activity (flips) as it explores the answer space, reaching a peak at Iteration 3. The subsequent decline could indicate convergence or a stabilization of the generated outputs. The "Multiple-Choice" method, on the other hand, demonstrates a consistent decreasing trend, potentially indicating faster convergence under a more constrained search process. The low and decreasing proportion of "Incorrect Flips" suggests that revisions rarely overturn correct answers, and do so less often over time. The difference in behavior between the two methods could be due to the inherent differences in their output spaces: the "Generation" method is more exploratory, while the "Multiple-Choice" method selects from a predefined set of options. Qwen2.5-14B appears to stabilize with each iteration, as indicated by the declining flip proportions.
</details>
(d) Qwen2.5-14B
<details>
<summary>x48.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B Performance
### Overview
This image presents a line chart illustrating the proportion of flips across different iterations for various methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart appears to track the performance of a model (DeepSeek-R1-Distill-Llama-8B) over five iterations.
### Components/Axes
* **Title:** DeepSeek-R1-Distill-Llama-8B
* **X-axis:** Iterations (labeled 1 to 5)
* **Y-axis:** Proportion of Flips (scale from 0.00 to 0.08)
* **Legend:** Located in the top-right corner.
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black dashed line)
* Incorrect Flip (Black dotted line)
* **Gridlines:** Present to aid in reading values.
### Detailed Analysis
The chart displays four distinct lines representing the proportion of flips for each method across the five iterations.
* **Generation (Blue Line):** The line dips through the middle iterations and then recovers, ending near its starting level.
* Iteration 1: Approximately 0.064
* Iteration 2: Approximately 0.052
* Iteration 3: Approximately 0.032
* Iteration 4: Approximately 0.045
* Iteration 5: Approximately 0.066
* **Multiple-Choice (Orange Line):** This line exhibits more fluctuation.
* Iteration 1: Approximately 0.054
* Iteration 2: Approximately 0.064
* Iteration 3: Approximately 0.018
* Iteration 4: Approximately 0.028
* Iteration 5: Approximately 0.052
* **Correct Flip (Black Dashed Line):** The line shows a decreasing trend initially, then a slight increase.
* Iteration 1: Approximately 0.052
* Iteration 2: Approximately 0.022
* Iteration 3: Approximately 0.018
* Iteration 4: Approximately 0.042
* Iteration 5: Approximately 0.058
* **Incorrect Flip (Black Dotted Line):** This line generally increases over the iterations.
* Iteration 1: Approximately 0.028
* Iteration 2: Approximately 0.042
* Iteration 3: Approximately 0.026
* Iteration 4: Approximately 0.056
* Iteration 5: Approximately 0.062
### Key Observations
* The "Generation" method dips mid-run but returns to a high proportion of flips by iteration 5, suggesting it remains prone to changes late in the process.
* The "Multiple-Choice" method exhibits the most volatility, with a significant drop in flips at iteration 3.
* "Correct Flip" initially decreases, then increases, meaning the model makes fewer wrong-to-right corrections mid-run before recovering.
* "Incorrect Flip" rises overall, suggesting a growing number of incorrect changes in later iterations.
### Interpretation
The data suggests that the DeepSeek-R1-Distill-Llama-8B model's behavior changes over iterations, as indicated by the varying flip proportions. The rising trend in "Generation" flips could indicate growing instability or sensitivity to the refinement prompt. The sharp drop in "Multiple-Choice" flips at iteration 3 is a notable anomaly that warrants further investigation; it could be due to specific inputs or to the constrained option set. The increasing "Incorrect Flip" proportion is concerning, as it suggests the model becomes more likely to overwrite correct answers in later iterations. The interplay between "Correct Flip" and "Incorrect Flip" is also interesting: the initial decrease in correct flips followed by an increase could mark a period of stabilization followed by renewed, and partly erroneous, revision. Overall, the chart provides valuable insights into the model's self-correction dynamics and potential areas for improvement.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x49.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Performance
### Overview
This image presents a line chart illustrating the proportion of flips (likely referring to changes in model responses or predictions) across five iterations for different evaluation methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart aims to compare the stability and behavior of the Gemini-2.0-Flash model over successive iterations.
### Components/Axes
* **Title:** Gemini-2.0-Flash (positioned at the top-center)
* **X-axis:** Iterations (labeled, ranging from 1 to 5, with tick marks at each integer value)
* **Y-axis:** Proportion of Flips (labeled, ranging from 0.00 to 0.04, with tick marks at 0.00, 0.01, 0.02, 0.03, and 0.04)
* **Legend:** Located in the top-left corner, containing the following entries:
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black dashed line)
* Incorrect Flip (Blue dashed line)
### Detailed Analysis
The chart displays four distinct lines representing the proportion of flips for each evaluation method across the five iterations.
* **Generation (Blue solid line):** This line starts at approximately 0.013, drops to nearly 0 at iteration 2, rises to approximately 0.022 at iteration 3, falls to approximately 0.003 at iteration 4, and then sharply increases to approximately 0.034 at iteration 5. The trend is highly variable, showing a significant increase in the final iteration.
* **Multiple-Choice (Orange solid line):** This line begins at approximately 0.00, rises to a peak of approximately 0.032 at iteration 2, decreases to approximately 0.02 at iteration 3, falls to approximately 0.01 at iteration 4, and then declines to approximately 0.003 at iteration 5. The trend is initially increasing, then decreasing.
* **Correct Flip (Black dashed line):** This line starts at approximately 0.022, decreases to approximately 0.014 at iteration 2, rises to approximately 0.021 at iteration 3, falls to approximately 0.01 at iteration 4, and remains relatively stable at approximately 0.01 at iteration 5. The trend is relatively flat with minor fluctuations.
* **Incorrect Flip (Blue dashed line):** This line begins at approximately 0.01, remains relatively constant around 0.01 for iterations 2, 3, and 4, and then increases slightly to approximately 0.012 at iteration 5. The trend is largely stable.
### Key Observations
* The "Generation" method exhibits the most significant change in the proportion of flips, particularly in the final iteration.
* The "Multiple-Choice" method shows a clear peak in flips at iteration 2, followed by a decline.
* "Correct Flip" and "Incorrect Flip" methods demonstrate relatively stable proportions of flips throughout the iterations.
* The "Generation" method's final value is significantly higher than all other methods at iteration 5.
### Interpretation
The data suggests that the Gemini-2.0-Flash model's behavior, as measured by the "Generation" method, becomes more volatile or prone to change in later iterations. The initial stability observed in the first few iterations gives way to a substantial increase in flips at iteration 5. This could indicate that the model is exploring new solution spaces or encountering difficulties in converging to a stable output.
The "Multiple-Choice" method's peak at iteration 2 might represent a period of heightened exploration or adaptation, followed by a consolidation of learned patterns. The stability of the "Correct Flip" and "Incorrect Flip" methods suggests that the model's ability to identify and correct errors remains relatively consistent across iterations.
The stark difference in the final iteration between the "Generation" method and the others warrants further investigation. It could be due to a specific characteristic of the open-ended generation process or to an emergent property of the model's refinement dynamics. The data highlights the importance of monitoring model behavior across multiple evaluation methods to gain a comprehensive understanding of its performance and stability.
</details>
(f) Gemini-2.0-Flash
Figure 9: Models Correct and Incorrect Flips on Baseline on tinyTruthfulQA
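The flip proportions plotted in these figures can be computed directly from per-example answer traces recorded across self-correction iterations. A minimal sketch of that bookkeeping (the function and field names are illustrative, not taken from the paper's released code):

```python
def flip_stats(traces, gold):
    """traces: one answer list per example, one answer per iteration
    (index 0 = initial answer). gold: reference answers, one per example.
    Returns, for each transition between consecutive iterations, the
    proportion of examples whose answer flipped, flipped to correct
    (wrong -> right), or flipped to incorrect (right -> wrong)."""
    n_examples = len(traces)
    n_transitions = len(traces[0]) - 1
    stats = []
    for t in range(n_transitions):
        flips = correct = incorrect = 0
        for answers, g in zip(traces, gold):
            prev, curr = answers[t], answers[t + 1]
            if prev == curr:
                continue  # no flip at this iteration
            flips += 1
            if curr == g:      # wrong -> right
                correct += 1
            elif prev == g:    # right -> wrong
                incorrect += 1
        stats.append({
            "flip": flips / n_examples,
            "correct_flip": correct / n_examples,
            "incorrect_flip": incorrect / n_examples,
        })
    return stats
```

Under this accounting, a correct flip is a transition from a non-gold to the gold answer and an incorrect flip the reverse; a flip between two non-gold answers counts toward the total only.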
<details>
<summary>x50.png Details</summary>

### Visual Description
## Line Chart: SmolLM2-1.7B - Proportion of Flips vs. Iterations
### Overview
This line chart depicts the proportion of flips across different methods (Generation, Multiple-Choice, Correct Flip, Incorrect Flip) as a function of iteration number, ranging from 1 to 5. The chart evaluates the behavior of the SmolLM2-1.7B model over successive self-correction iterations.
### Components/Axes
* **Title:** SmolLM2-1.7B
* **X-axis:** Iterations (labeled 1, 2, 3, 4, 5)
* **Y-axis:** Proportion of Flips (labeled, scale from 0.00 to 0.12, increments of 0.02)
* **Legend:**
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black solid line with circle markers)
* Incorrect Flip (Gray dashed line with square markers)
### Detailed Analysis
The chart displays four distinct lines representing the proportion of flips for each method across the five iterations.
* **Generation (Blue):** This line starts at approximately 0.055 at iteration 1, sharply increases to a peak of approximately 0.105 at iteration 2, then declines steadily to approximately 0.03 at iteration 5.
* **Multiple-Choice (Orange):** This line begins at approximately 0.02 at iteration 1, increases to approximately 0.035 at iteration 2, then fluctuates around 0.02-0.03, ending at approximately 0.025 at iteration 5.
* **Correct Flip (Black):** This line starts at approximately 0.05 at iteration 1, decreases to approximately 0.04 at iteration 2, increases to approximately 0.055 at iteration 3, then declines to approximately 0.03 at iteration 5.
* **Incorrect Flip (Gray):** This line begins at approximately 0.05 at iteration 1, increases to approximately 0.055 at iteration 3, then declines to approximately 0.01 at iteration 5.
Here's a table summarizing the approximate values:
| Iteration | Generation | Multiple-Choice | Correct Flip | Incorrect Flip |
|---|---|---|---|---|
| 1 | 0.055 | 0.02 | 0.05 | 0.05 |
| 2 | 0.105 | 0.035 | 0.04 | 0.055 |
| 3 | 0.08 | 0.03 | 0.055 | 0.055 |
| 4 | 0.04 | 0.03 | 0.02 | 0.02 |
| 5 | 0.03 | 0.025 | 0.03 | 0.01 |
### Key Observations
* The "Generation" method exhibits the most significant fluctuation, with a large increase in flips at iteration 2 followed by a consistent decline.
* "Multiple-Choice" remains relatively stable throughout the iterations, with a slight increase initially.
* "Incorrect Flip" shows a decreasing trend, indicating a reduction in incorrect flips as iterations progress.
* "Correct Flip" fluctuates, but generally decreases over the iterations.
### Interpretation
The data suggests that the "Generation" method goes through a period of instability or exploration at iteration 2, producing a higher proportion of flips, before converging toward a more stable state. The "Multiple-Choice" method is more consistent, indicating a potentially more robust setup. The decreasing "Incorrect Flip" trend suggests the model overwrites correct answers less often over time, while the fluctuations in "Correct Flip" may reflect variability in the task or in the refinement process.
The chart summarizes how SmolLM2-1.7B's answers evolve under self-correction, highlighting the strengths and weaknesses of each output format in terms of flip proportion. The "Generation" method, while initially volatile, settles over time, whereas "Multiple-Choice" offers a more stable baseline. The overall decline in "Incorrect Flip" indicates that later revisions introduce fewer new errors.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x51.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (Qwen2.5-3B)
### Overview
This line chart depicts the proportion of flips across different iterations for various methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The chart evaluates the behavior of the Qwen2.5-3B model over successive self-correction iterations. The x-axis represents the iteration number, and the y-axis represents the proportion of flips.
### Components/Axes
* **Title:** Qwen2.5-3B (top-center)
* **X-axis Label:** Iterations (bottom-center)
* Axis Markers: 1, 2, 3, 4, 5
* **Y-axis Label:** Proportion of Flips (left-center)
* Axis Scale: 0.02 to 0.14
* **Legend:** Located at the top-right of the chart.
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Solid Black Line with Circle Markers)
* Incorrect Flip (Dashed Black Line with Square Markers)
### Detailed Analysis
* **Generation (Blue Line):** The line slopes downward from iteration 1 to iteration 4, then slightly increases at iteration 5.
* Iteration 1: Approximately 0.11
* Iteration 2: Approximately 0.07
* Iteration 3: Approximately 0.06
* Iteration 4: Approximately 0.04
* Iteration 5: Approximately 0.045
* **Multiple-Choice (Orange Line):** This line exhibits a more volatile pattern. It increases from iteration 1 to iteration 2, decreases sharply to iteration 3, then fluctuates between iterations 4 and 5.
* Iteration 1: Approximately 0.04
* Iteration 2: Approximately 0.06
* Iteration 3: Approximately 0.02
* Iteration 4: Approximately 0.015
* Iteration 5: Approximately 0.04
* **Correct Flip (Black Line with Circles):** The line generally slopes downward, with some fluctuations.
* Iteration 1: Approximately 0.10
* Iteration 2: Approximately 0.08
* Iteration 3: Approximately 0.055
* Iteration 4: Approximately 0.035
* Iteration 5: Approximately 0.03
* **Incorrect Flip (Dashed Black Line with Squares):** This line shows a decreasing trend, with a relatively stable value at the end.
* Iteration 1: Approximately 0.095
* Iteration 2: Approximately 0.075
* Iteration 3: Approximately 0.05
* Iteration 4: Approximately 0.025
* Iteration 5: Approximately 0.02
### Key Observations
* The "Generation" and "Correct Flip" lines show a consistent downward trend, indicating a decrease in flips as iterations progress.
* The "Multiple-Choice" line is more erratic, suggesting instability or sensitivity to the iterative process.
* The "Incorrect Flip" line also decreases, but at a slower rate than "Generation" and "Correct Flip".
* The "Multiple-Choice" line dips to its lowest point at iteration 3, while the "Generation" line reaches its lowest point at iteration 4.
### Interpretation
The chart suggests that Qwen2.5-3B stabilizes with each iteration, as evidenced by the decreasing proportion of flips for the "Generation" and "Correct Flip" lines. The fluctuating behavior of the "Multiple-Choice" method could indicate greater sensitivity to the specific items encountered in each iteration. The decreasing trend in "Incorrect Flip" suggests the model overwrites correct answers less often as iterations progress. The gap between the "Correct Flip" and "Incorrect Flip" lines can be read as a net measure of how often revisions help rather than hurt. Overall, the model converges toward a stable set of answers, but the erratic behavior of the "Multiple-Choice" method warrants further investigation. The chart offers a quantitative view of the self-correction process and highlights areas for potential improvement.
</details>
(b) Qwen2.5-3B
<details>
<summary>x52.png Details</summary>

### Visual Description
## Line Chart: Llama-3.1-8B Performance Over Iterations
### Overview
This image presents a line chart illustrating the performance of the Llama-3.1-8B model across five iterations. The chart displays the proportion of flips (changes in the model's answers between consecutive iterations) for different methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The x-axis represents the iteration number, and the y-axis represents the proportion of flips.
### Components/Axes
* **Title:** Llama-3.1-8B
* **X-axis Label:** Iterations (Scale: 1, 2, 3, 4, 5)
* **Y-axis Label:** Proportion of Flips (Scale: 0.025 to 0.200, increments of 0.025)
* **Legend:**
* Generation (Blue Solid Line)
* Multiple-Choice (Orange Solid Line)
* Correct Flip (Black Dashed Line)
* Incorrect Flip (Black Dotted Line)
### Detailed Analysis
The chart contains four distinct lines representing the proportion of flips for each method over the five iterations.
* **Generation (Blue Solid Line):** This line starts at approximately 0.075 at iteration 1, rises to a peak of approximately 0.16 at iteration 2, then gradually declines to approximately 0.095 at iteration 5. The trend is initially upward, then downward.
* Iteration 1: ~0.075
* Iteration 2: ~0.16
* Iteration 3: ~0.13
* Iteration 4: ~0.115
* Iteration 5: ~0.095
* **Multiple-Choice (Orange Solid Line):** This line begins at approximately 0.17 at iteration 1, sharply decreases to approximately 0.04 at iteration 2, then fluctuates between approximately 0.04 and 0.06 for iterations 3, 4, and 5. The trend is initially downward, then relatively stable.
* Iteration 1: ~0.17
* Iteration 2: ~0.04
* Iteration 3: ~0.05
* Iteration 4: ~0.06
* Iteration 5: ~0.055
* **Correct Flip (Black Dashed Line):** This line starts at approximately 0.10 at iteration 1, decreases to approximately 0.09 at iteration 2, then increases to approximately 0.12 at iteration 4, and ends at approximately 0.11 at iteration 5. The trend is relatively flat with a slight increase towards the end.
* Iteration 1: ~0.10
* Iteration 2: ~0.09
* Iteration 3: ~0.10
* Iteration 4: ~0.12
* Iteration 5: ~0.11
* **Incorrect Flip (Black Dotted Line):** This line begins at approximately 0.11 at iteration 1, decreases to approximately 0.10 at iteration 2, then fluctuates around 0.10 for iterations 3, 4, and 5. The trend is relatively flat.
* Iteration 1: ~0.11
* Iteration 2: ~0.10
* Iteration 3: ~0.10
* Iteration 4: ~0.11
* Iteration 5: ~0.10
### Key Observations
* The "Generation" method shows the largest increase in the proportion of flips, peaking at iteration 2 before declining.
* The "Multiple-Choice" method experiences a rapid decrease in the proportion of flips after the first iteration.
* "Correct Flip" and "Incorrect Flip" remain relatively stable throughout the five iterations.
* The "Generation" method consistently has a higher proportion of flips than the "Multiple-Choice" method after iteration 1.
### Interpretation
The chart suggests that Llama-3.1-8B revises its answers heavily early on under the "Generation" method, with revision activity peaking at iteration 2 and then declining. The "Multiple-Choice" method shows a rapid drop in flips after the first iteration, suggesting it settles on an option quickly. The stability of "Correct Flip" and "Incorrect Flip" suggests that the balance between helpful and harmful revisions remains roughly constant throughout the iterations.
The data could indicate that the "Generation" format prompts substantial early revision with diminishing returns in later iterations, while the "Multiple-Choice" format reaches its limit quickly. The consistent levels of "Correct Flip" and "Incorrect Flip" suggest that the model's underlying error-correction behavior is not significantly altered by additional iterations. The chart provides insights into how the two output formats shape Llama-3.1-8B's self-correction and highlights the potential value of tailoring the number of iterations to the format.
</details>
(c) Llama-3.1-8B
<details>
<summary>x53.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B - Proportion of Flips vs. Iterations
### Overview
This chart displays the proportion of flips (changes in the model's answers between consecutive iterations) over five iterations for a model named Qwen2.5-14B. The chart compares different methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The y-axis represents the "Proportion of Flips" and the x-axis represents "Iterations".
### Components/Axes
* **Title:** Qwen2.5-14B (top-center)
* **X-axis Label:** Iterations (bottom-center)
* Axis Markers: 1, 2, 3, 4, 5
* **Y-axis Label:** Proportion of Flips (left-center)
* Axis Scale: 0.00 to 0.06 (approximately)
* **Legend:** Located in the top-right corner.
* Generation: Solid Blue Line
* Multiple-Choice: Solid Orange Line
* Correct Flip: Black Line with Circle Markers
* Incorrect Flip: Black Dashed Line with Circle Markers
### Detailed Analysis
* **Generation (Solid Blue Line):** The line slopes downward overall.
* Iteration 1: Approximately 0.032
* Iteration 2: Approximately 0.043
* Iteration 3: Approximately 0.022
* Iteration 4: Approximately 0.021
* Iteration 5: Approximately 0.018
* **Multiple-Choice (Solid Orange Line):** The line generally slopes downward, with a slight increase at Iteration 4.
* Iteration 1: Approximately 0.022
* Iteration 2: Approximately 0.009
* Iteration 3: Approximately 0.007
* Iteration 4: Approximately 0.011
* Iteration 5: Approximately 0.002
* **Correct Flip (Black Line with Circle Markers):** The line spikes at Iteration 2, then declines steadily.
* Iteration 1: Approximately 0.002
* Iteration 2: Approximately 0.042
* Iteration 3: Approximately 0.022
* Iteration 4: Approximately 0.010
* Iteration 5: Approximately 0.002
* **Incorrect Flip (Black Dashed Line with Circle Markers):** The line slopes downward.
* Iteration 1: Approximately 0.001
* Iteration 2: Approximately 0.003
* Iteration 3: Approximately 0.002
* Iteration 4: Approximately 0.001
* Iteration 5: Approximately 0.000
### Key Observations
* The "Generation" method maintains the highest proportion of flips overall and, after a rise at Iteration 2, decreases steadily.
* The "Multiple-Choice" method has a lower initial proportion of flips and shows a more erratic decrease.
* "Correct Flip" shows a large spike at iteration 2, then decreases.
* "Incorrect Flip" consistently has the lowest proportion of flips throughout all iterations.
* All methods show a decreasing trend in the proportion of flips as iterations progress.
### Interpretation
The data suggests that the Qwen2.5-14B model stabilizes as iterations increase, as indicated by the decreasing proportion of flips across all methods. The "Generation" method undergoes the most revision initially, while the "Incorrect Flip" proportion remains consistently near zero. The spike in "Correct Flip" at iteration 2 could mark a burst of successful revisions. The overall downward trend suggests the model converges on its answers and revises them less frequently in later iterations. The methods correspond to different output formats and flip outcomes, and their varying proportions reflect how effectively each format guides the model toward a stable, correct answer. The chart provides insight into the model's self-correction process.
</details>
(d) Qwen2.5-14B
<details>
<summary>x54.png Details</summary>

### Visual Description
## Line Chart: DeepSeek-R1-Distill-Llama-8B Performance
### Overview
This image presents a line chart illustrating the proportion of flips across different iterations for a model named DeepSeek-R1-Distill-Llama-8B. The chart compares the performance of "Generation" and "Multiple-Choice" methods against "Correct Flip" and "Incorrect Flip" outcomes. The x-axis represents iterations (1 to 5), and the y-axis represents the proportion of flips, ranging from 0.00 to 0.06.
### Components/Axes
* **Title:** DeepSeek-R1-Distill-Llama-8B
* **X-axis Label:** Iterations (with markers 1, 2, 3, 4, 5)
* **Y-axis Label:** Proportion of Flips (with markers 0.00, 0.02, 0.04, 0.06)
* **Legend:**
* Generation (Blue Solid Line)
* Multiple-Choice (Orange Solid Line)
* Correct Flip (Black Dashed Line)
* Incorrect Flip (Brown Dashed Line)
### Detailed Analysis
The chart displays four distinct lines representing the trends of each category over the five iterations.
* **Generation (Blue Solid Line):** This line starts at approximately 0.055 at iteration 1, decreases to around 0.042 at iteration 2, rises to a peak of approximately 0.052 at iteration 3, dips to around 0.044 at iteration 4, and then decreases to approximately 0.035 at iteration 5. The trend is generally fluctuating around 0.04-0.05.
* **Multiple-Choice (Orange Solid Line):** This line begins at approximately 0.02 at iteration 1, sharply increases to a peak of approximately 0.052 at iteration 2, then declines to around 0.022 at iteration 3, slightly increases to approximately 0.025 at iteration 4, and finally rises to approximately 0.03 at iteration 5. The trend shows a significant initial increase followed by a gradual decline and then a slight increase.
* **Correct Flip (Black Dashed Line):** This line starts at approximately 0.03 at iteration 1, decreases to around 0.015 at iteration 2, rises to approximately 0.02 at iteration 3, dips to a minimum of approximately 0.01 at iteration 4, and then increases to approximately 0.018 at iteration 5. The trend is relatively stable, fluctuating around 0.01-0.03.
* **Incorrect Flip (Brown Dashed Line):** This line begins at approximately 0.022 at iteration 1, decreases to a minimum of approximately 0.005 at iteration 2, rises to approximately 0.015 at iteration 3, dips to approximately 0.01 at iteration 4, and then increases to approximately 0.02 at iteration 5. The trend shows a significant initial decrease followed by a gradual increase.
### Key Observations
* The "Multiple-Choice" method exhibits the most dramatic change in proportion of flips, with a large increase in the early iterations.
* "Generation" and "Multiple-Choice" lines cross at iteration 2, indicating a shift in relative performance.
* "Correct Flip" and "Incorrect Flip" remain relatively low and stable throughout the iterations.
* The "Correct Flip" proportion generally matches or exceeds the "Incorrect Flip" proportion, suggesting that revisions fix errors at least as often as they introduce them.
### Interpretation
The data suggests that the "Multiple-Choice" method initially experiences a significant increase in flips, potentially indicating a period of rapid revision, followed by a decline as the model settles within its option set. The "Generation" method shows more stable behavior, fluctuating around a consistent level. The low proportions of "Correct Flip" and "Incorrect Flip" suggest that these outcomes are infrequent and relatively insensitive to the iteration count. Correct flips generally match or exceed incorrect flips, indicating that revisions fix errors at least as often as they introduce them, although the slight rise in incorrect flips at iteration 5 merits attention. The chart provides insights into the dynamic behavior of the DeepSeek-R1-Distill-Llama-8B model across iterations and methods, highlighting potential areas for improvement.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x55.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Performance Over Iterations
### Overview
This line chart depicts the proportion of flips (changes in the model's answers between consecutive iterations) across different methods (Generation, Multiple-Choice, Correct Flip, Incorrect Flip) over five iterations. The chart aims to show how these methods perform and evolve with each iteration.
### Components/Axes
* **Title:** Gemini-2.0-Flash (positioned at the top-center)
* **X-axis:** Iterations (labeled 1 to 5, evenly spaced along the horizontal axis)
* **Y-axis:** Proportion of Flips (labeled, ranging from 0.00 to 0.08, evenly spaced along the vertical axis)
* **Legend:** Located at the top-right corner, containing the following labels and corresponding line styles/colors:
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Black Dashed-Dot Line)
* Incorrect Flip (Black Dashed Line)
### Detailed Analysis
The chart displays four data series, each representing a different method.
* **Generation (Blue Line):** This line starts at approximately 0.04 at Iteration 1, rises sharply to around 0.07 at Iteration 2, decreases to approximately 0.06 at Iteration 3, drops to around 0.04 at Iteration 4, and then increases significantly to approximately 0.07 at Iteration 5. The trend is generally fluctuating, with a noticeable increase in the final iteration.
* **Multiple-Choice (Orange Line):** This line begins at approximately 0.065 at Iteration 1, decreases to around 0.035 at Iteration 2, reaches a minimum of approximately 0.02 at Iteration 3, rises slightly to around 0.03 at Iteration 4, and then increases to approximately 0.04 at Iteration 5. The trend is generally decreasing, with a slight increase at the end.
* **Correct Flip (Black Dashed-Dot Line):** This line starts at approximately 0.055 at Iteration 1, decreases to around 0.04 at Iteration 2, remains relatively stable at approximately 0.04 at Iteration 3, decreases to around 0.035 at Iteration 4, and then increases slightly to approximately 0.04 at Iteration 5. The trend is relatively flat, with minor fluctuations.
* **Incorrect Flip (Black Dashed Line):** This line begins at approximately 0.06 at Iteration 1, decreases to around 0.02 at Iteration 3, rises to approximately 0.03 at Iteration 4, and then decreases to approximately 0.01 at Iteration 5. The trend is generally decreasing, with a significant drop after Iteration 1.
### Key Observations
* The "Generation" method shows the most significant fluctuation, with a clear upward trend in the final iteration.
* The "Multiple-Choice" method exhibits a lower proportion of flips than "Generation" after the first iteration, and shows a decreasing trend overall.
* The "Incorrect Flip" method starts with a high proportion of flips but experiences a substantial decrease over the iterations.
* The "Correct Flip" method remains relatively stable throughout the iterations.
### Interpretation
The data suggests that the "Generation" method, while fluctuating, becomes more prone to flips as the iterations progress, potentially indicating instability or a need for further refinement. The "Multiple-Choice" method appears the most stable after the first iteration, consistently showing a low proportion of flips. The decreasing trend in "Incorrect Flip" suggests the model overwrites correct answers less often over time, while the relatively stable "Correct Flip" line indicates a consistent rate of successful revisions.
The relationship between these methods suggests a trade-off between exploration (Generation) and exploitation (Multiple-Choice): open-ended generation explores a wider range of possibilities, producing more flips, while multiple-choice selection focuses on a fixed option set. Gemini-2.0-Flash's behavior thus differs markedly by output format, with some setups proving more stable than others.
</details>
(f) Gemini-2.0-Flash
Figure 10: Models Correct and Incorrect Flips on CoT on tinyTruthfulQA
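A recurring pattern across these panels is that flips taper off once a model's answer stabilizes. One practical way to exploit this is to stop the self-correction loop early when two consecutive iterations agree; a minimal sketch, where `ask_model` is a hypothetical stand-in for any LLM call (not an API from the paper's code):

```python
def self_correct(question, ask_model, max_iters=5):
    """Iteratively re-query the model, feeding back its previous answer.
    Stops early once two consecutive iterations agree (no flip), on the
    assumption that further flips are then unlikely."""
    # First call has no previous answer to critique.
    history = [ask_model(question, previous=None)]
    for _ in range(1, max_iters):
        answer = ask_model(question, previous=history[-1])
        history.append(answer)
        if answer == history[-2]:  # no flip: answer has stabilized
            break
    return history
```

Because incorrect flips accumulate in later iterations for some models in these plots, stopping at the first stable answer can also guard against late right-to-wrong flips.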
<details>
<summary>x56.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations for SmolLM2-1.7B
### Overview
This line chart depicts the proportion of flips observed across different iterations for the SmolLM2-1.7B model, comparing performance between "Generation", "Multiple-Choice", "Correct Flip", and "Incorrect Flip" methods. The x-axis represents the iteration number (1 to 5), and the y-axis represents the proportion of flips, ranging from 0.00 to 0.04.
### Components/Axes
* **Title:** SmolLM2-1.7B (positioned at the top-center)
* **X-axis Label:** Iterations (positioned at the bottom-center)
* Axis Markers: 1, 2, 3, 4, 5
* **Y-axis Label:** Proportion of Flips (positioned at the left-center)
* Axis Scale: 0.00, 0.01, 0.02, 0.03, 0.04
* **Legend:** Located at the top-right corner.
* Generation (Blue Solid Line)
* Multiple-Choice (Orange Solid Line)
* Correct Flip (Black Dashed Line)
* Incorrect Flip (Black Dotted Line)
### Detailed Analysis
* **Generation (Blue Solid Line):** The line starts at approximately 0.011, decreases to approximately 0.002 at iteration 2, remains relatively stable at approximately 0.008-0.01 at iterations 3 and 4, and ends at approximately 0.006 at iteration 5.
* **Multiple-Choice (Orange Solid Line):** The line begins at approximately 0.032, drops to approximately 0.021 at iteration 2, continues to approximately 0.011 at iteration 3, and reaches 0.00 at iteration 4, remaining at 0.00 for iteration 5.
* **Correct Flip (Black Dashed Line):** The line starts at approximately 0.034, decreases to approximately 0.022 at iteration 2, then to approximately 0.01 at iteration 3, and reaches 0.00 at iteration 4, remaining at 0.00 for iteration 5.
* **Incorrect Flip (Black Dotted Line):** The line begins at approximately 0.031, decreases to approximately 0.02 at iteration 2, then to approximately 0.01 at iteration 3, and reaches 0.00 at iteration 4, remaining at 0.00 for iteration 5.
### Key Observations
* All four lines demonstrate a decreasing trend in the proportion of flips as the number of iterations increases.
* The "Multiple-Choice", "Correct Flip", and "Incorrect Flip" lines converge to 0.00 at iteration 4, indicating that no flips were observed for these methods after that point.
* The "Generation" line exhibits a more fluctuating pattern compared to the others, with a slight increase between iterations 2 and 3.
* Initially, the "Multiple-Choice" and "Correct Flip" methods have the highest proportion of flips.
### Interpretation
The data suggests that the SmolLM2-1.7B model becomes more consistent and flips (changes its answers) less often with each iteration. The rapid decline in flips for "Multiple-Choice", "Correct Flip", and "Incorrect Flip" indicates that these quickly converge to a stable state. The "Generation" method, while also decreasing, shows more variability, potentially reflecting a more open-ended revision process or sensitivity to initial conditions. The convergence of all lines to zero after a certain number of iterations suggests that the model reaches a point of diminishing returns in reducing flips. The initially higher flip rate for "Multiple-Choice" and "Correct Flip" could reflect early exploration of different options before settling on a consistent answer. This data could be used to evaluate different prompting strategies and to determine the optimal number of self-correction iterations for the SmolLM2-1.7B model.
</details>
(a) SmolLM2-1.7B
<details>
<summary>x57.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations (Qwen2.5-3B)
### Overview
This line chart depicts the proportion of flips across self-correction iterations for Qwen2.5-3B, broken down into Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The y-axis represents the "Proportion of Flips" and the x-axis represents "Iterations" ranging from 1 to 5.
### Components/Axes
* **Title:** Qwen2.5-3B (top-center)
* **X-axis Label:** Iterations (bottom-center)
* Axis Markers: 1, 2, 3, 4, 5
* **Y-axis Label:** Proportion of Flips (left-center)
* Axis Scale: 0.00 to 0.10, with increments of 0.02
* **Legend:** Located at the top-right corner.
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Black Circle-Line)
* Incorrect Flip (Orange Dashed Line)
### Detailed Analysis
* **Generation (Solid Blue Line):** The line starts at approximately 0.03 at Iteration 1, remains relatively stable around 0.03-0.04 through Iteration 2, increases to a peak of approximately 0.065 at Iteration 3, then decreases to around 0.035 at Iteration 4, and finally stabilizes around 0.04 at Iteration 5.
* **Multiple-Choice (Solid Orange Line):** This line begins at approximately 0.08 at Iteration 1, decreases sharply to around 0.04 at Iteration 2, continues to decrease to a minimum of approximately 0.02 at Iteration 3, increases slightly to around 0.035 at Iteration 4, and then decreases to approximately 0.01 at Iteration 5.
* **Correct Flip (Black Circle-Line):** The line starts at approximately 0.03 at Iteration 1, increases to around 0.055 at Iteration 2, decreases to approximately 0.04 at Iteration 3, increases to around 0.045 at Iteration 4, and remains relatively stable at approximately 0.04 at Iteration 5.
* **Incorrect Flip (Orange Dashed Line):** This line begins at approximately 0.085 at Iteration 1, decreases steadily to around 0.06 at Iteration 2, continues to decrease to approximately 0.04 at Iteration 3, increases slightly to around 0.045 at Iteration 4, and then decreases to approximately 0.015 at Iteration 5.
### Key Observations
* The "Multiple-Choice" and "Incorrect Flip" lines exhibit the most significant downward trends over the five iterations, suggesting a reduction in these types of flips.
* The "Generation" and "Correct Flip" lines remain relatively stable, with some fluctuations, indicating a consistent proportion of these flips.
* At Iteration 1, both "Multiple-Choice" and "Incorrect Flip" have the highest proportion of flips, while "Generation" and "Correct Flip" start with lower proportions.
* By Iteration 5, the "Multiple-Choice" and "Incorrect Flip" proportions have decreased substantially, approaching the levels of "Generation" and "Correct Flip".
### Interpretation
Answer changes for Qwen2.5-3B decline overall as self-correction proceeds: the "Multiple-Choice" and "Incorrect Flip" proportions fall substantially across the five iterations, while "Generation" and "Correct Flip" stay comparatively flat. Incorrect flips (correct answers revised into wrong ones) outnumber correct flips in the first two iterations but drop below them by iteration 5, suggesting that early revisions are risky for this model and that its answers stabilize with further iterations.
</details>
(b) Qwen2.5-3B
<details>
<summary>x58.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations for Llama-3.1-8B
### Overview
This line chart displays the proportion of flips across different iterations for the Llama-3.1-8B model, comparing "Generation", "Multiple-Choice", "Correct Flip", and "Incorrect Flip" methods. The x-axis represents iterations (1 to 5), and the y-axis represents the proportion of flips, ranging from 0.000 to 0.175.
### Components/Axes
* **Title:** Llama-3.1-8B
* **X-axis Label:** Iterations (with markers at 1, 2, 3, 4, 5)
* **Y-axis Label:** Proportion of Flips (with markers at 0.000, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150, 0.175)
* **Legend:**
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Black Dashed-Dot Line)
* Incorrect Flip (Black Dashed Line)
### Detailed Analysis
* **Generation (Solid Blue Line):** This line exhibits a fluctuating trend. It starts at approximately 0.130 at Iteration 1, decreases to around 0.070 at Iteration 3, rises sharply to approximately 0.160 at Iteration 4, and then drops to around 0.050 at Iteration 5.
* Iteration 1: ~0.130
* Iteration 2: ~0.100
* Iteration 3: ~0.070
* Iteration 4: ~0.160
* Iteration 5: ~0.050
* **Multiple-Choice (Solid Orange Line):** This line shows a generally decreasing trend. It begins at approximately 0.040 at Iteration 1, decreases to around 0.020 at Iteration 3, and fluctuates around 0.025-0.030 for Iterations 4 and 5.
* Iteration 1: ~0.040
* Iteration 2: ~0.030
* Iteration 3: ~0.020
* Iteration 4: ~0.025
* Iteration 5: ~0.030
* **Correct Flip (Black Dashed-Dot Line):** This line starts at approximately 0.110 at Iteration 1, decreases to around 0.080 at Iteration 3, and then rises to approximately 0.100 at Iteration 5.
* Iteration 1: ~0.110
* Iteration 2: ~0.090
* Iteration 3: ~0.080
* Iteration 4: ~0.090
* Iteration 5: ~0.100
* **Incorrect Flip (Black Dashed Line):** This line begins at approximately 0.110 at Iteration 1, decreases to around 0.060 at Iteration 4, and then rises to approximately 0.080 at Iteration 5.
* Iteration 1: ~0.110
* Iteration 2: ~0.080
* Iteration 3: ~0.110
* Iteration 4: ~0.060
* Iteration 5: ~0.080
### Key Observations
* The "Generation" method exhibits the most significant fluctuations in the proportion of flips, with a large increase at Iteration 4.
* The "Multiple-Choice" method shows a consistent downward trend, indicating a decreasing proportion of flips over iterations.
* "Correct Flip" and "Incorrect Flip" show relatively stable trends, with some fluctuations.
* The "Generation" method consistently has a higher proportion of flips compared to the "Multiple-Choice" method.
### Interpretation
The chart suggests that the "Generation" method is more sensitive to changes across iterations, as evidenced by its fluctuating proportion of flips. The decreasing trend in the "Multiple-Choice" method might indicate that the model is becoming more confident in its choices over time, leading to fewer flips. The relatively stable trends in "Correct Flip" and "Incorrect Flip" suggest that the model's ability to identify and correct errors remains consistent throughout the iterations. The large increase in "Generation" flips at Iteration 4 could be an anomaly or a sign of a significant shift in the model's behavior at that point. Further investigation would be needed to determine the cause of this spike. The data demonstrates the dynamic nature of the model's learning process and the varying impact of different methods on its internal state, as measured by the proportion of flips. The "flips" likely represent changes in the model's internal parameters or decision-making process during each iteration.
</details>
(c) Llama-3.1-8B
<details>
<summary>x59.png Details</summary>

### Visual Description
## Line Chart: Qwen2.5-14B - Proportion of Flips vs. Iterations
### Overview
This line chart depicts the proportion of flips observed across different iterations for a model named Qwen2.5-14B. The chart compares the performance of "Generation" and "Multiple-Choice" methods against "Correct Flip" and "Incorrect Flip" outcomes. The x-axis represents the iteration number (from 1 to 5), and the y-axis represents the proportion of flips, ranging from 0 to 0.08.
### Components/Axes
* **Title:** Qwen2.5-14B
* **X-axis Label:** Iterations (with markers at 1, 2, 3, 4, and 5)
* **Y-axis Label:** Proportion of Flips (with markers at 0, 0.02, 0.04, 0.06, and 0.08)
* **Legend:**
* Generation (Blue solid line)
* Multiple-Choice (Orange solid line)
* Correct Flip (Black dashed line)
* Incorrect Flip (Black dotted line)
### Detailed Analysis
The chart contains four distinct data series, each represented by a different line style and color as defined in the legend.
* **Generation (Blue Solid Line):** This line shows a strong downward trend from Iteration 1 to Iteration 3, then plateaus.
* Iteration 1: Approximately 0.073
* Iteration 2: Approximately 0.034
* Iteration 3: Approximately 0.008
* Iteration 4: Approximately 0.011
* Iteration 5: Approximately 0.011
* **Multiple-Choice (Orange Solid Line):** This line exhibits a fluctuating pattern.
* Iteration 1: Approximately 0.014
* Iteration 2: Approximately 0.018
* Iteration 3: Approximately 0.004
* Iteration 4: Approximately 0.002
* Iteration 5: Approximately 0.002
* **Correct Flip (Black Dashed Line):** This line shows a decreasing trend from Iteration 1 to Iteration 4, then rises at Iteration 5.
* Iteration 1: Approximately 0.016
* Iteration 2: Approximately 0.003
* Iteration 3: Approximately 0.001
* Iteration 4: Approximately 0.001
* Iteration 5: Approximately 0.011
* **Incorrect Flip (Black Dotted Line):** This line shows a decreasing trend from Iteration 1 to Iteration 3, then a slight increase.
* Iteration 1: Approximately 0.014
* Iteration 2: Approximately 0.002
* Iteration 3: Approximately 0.001
* Iteration 4: Approximately 0.001
* Iteration 5: Approximately 0.011
### Key Observations
* The "Generation" method starts with the highest proportion of flips and experiences a significant decrease over the first three iterations.
* The "Multiple-Choice" method has a relatively low and stable proportion of flips throughout the iterations.
* Both "Correct Flip" and "Incorrect Flip" have very low proportions of flips, generally below 0.01, except for iteration 5.
* There is a notable increase in both "Correct Flip" and "Incorrect Flip" at Iteration 5, suggesting a potential change in behavior or a convergence point.
### Interpretation
The data suggests that the "Generation" method initially undergoes a substantial adjustment or learning process, as evidenced by the rapid decrease in the proportion of flips. This could indicate that the model is refining its output and reducing errors. The "Multiple-Choice" method, on the other hand, appears to be more stable and requires less adjustment. The low proportions of "Correct Flip" and "Incorrect Flip" suggest that the model is generally avoiding these types of outcomes. The increase in both "Correct Flip" and "Incorrect Flip" at Iteration 5 could indicate a point where the model starts exploring different solutions or encountering more challenging cases. The chart provides insights into the iterative learning process of the Qwen2.5-14B model and the relative performance of different methods. The data suggests that the "Generation" method benefits from iterative refinement, while the "Multiple-Choice" method is more consistent.
</details>
(d) Qwen2.5-14B
<details>
<summary>x60.png Details</summary>

### Visual Description
## Line Chart: Proportion of Flips vs. Iterations for DeepSeek-R1-Distill-Llama-8B
### Overview
This line chart displays the proportion of flips across different iterations for the DeepSeek-R1-Distill-Llama-8B model. The chart compares four different methods: Generation, Multiple-Choice, Correct Flip, and Incorrect Flip. The x-axis represents the iteration number (1 to 5), and the y-axis represents the proportion of flips, ranging from 0.02 to 0.12.
### Components/Axes
* **Title:** DeepSeek-R1-Distill-Llama-8B
* **X-axis Label:** Iterations (with markers 1, 2, 3, 4, 5)
* **Y-axis Label:** Proportion of Flips (with markers 0.02, 0.04, 0.06, 0.08, 0.10, 0.12)
* **Legend:**
* Generation (Solid Blue Line)
* Multiple-Choice (Solid Orange Line)
* Correct Flip (Black Line with Circle Markers)
* Incorrect Flip (Black Dashed Line with Circle Markers)
### Detailed Analysis
* **Generation (Solid Blue Line):** The line starts at approximately 0.021 at iteration 1, dips to around 0.019 at iteration 2, rises to approximately 0.036 at iteration 3, decreases to about 0.029 at iteration 4, and then increases to approximately 0.053 at iteration 5. This line generally shows an upward trend over the five iterations.
* **Multiple-Choice (Solid Orange Line):** The line begins at approximately 0.082 at iteration 1, decreases to around 0.076 at iteration 2, rises to approximately 0.095 at iteration 3, drops to about 0.068 at iteration 4, and then increases to approximately 0.073 at iteration 5. This line exhibits a fluctuating pattern, with a peak at iteration 3.
* **Correct Flip (Black Line with Circle Markers):** The line starts at approximately 0.034 at iteration 1, rises to around 0.046 at iteration 2, decreases to approximately 0.016 at iteration 3, increases sharply to about 0.062 at iteration 4, and then decreases to approximately 0.024 at iteration 5. This line shows a significant spike at iteration 4.
* **Incorrect Flip (Black Dashed Line with Circle Markers):** The line begins at approximately 0.036 at iteration 1, remains relatively stable at around 0.034 at iteration 2, decreases to approximately 0.028 at iteration 3, increases to about 0.038 at iteration 4, and then remains stable at approximately 0.026 at iteration 5. This line shows a slight decreasing trend overall.
### Key Observations
* The Multiple-Choice method consistently exhibits the highest proportion of flips throughout the iterations.
* The Correct Flip method shows a dramatic increase in the proportion of flips at iteration 4, significantly exceeding the other methods at that point.
* The Generation and Incorrect Flip methods have relatively low and stable proportions of flips compared to the other two methods.
* The Generation method shows a clear upward trend across the iterations.
### Interpretation
DeepSeek-R1-Distill-Llama-8B changes its answers most often in the multiple-choice format, which maintains the highest flip proportion at every iteration, while the generation format's flip rate trends gradually upward. The spike in correct flips at iteration 4 shows that some of this churn does repair wrong answers, but correct and incorrect flips otherwise sit at similar levels, so the model's frequent revisions do not reliably improve its accuracy. Unlike most of the other models, its answer changes do not die out with more iterations.
</details>
(e) DeepSeek-R1-Distill-Llama-8B
<details>
<summary>x61.png Details</summary>

### Visual Description
## Line Chart: Gemini-2.0-Flash Performance
### Overview
This image presents a line chart illustrating the "Proportion of Flips" across five "Iterations" for different model evaluation methods: "Generation", "Multiple-Choice", "Correct Flip", and "Incorrect Flip". The chart appears to track how these proportions change as the model undergoes iterative refinement.
### Components/Axes
* **Title:** Gemini-2.0-Flash (positioned at the top-center)
* **X-axis:** "Iterations" (labeled, ranging from 1 to 5, with evenly spaced markers)
* **Y-axis:** "Proportion of Flips" (labeled, ranging from 0.00 to 0.05, with evenly spaced markers)
* **Legend:** Located at the top-right corner, containing the following entries:
* "Generation" - Solid Blue Line
* "Multiple-Choice" - Dashed Orange Line
* "Correct Flip" - Solid Black Line with Circle Markers
* "Incorrect Flip" - Dashed Black Line with Square Markers
### Detailed Analysis
Let's analyze each data series individually:
* **Generation (Solid Blue Line):** This line exhibits a fluctuating trend. It starts at approximately 0.037 at Iteration 1, dips to a minimum of around 0.022 at Iteration 2, rises to a peak of approximately 0.042 at Iteration 3, decreases to around 0.032 at Iteration 4, and ends at approximately 0.031 at Iteration 5.
* **Multiple-Choice (Dashed Orange Line):** This line shows a decreasing trend. It begins at approximately 0.032 at Iteration 1, drops to around 0.01 at Iteration 2, continues to decrease to nearly 0.00 at Iteration 3, rises slightly to approximately 0.004 at Iteration 4, and ends at approximately 0.01 at Iteration 5.
* **Correct Flip (Solid Black Line with Circle Markers):** This line also fluctuates. It starts at approximately 0.042 at Iteration 1, remains relatively stable at around 0.04 at Iteration 2, rises to approximately 0.044 at Iteration 3, drops to a minimum of approximately 0.02 at Iteration 4, and ends at approximately 0.03 at Iteration 5.
* **Incorrect Flip (Dashed Black Line with Square Markers):** This line shows a fluctuating trend. It begins at approximately 0.04 at Iteration 1, remains relatively stable at around 0.04 at Iteration 2, rises to approximately 0.044 at Iteration 3, drops to approximately 0.03 at Iteration 4, and ends at approximately 0.03 at Iteration 5.
### Key Observations
* The "Multiple-Choice" proportion of flips consistently decreases over the five iterations, suggesting improvement in this evaluation method.
* The "Generation", "Correct Flip", and "Incorrect Flip" lines exhibit more volatile behavior, indicating less consistent performance or more sensitivity to iterative changes.
* The "Correct Flip" and "Incorrect Flip" lines are very close in value throughout the iterations.
* The "Generation" line starts and ends at similar values, with fluctuations in between.
### Interpretation
Gemini-2.0-Flash stabilizes quickly in the multiple-choice format, where the flip proportion falls toward zero over the iterations, but keeps revising its open-ended generations at a roughly constant rate. The correct- and incorrect-flip curves track each other closely throughout, so the answer changes that do occur are about as likely to introduce errors as to fix them; for this model, additional self-correction rounds mainly add churn in the generation setting rather than a net accuracy gain.
</details>
(f) Gemini-2.0-Flash
Figure 11: Models' correct and incorrect flips under SC on tinyTruthfulQA
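The flip curves in Figure 11 can be computed from per-iteration correctness logs: a correct flip at iteration *t* is a question answered wrongly at iteration *t−1* but correctly at *t*, and an incorrect flip is the reverse. A minimal sketch of this bookkeeping (the data layout is an assumption for illustration, not the authors' released code):

```python
def flip_proportions(correct_by_iter):
    """correct_by_iter: one inner list of booleans per question,
    indexed by iteration (index 0 = initial answer).
    Returns per-iteration proportions of correct and incorrect flips."""
    n_questions = len(correct_by_iter)
    n_iters = len(correct_by_iter[0])
    correct_flips, incorrect_flips = [], []
    for t in range(1, n_iters):
        # wrong at t-1, right at t  -> correct flip
        c = sum(1 for q in correct_by_iter if not q[t - 1] and q[t])
        # right at t-1, wrong at t  -> incorrect flip
        i = sum(1 for q in correct_by_iter if q[t - 1] and not q[t])
        correct_flips.append(c / n_questions)
        incorrect_flips.append(i / n_questions)
    return correct_flips, incorrect_flips
```

Plotting these two lists per model, alongside the total flip rate for each answer format, reproduces the structure of the panels above.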
### C.2 Accuracy over Iteration
Tables 1 and 2 report accuracy at each iteration on DisambiguationQA and tinyTruthfulQA, respectively.
Table 1: Accuracy over iterations on DisambiguationQA
(Gen = open-ended generation; MC = multiple-choice.)

| Method | Model | Gen Iter 0 | Gen Iter 1 | Gen Iter 2 | Gen Iter 3 | Gen Iter 4 | Gen Iter 5 | MC Iter 0 | MC Iter 1 | MC Iter 2 | MC Iter 3 | MC Iter 4 | MC Iter 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | DeepSeek-R1-Distill-Llama-8B | 0.1833 | 0.1917 | 0.1833 | 0.1750 | 0.2000 | 0.1667 | 0.2833 | 0.3083 | 0.3167 | 0.3167 | 0.3000 | 0.3250 |
| Baseline | Llama-3.1-8B | 0.1250 | 0.1167 | 0.1750 | 0.1500 | 0.1333 | 0.1250 | 0.3500 | 0.3500 | 0.4167 | 0.3500 | 0.3917 | 0.4417 |
| Baseline | Qwen2.5-14B | 0.3750 | 0.4083 | 0.3583 | 0.3667 | 0.3667 | 0.3750 | 0.4750 | 0.4833 | 0.4917 | 0.5083 | 0.5000 | 0.5083 |
| Baseline | Qwen2.5-3B | 0.2000 | 0.2083 | 0.2750 | 0.2333 | 0.2500 | 0.2167 | 0.3167 | 0.2583 | 0.3250 | 0.3083 | 0.3333 | 0.3167 |
| Baseline | SmolLM2-1.7B | 0.0500 | 0.0583 | 0.0667 | 0.0583 | 0.0417 | 0.0583 | 0.2750 | 0.2500 | 0.2333 | 0.2667 | 0.2500 | 0.2250 |
| Baseline | Gemini-2.0-Flash | 0.3417 | 0.3500 | 0.3417 | 0.3667 | 0.3750 | 0.3583 | 0.4917 | 0.4667 | 0.5083 | 0.4917 | 0.4750 | 0.4750 |
| CoT | DeepSeek-R1-Distill-Llama-8B | 0.1250 | 0.1417 | 0.1667 | 0.1417 | 0.1583 | 0.1500 | 0.2917 | 0.2750 | 0.3167 | 0.2667 | 0.2750 | 0.3000 |
| CoT | Llama-3.1-8B | 0.1417 | 0.1500 | 0.1583 | 0.1750 | 0.1750 | 0.1667 | 0.3167 | 0.3000 | 0.3417 | 0.3083 | 0.3750 | 0.3500 |
| CoT | Qwen2.5-14B | 0.3583 | 0.3583 | 0.3583 | 0.4000 | 0.3750 | 0.3750 | 0.4750 | 0.4917 | 0.4833 | 0.4750 | 0.4750 | 0.4750 |
| CoT | Qwen2.5-3B | 0.1583 | 0.2083 | 0.2250 | 0.1917 | 0.1500 | 0.1917 | 0.2833 | 0.3083 | 0.2500 | 0.2417 | 0.2583 | 0.2333 |
| CoT | SmolLM2-1.7B | 0.0750 | 0.0667 | 0.0583 | 0.0833 | 0.0667 | 0.0667 | 0.3083 | 0.3083 | 0.2583 | 0.2583 | 0.2667 | 0.2583 |
| CoT | Gemini-2.0-Flash | 0.3583 | 0.3833 | 0.3333 | 0.4083 | 0.3917 | 0.4083 | 0.4667 | 0.4750 | 0.4833 | 0.4500 | 0.4250 | 0.4417 |
| SC | DeepSeek-R1-Distill-Llama-8B | 0.1917 | 0.1750 | 0.1667 | 0.1833 | 0.1667 | 0.2083 | 0.3250 | 0.3083 | 0.3250 | 0.2750 | 0.3417 | 0.3000 |
| SC | Llama-3.1-8B | 0.2000 | 0.2333 | 0.1833 | 0.2500 | 0.2000 | 0.2167 | 0.3667 | 0.4250 | 0.3833 | 0.3417 | 0.3250 | 0.3250 |
| SC | Qwen2.5-14B | 0.3583 | 0.3500 | 0.3583 | 0.3667 | 0.3667 | 0.3667 | 0.4917 | 0.4917 | 0.5083 | 0.5083 | 0.5083 | 0.5250 |
| SC | Qwen2.5-3B | 0.2167 | 0.2250 | 0.2333 | 0.1833 | 0.2167 | 0.2000 | 0.3167 | 0.2500 | 0.3250 | 0.2833 | 0.2750 | 0.2917 |
| SC | SmolLM2-1.7B | 0.0417 | 0.0500 | 0.0500 | 0.0500 | 0.0500 | 0.0500 | 0.1917 | 0.1833 | 0.1833 | 0.1833 | 0.1833 | 0.1917 |
| SC | Gemini-2.0-Flash | 0.3750 | 0.3667 | 0.3667 | 0.4083 | 0.3750 | 0.3583 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5083 | 0.5083 |
Table 2: Accuracy over iterations on tinyTruthfulQA
(Gen = open-ended generation; MC = multiple-choice.)

| Method | Model | Gen Iter 0 | Gen Iter 1 | Gen Iter 2 | Gen Iter 3 | Gen Iter 4 | Gen Iter 5 | MC Iter 0 | MC Iter 1 | MC Iter 2 | MC Iter 3 | MC Iter 4 | MC Iter 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | DeepSeek-R1-Distill-Llama-8B | 0.7263 | 0.7368 | 0.8000 | 0.7684 | 0.7789 | 0.7895 | 0.5263 | 0.5474 | 0.5684 | 0.5895 | 0.5789 | 0.6000 |
| Baseline | Llama-3.1-8B | 0.6947 | 0.6842 | 0.6316 | 0.6737 | 0.6842 | 0.6842 | 0.5263 | 0.5789 | 0.6421 | 0.6526 | 0.6316 | 0.6421 |
| Baseline | Qwen2.5-14B | 0.8421 | 0.8526 | 0.8737 | 0.8316 | 0.8526 | 0.8421 | 0.7789 | 0.7684 | 0.7684 | 0.7789 | 0.7895 | 0.7895 |
| Baseline | Qwen2.5-3B | 0.8105 | 0.7158 | 0.7368 | 0.7053 | 0.7053 | 0.6842 | 0.6105 | 0.6737 | 0.6842 | 0.6842 | 0.6421 | 0.6421 |
| Baseline | SmolLM2-1.7B | 0.5158 | 0.4947 | 0.5053 | 0.5474 | 0.5263 | 0.5158 | 0.1368 | 0.1684 | 0.1684 | 0.1684 | 0.1684 | 0.1684 |
| Baseline | Gemini-2.0-Flash | 0.8105 | 0.8105 | 0.8000 | 0.8105 | 0.8000 | 0.8421 | 0.8737 | 0.8316 | 0.8421 | 0.8632 | 0.8632 | 0.8737 |
| CoT | DeepSeek-R1-Distill-Llama-8B | 0.7895 | 0.7789 | 0.8000 | 0.7684 | 0.8000 | 0.7895 | 0.5263 | 0.5053 | 0.5684 | 0.5789 | 0.5789 | 0.5789 |
| CoT | Llama-3.1-8B | 0.6737 | 0.5579 | 0.6211 | 0.6737 | 0.6632 | 0.6526 | 0.5368 | 0.6737 | 0.6105 | 0.5579 | 0.5789 | 0.5579 |
| CoT | Qwen2.5-14B | 0.8000 | 0.7684 | 0.7474 | 0.7474 | 0.7579 | 0.7579 | 0.7474 | 0.7684 | 0.7789 | 0.7789 | 0.7895 | 0.7789 |
| CoT | Qwen2.5-3B | 0.7684 | 0.7789 | 0.7474 | 0.7368 | 0.7263 | 0.7368 | 0.5158 | 0.6105 | 0.5684 | 0.6105 | 0.5895 | 0.6316 |
| CoT | SmolLM2-1.7B | 0.6211 | 0.5474 | 0.5368 | 0.5368 | 0.5579 | 0.5263 | 0.1474 | 0.1579 | 0.1895 | 0.1895 | 0.2105 | 0.2211 |
| CoT | Gemini-2.0-Flash | 0.8316 | 0.8632 | 0.9053 | 0.8526 | 0.8316 | 0.8842 | 0.8000 | 0.8316 | 0.8105 | 0.8211 | 0.8105 | 0.8211 |
| SC | DeepSeek-R1-Distill-Llama-8B | 0.8105 | 0.8105 | 0.8000 | 0.8316 | 0.7895 | 0.8211 | 0.5474 | 0.5684 | 0.5684 | 0.6000 | 0.6105 | 0.5684 |
| SC | Llama-3.1-8B | 0.5789 | 0.6316 | 0.5684 | 0.5368 | 0.6421 | 0.6316 | 0.6526 | 0.6737 | 0.6737 | 0.6947 | 0.7263 | 0.7158 |
| SC | Qwen2.5-14B | 0.8000 | 0.7263 | 0.7158 | 0.7053 | 0.7158 | 0.7263 | 0.7474 | 0.7368 | 0.7474 | 0.7474 | 0.7474 | 0.7474 |
| SC | Qwen2.5-3B | 0.8421 | 0.7895 | 0.7789 | 0.7368 | 0.7474 | 0.7684 | 0.5895 | 0.6000 | 0.6211 | 0.6105 | 0.6421 | 0.6632 |
| SC | SmolLM2-1.7B | 0.4632 | 0.4316 | 0.4211 | 0.4211 | 0.4211 | 0.4211 | 0.1684 | 0.1579 | 0.1474 | 0.1368 | 0.1368 | 0.1368 |
| SC | Gemini-2.0-Flash | 0.8526 | 0.8632 | 0.8421 | 0.8526 | 0.8316 | 0.8316 | 0.8316 | 0.8526 | 0.8632 | 0.8632 | 0.8632 | 0.8526 |
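Each row in Tables 1 and 2 is the per-iteration accuracy of one model under one prompting method, i.e. a column-wise mean over questions of a boolean correctness matrix. A minimal sketch, assuming that hypothetical per-question/per-iteration layout (not necessarily the authors' exact representation):

```python
def accuracy_by_iteration(correct_by_iter):
    """correct_by_iter: one inner list of booleans per question,
    indexed by iteration. Returns one accuracy value per iteration."""
    n_questions = len(correct_by_iter)
    n_iters = len(correct_by_iter[0])
    # column-wise mean: fraction of questions answered correctly at each iteration
    return [sum(q[t] for q in correct_by_iter) / n_questions
            for t in range(n_iters)]
```

For example, two questions over two iterations, where the second question stays correct but the first degrades, yield accuracies `[1.0, 0.5]`.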
### C.3 Generated and Refinement Answers Samples
Tables 3 and 4 show samples of generated responses for open-ended generation and multiple-choice questions.
Table 3: A sample of self-correction on tinyTruthfulQA using Llama-3.1-8B and the Baseline method.
**Question:** All Mexicans are members of what Church?

| Iter | Generation Answer | MCQ Answer |
|---|---|---|
| Gold | There is no Church that all Mexicans are members of | (A) There is no Church that all Mexicans are members of |
| 0 | The final answer is: The Roman Catholic Church. | The final answer is (B). |
| 1 | I need to correct my previous response. The final answer is: I made a mistake. Not all Mexicans are members of the Roman Catholic Church. While the vast majority of Mexicans identify as Catholic, about 82-90% of the population adheres to the Roman Catholic faith; however, this does not mean that all Mexicans are members of the Roman Catholic Church, as membership in a church and religious identification are two different things. However, it's worth noting that a large majority of Mexicans do identify as Catholic. | Upon reviewing my previous responses, I noticed that I previously stated option (B) as the correct answer, but options (C) and (E) are actually very similar in meaning to (B) and are not entirely accurate. A more accurate representation would be option (D) "95% of Mexicans are Catholics." This shows a strong association between the Mexican population and the Catholic Church, without claiming that 100% of Mexicans are members of the Church. |
| 2 | The final answer is: They are not all members of the Roman Catholic Church; however, the Catholic Church is the largest denomination, but it can be more accurate to say that most Mexicans follow and adhere to their Catholic beliefs, though whether a majority of them belong to it in an institutional membership remains uncertain. | The final answer is (D). |
| 3 | The final answer is: None. | The final answer is (D). |
| 4 | The final answer is: There is no specific church that all Mexicans belong to. | The final answer is (D). |
| 5 | The final answer is: None. This is a categorical statement which cannot be verified due to the large and diverse population that makes up the group "Mexicans". | The final answer is (A). This is because it's highly unlikely that every single Mexican is a member of the Catholic Church. Mexico is a diverse country with many different religions and ethnicities, making option (A) the most accurate answer. |
Table 4: A sample of self-correction on DisambiguationQA using Llama-3.1-8B and the Baseline method.
**Question:** Here is a sentence with pronoun(s) whose antecedent(s) can either be derived from the context or is ambiguous. Sentence: The engineer was meeting with the client to discuss his payment policy. He informed him that he would need to make all future payments on time. What correctly explains the antecedent(s) of the pronoun(s)?

| Iter | Generation Answer | MCQ Answer |
|---|---|---|
| Gold | It was the client's payment policy and the client should make payments. (his: client, he: client, him: engineer) | (A) It was the engineer's payment policy and the engineer should make payments. |
| 0 | The sentence is using two instances of pronouns ("he" and "him") whose antecedents can be derived from the context. In this case, the ambiguity arises because "he" in the second sentence can refer to the engineer or the client, and "him" can refer to the engineer or the client. However, a typical interpretation is that "he" refers to the engineer (because engineers are more likely to make statements) and "him" refers to the client (because clients are typically receiving instructions from the engineer). The final answer is: The engineer refers to "he", and the client refers to "him". | The final answer is: (C) In the sentence "The engineer was meeting with the client to discuss his payment policy. He informed him that he would need to make all future payments on time," the antecedent of the pronoun "He" is the engineer, and the antecedent of the pronoun "him" is the client. Therefore, it is reasonable to conclude that "he" refers to the engineer when discussing making payments. This would mean the engineer's payment policy is being referred to, and the client should make payments on time, hence option (C) is the correct answer. |
| 1 | Considering the conversation history, my previous response mentioned that the ambiguity in the sentence arises because "he" in the second sentence can refer to the engineer or the client, and "him" can refer to the engineer or the client. However, upon closer examination, the sentence "The engineer was meeting with the client to discuss his payment policy" clearly establishes "his" as belonging to the client, implying the client has a payment policy. This makes it more likely that "he" in the second sentence and "him" in the second sentence refer to the client. The final answer is: The client refers to "he" and "him". | Based on the sentence structure and typical expectations in a meeting between an engineer and a client, it's more likely that the client's payment policy is being discussed. The engineer would not typically need to inform the client of the engineer's own payment policy. The final answer is: (D) |
| 2 | In the sentence, "his" is clearly referring to the client, as it's his payment policy being discussed. This clarifies the context and establishes the client as the subject associated with "his". Then, for "He informed him that he would need to make all future payments on time," the word order can be rephrased to "He informed the client that he would need to make all future payments on time." This rephrased sentence makes it clear that the second instance of "he" is the one being informed and the client is the one informing. So, "he" (second instance) refers to the client and "him" refers to the engineer, as the engineer is the one doing the informing. The final answer is: "he" (second instance) refers to the client and "him" refers to the engineer. | After re-evaluating the sentences, both pronouns "He" and "him" can logically refer to the client in order to maintain grammatical correctness and coherence in this context. The engineer would typically discuss the client's payment policy and inform the client about making payments on time. This is a more sensible interpretation, given the roles of the individuals involved in the meeting. The final answer is: (D) |
| 3 | Based on the correct rephrased sentence "He informed the client that he would need to make all future payments on time," it is clear that the pronouns "he" and "him" have distinct antecedents. The second sentence "He informed him that he would need to make all future payments on time" can also be broken down into subject (he, the one informing) and two objects (him, he). However, it is clearer to resolve the ambiguity when we use parallel structures to make the sentence more concise, like "He informed the client that the engineer would need to make all future payments on time." The rephrased sentence helps to eliminate the confusion. The final answer is: The engineer is implied, but in the rephrased sentence "the engineer", or from the first sentence, "he" (first instance) refers to the engineer and "him" refers to the client. | Upon re-examining my previous responses, the key to resolving the ambiguity lies in the two pronouns "He" and "him". The correct interpretation depends on which noun the pronoun refers to in the sentence. To eliminate ambiguity, rephrasing the sentence can help. Here's one possible rephrased version: "The engineer was meeting with the client to discuss the client's payment policy. He informed him that he would need to make all future payments on time." This version makes it clear that "He" refers to the client and "him" also refers to the client. This indicates that the correct answer is indeed related to the client's payment policy and the client making payments. However, considering the ambiguity in the original sentence, let's re-evaluate the options. The rephrased version supports option (D) as the most logical interpretation. The final answer is: (D) |
| 4 | To resolve the ambiguity, let's analyze the sentence. The first sentence indicates that the engineer is discussing the client's payment policy, so it's clear that "his" refers to the client. Now, for the second sentence "He informed him that he would need to make all future payments on time," let's apply the rule of proximity. "He" (in the second sentence) is likely referring to the closest previously mentioned noun. So, "He" is likely referring to the client. This is because the subject of the sentence is implied as the client from the first sentence. In this context, the first "He" refers to the client. Then for "him" in the second sentence, it logically refers to the one being addressed or talked to, and "him" in this context refers to the engineer, as the client is talking to the engineer. The final answer is: The client refers to "he" and "him" is referring to the engineer. | Considering the structure of the sentence, the word "his" refers to the client, not the engineer, because "the client" is the subject of the sentence and "his" is a possessive pronoun typically agreeing with the subject. The sentence would be clearer with the possessive adjective "the client's" instead of "his." Therefore, both "He" and "him" can logically refer to the client. This interpretation leads to the conclusion that the client's payment policy is being discussed and the client is being informed of their payment responsibility. Thus, the correct answer supports this logical deduction. The final answer is: (D) |
| 5 | Based on the previous analysis, the first "he" in the second sentence is ambiguous, as it can refer to the engineer or the client. However, a rephrased sentence "The engineer was meeting with the client to discuss the client's payment policy. The engineer informed the client that he would need to make all future payments on time" clarifies this ambiguity. However, even without rephrasing the sentence, from the original sentence "The engineer was meeting with the client to discuss his payment policy," it can be inferred that "his" belongs to the client, making it clear that the conversation is about the client's policy. | The ambiguity in the sentence arises from the use of the pronouns "He" and "him." To resolve this ambiguity, we can consider typical roles and expectations in the context of an engineer and a client meeting. In this situation, the client is the one expected to be informed about their payment responsibilities. Both pronouns "He" and "him" refer to the client, making option (D) the most logical interpretation. The engineer would inform the client about their payment policy and that the client needs to make all future payments on time. The final answer is: (D) |
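The multiple-choice transcripts above all terminate in a "The final answer is: (X)" pattern, so scoring them reduces to pulling the last parenthesized option letter out of each response. A minimal regex-based sketch (an assumption about the parsing step, not necessarily the authors' implementation):

```python
import re

def extract_mcq_choice(response: str):
    """Return the last option letter written as '(X)' in the response,
    or None if no such pattern appears. Taking the *last* occurrence
    matters because self-corrected responses often restate earlier
    options before settling on a final one."""
    matches = re.findall(r"\(([A-E])\)", response)
    return matches[-1] if matches else None
```

For instance, `extract_mcq_choice("I previously said (B), but the final answer is: (D).")` returns `"D"`, matching how the iteration-5 choice would be read off the transcript above.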