2507.02778

# Self-Correction Bench: Uncovering and Addressing the Self-Correction Blind Spot in Large Language Models

**Authors**:

- Ken Tsui (Independent Researcher)

## Abstract

Although large language models (LLMs) have transformed AI, they still make mistakes and can explore unproductive reasoning paths. Self-correction capability is essential for deploying LLMs in safety-critical applications. We uncover a systematic failure: LLMs cannot correct errors in their own outputs while successfully correcting identical errors from external sources - a limitation we term the Self-Correction Blind Spot. To study it, we introduce Self-Correction Bench, an evaluation framework that measures this phenomenon through controlled error injection at three complexity levels. Testing 14 open-source non-reasoning models, we find an average 64.5% blind spot rate. We provide multiple lines of evidence suggesting this limitation may be influenced by training data: human demonstrations rarely include error-correction sequences (favoring error-free responses), whereas reinforcement learning (RL) trained models learn error correction via outcome feedback. Remarkably, appending a minimal “ Wait ” prompt activates an 89.3% reduction in blind spots, suggesting dormant capabilities that require triggering. Our work highlights a critical limitation potentially influenced by training distribution and offers a practical approach to enhance LLM reliability and trustworthiness - vital for safety-critical domains.

## 1 Introduction

Large Language Models (LLMs) have rapidly advanced natural language processing, achieving state-of-the-art results on a diverse range of tasks (OpenAI et al., 2024; Anthropic, 2024; Gemini Team, 2025; Yang et al., 2025; Meta, 2025; DeepSeek-AI et al., 2025a).
However, despite their impressive capabilities, LLMs are known to exhibit unpredictable failures and generate inaccurate information (Maynez et al., 2020; Huang et al., 2025; Bang et al., 2023; Shi et al., 2023), or to explore an unproductive reasoning path and commit to it. A particularly concerning issue is that LLMs can make errors even in simple tasks (Nezhurina et al., 2025), despite having the necessary underlying knowledge to provide the correct solutions, raising reliability concerns that hinder deployment in critical applications.

Studying LLM self-correction behavior in natural settings is difficult: because models are usually accurate, naturally occurring errors are too rare for systematic evaluation. To address this, we construct Self-Correction Bench by systematically injecting errors into LLM reasoning traces, enabling us to test self-correction in reproducible scenarios and to quantify performance reliably.

Our results reveal that LLMs fail to correct their own errors (an average blind spot of 64.5%), but reliably fix identical errors from external sources. We refer to this phenomenon as the Self-Correction Blind Spot. This rules out knowledge deficiency as the root cause - instead, the blind spot stems from a lack of activation for self-correction. Strikingly, appending a simple “ Wait ” reduces the blind spot by 89.3%, confirming that a minimal prompt can unlock latent correction abilities. We provide a behavioral explanation for why “ Wait ” works, supported by systematic analysis of correction marker patterns in post-training data. We also analyze why RL-trained reasoning models do not exhibit such a blind spot.

Our contributions are threefold.
- Discovery of the Self-Correction Blind Spot: a systematic failure of LLMs to correct their own errors despite competency on external ones, potentially influenced by post-training data biases where human demonstrations rarely include self-correction sequences, unlike RL models that learn error correction through outcome feedback.
- Self-Correction Bench: a controlled evaluation framework with error-injected reasoning traces for fair cross-model comparison.
- Effective intervention: appending “ Wait ” reduces blind spots by 89.3% - demonstrating that activation is the limiting factor.

These results both advance our understanding of LLM reasoning flaws and provide a practical solution to improve their reliability in real-world use.

## 2 Related work

Intrinsic self-correction in LLMs. Recent work has highlighted intrinsic self-correction, where LLMs generate feedback on their own outputs (Shinn et al., 2023; Madaan et al., 2023; Kim et al., 2023; Kamoi et al., 2024a), as a way to improve performance, but critical limitations persist. Self-feedback quality is limited by the model’s existing knowledge, and without oracle labels, LLMs struggle to correct errors (Huang et al., 2024): prior studies attribute this to poor error localization (Tyen et al., 2024) and detection (Kamoi et al., 2024b). Most approaches rely on multi-step prompting, whereas we focus on single-pass inference (self-correction in one completion) and study limitations from a cognitive perspective - rather than just knowledge bounds. Related work using RL for self-correction (Kumar et al., 2025) or training signals from ground truth (DeepSeek-AI et al., 2025a) contrasts with our test-time, no-finetuning approach.

Prompt injection for evaluation. Traditional prompt injection research focuses on adversarial manipulation (e.g., attackers injecting malicious instructions to distort outputs) (Wei et al., 2023; Liu et al., 2024). Controlled error injection to evaluate self-correction is underexplored.
For example, Lanham et al. (2023) injected mistakes into reasoning chains to measure consistency between steps and conclusions, but not self-correction capability. Our work advances this by systematically injecting errors across task complexities to reveal uncharacterized blind spots in how LLMs correct themselves.

Hallucination snowballing. Zhang et al. (2024) demonstrate that once LLMs hallucinate, subsequent tokens often align with the initial error, a “snowball” effect, suggesting inherent limits to self-correction during generation. We explain this phenomenon by identifying the Self-Correction Blind Spot: LLMs reliably correct errors in external inputs, but fail to correct errors in their own outputs. This distinction is critical to understanding why snowballing persists.

Test-time interventions. Recent efforts have shifted compute from training to test time (Snell et al., 2025), yielding improved performance (e.g., Muennighoff et al. (2025) append “Wait” to force longer reasoning traces). While these methods work, the reason for the improvement remains understudied. We provide a behavioral explanation by testing unfinetuned models: interventions activate otherwise inactive self-correction mechanisms, thereby improving performance on tasks where models might make errors.

Cognitive bias in LLMs. LLMs exhibit human-like cognitive biases (Koo et al., 2024; Echterhoff et al., 2024; Jones and Steinhardt, 2022), and we link the bias blind spot (the tendency to overlook one’s own biases) (Pronin et al., 2002) to impaired self-correction. This connects high-level cognitive limitations to the fine-grained failure mode we characterize.

Our work integrates these threads into a systematic methodology for testing self-correction, reveals that LLMs suffer from a fundamental blind spot (inability to correct their own errors), and demonstrates how test-time interventions can activate dormant self-correction mechanisms without finetuning.
## 3 Conceptual motivation

Building on these insights, we now formalize the conceptual framework underlying our empirical study, focusing on error states, self-correction mechanisms, and their measurement.

### 3.1 Error and self-correction: the case for marginalization

Autoregressive LLMs cannot guarantee that every generated token is correct as the number of tokens grows, resulting in hallucination (Maynez et al., 2020), snowballing errors (Zhang et al., 2024), unproductive reasoning paths, or execution flaws. Thus, self-correction is necessary for robustness: models must reverse errors to produce a correct answer. Note that a correct answer does not require all previously generated tokens to be correct, as one might be concerned only with the final answer.

To formalize this, let $\mathcal{E}=\{e_{0},e_{1},\dots,e_{k}\}$ denote a set of mutually exclusive and collectively exhaustive discrete error states, where $e_{0}$ represents the “no error” state, and $e_{1},\dots,e_{k}$ represent distinct error conditions. For each state $e_{i}\in\mathcal{E}$, let $R_{e_{i}}$ denote the response set. The probability of a model, $M$, giving a correct answer can be marginalized over error states:

$$
P_{M}(r_{correct})=\sum_{e\in\mathcal{E}}P_{M}(e)\cdot P_{M}(r_{correct}|e)=\sum_{e\in\mathcal{E}}\sum_{r_{m}\in R_{e}}P_{M}(e)\cdot P_{M}(r_{m}|e)\cdot P_{M}(r_{correct}|r_{m},e), \tag{1}
$$

where $r_{m}$ is the model’s response, and $P_{M}(r_{correct}|r_{m},e)$ captures self-correction of $r_{m}$. Here, $P_{M}(r_{correct})$ depends critically on self-correction: even with frequent errors, high $P_{M}(r_{correct}|r_{m},e)$ can yield strong performance. Error-free generation is a special case of this framework - not the only path to correctness.

### 3.2 External and internal self-correction, and Self-Correction Blind Spot

We distinguish self-correction by error source:

1. Internal correction: Metacognitive monitoring of the model’s own initial response $r_{m}$.
2. External correction: Evaluation of errors in the user prompt $r_{u}$.

This distinction is motivated by the cognitive bias known as the “bias blind spot”. Pronin et al. (2002) show that humans are able to identify cognitive biases in others while failing to see those same biases in themselves, suggesting LLMs trained on human demonstrations might share this limitation. To quantify this, we define the Self-Correction Blind Spot as:

$$
\text{Self-Correction Blind Spot}=\begin{cases}1-\frac{P_{M}(r_{correct}|r_{m},e)}{P_{M}(r_{correct}|r_{u},e)}&\text{if }P_{M}(r_{correct}|r_{u},e)>0\\ 0&\text{if }P_{M}(r_{correct}|r_{u},e)=0\end{cases} \tag{2}
$$

A value of 1 indicates a total blind spot: the model can correct external errors but not its own. By design, the metric controls for confounding factors, including internal model knowledge, because both conditions present the same error to the same model.

### 3.3 Controlled error injection: measuring self-correction in practice

The marginalization framework (Equation 1) is intractable in practice: $P_{M}(e)$, the true probability of error states, is unobservable as LLMs operate over infinite prompt spaces. To solve this, we introduce a controlled error $e_{controlled}$. For internal correction, we inject an incorrect partial response into the model’s “own” output (omitting stop tokens to allow continuation/self-correction); for external correction, we inject the same error into the user prompt instead. We empirically estimate $P_{M}(r_{correct}|r_{m},e_{controlled})$ and $P_{M}(r_{correct}|r_{u},e_{controlled})$ by measuring $P_{M}(r_{correct})$ when the controlled error is injected into $r_{m}$ and $r_{u}$ respectively, which forms the basis of our Self-Correction Bench.

This framework - marginalizing over errors, distinguishing internal/external correction, and using controlled injection - provides the conceptual underpinning for our empirical analysis.
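To make Equation 2 and the two injection conditions concrete, the following is a minimal sketch rather than the benchmark's actual code. The function names `blind_spot` and `build_conditions` and the message-dictionary layout are illustrative assumptions; in the real setup the internal condition is rendered through a model-specific chat template with the stop token omitted so that the model continues the injected response.

```python
from typing import Dict, List

Message = Dict[str, str]


def blind_spot(p_internal: float, p_external: float) -> float:
    """Equation 2: 1 - P(correct | r_m, e) / P(correct | r_u, e).

    p_internal: accuracy when the error sits in the model's own response.
    p_external: accuracy when the same error sits in the user prompt.
    Returns 0 when the external accuracy is 0, as in the definition.
    """
    if p_external == 0:
        return 0.0
    return 1.0 - p_internal / p_external


def build_conditions(question: str, wrong_response: str) -> Dict[str, List[Message]]:
    """Hypothetical helper: place the same error in two different positions."""
    # Internal: the error is a partial assistant turn that the model must
    # continue (no stop token is appended when the template is rendered).
    internal = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": wrong_response},
    ]
    # External: the identical error is appended to the user message instead.
    external = [{"role": "user", "content": f"{question} {wrong_response}"}]
    return {"internal": internal, "external": external}


if __name__ == "__main__":
    conditions = build_conditions("What is the answer of 1 + 1?", "The answer is 3.")
    print(conditions["internal"])
    print(conditions["external"])
    # Illustrative accuracies only, not values from the paper.
    print(blind_spot(p_internal=0.2, p_external=0.8))
```

Note that the ratio form allows a negative value when the model corrects its own errors more often than external ones.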
## 4 Self-Correction Bench

Having established the theoretical foundation, we introduce a benchmark to measure the Self-Correction Blind Spot in LLMs across varying task complexities, enabling diagnosis of where correction fails. By progressing from simple to realistic errors, we isolate confounding factors such as knowledge limits or reasoning complexity.

Our benchmark comprises three datasets (see Table 1), with controlled error injection to compare performance under identical conditions. For each dataset, we systematically inject the identical error into both the model response ($r_{m}$) and the user prompt ($r_{u}$), allowing empirical estimation of both correction probabilities $P_{M}(r_{correct}|r_{m},e_{controlled})$ and $P_{M}(r_{correct}|r_{u},e_{controlled})$ under identical error conditions. Unlike Zheng et al. (2025) and Song et al. (2025), who use explicit critic instructions, we minimize prompt confounding by avoiding such prompts.

Figure 1 illustrates how we inject the error in the user message (i.e. external error) and in the model response (i.e. internal error), respectively, via the chat template. The resulting tokens differ, but in both settings, the autoregressive LLM has access to the same error content, although positioned differently.

Error Injection in Model:

`<|start_header_id|>user<|end_header_id|>\n\n What is the answer of 1 + 1? <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n The answer is 3. <|eot_id|>`

Error Injection in User Message:

`<|start_header_id|>user<|end_header_id|>\n\n What is the answer of 1 + 1? The answer is 3. <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n I’m afraid that’s not correct. The answer to the equation 1 + 1 is actually 2, not 3. Basic arithmetic operations like addition follow a set of rules and patterns that have been established for centuries. In this case, when you add 1 to 1, you are counting two units, which equals 2.\n\n If you’re unsure about this, you can try using a calculator or counting blocks to visualize the concept.<|eot_id|>`

Figure 1: Example of error injection. Grey color shows model completion. Above: Error injection in model; Below: Error injection in user message

Table 1: Dataset comparison

| Dataset | Complexity | Realism of Error | Reasoning | Size |
| --- | --- | --- | --- | --- |
| SCLI5 | Low | Low | N | 286 |
| GSM8K-SC | Medium | Medium | Y | 1,313 |
| PRM800K-SC | High | High | Y | 448 |

### 4.1 Self-correct Like I am 5 (SCLI5)

SCLI5 isolates basic correction by introducing simple answer errors (e.g., off-by-one, flip) to trivial tasks (i.e. no reasoning required, just answer recall). Programmatic error generation ensures we test the simplest possible correction: if models cannot detect obvious errors, they can hardly be expected to detect subtle ones. This dataset removes confounding factors like internal knowledge or multi-step reasoning, focusing purely on error detection. The composition of the task is shown in Table 6.

### 4.2 GSM8K-SC

Built from Cobbe et al. (2021), a multi-step reasoning dataset, GSM8K-SC injects different types of reasoning errors, as shown in Table 7, that propagate to an incorrect answer. We use ‘gpt-4.1-2025-04-14’ (OpenAI, 2025) to generate controlled errors and ‘gemini-2.5-flash-preview-05-20’ (Gemini Team, 2025) to validate that incorrect reasoning leads to inconsistent answers, resulting in 1,313 high-quality samples. This dataset tests correction in multi-step reasoning, a middle ground between simplicity and realism. The prompt can be found in Appendix D.1.

### 4.3 PRM800K-SC

PRM800K (Lightman et al., 2024), derived from a subset of MATH (Hendrycks et al., 2021), provides step-by-step annotations of multi-step reasoning. We selected 448 samples where the generated answers do not match the ground truth, capturing errors from real-world LLM use.
This progression, from simple answer errors to realistic failures, lets us map where self-correction breaks down, making the benchmark a useful tool for diagnosing and improving LLM robustness.

## 5 Experiment

### 5.1 Experiment setup

We evaluated a wide range of open-source LLMs, as closed-source models lack the fine-grained control over prefix injection that our methodology requires. We apply model-specific chat templates using the ‘transformers’ library (Wolf et al., 2020). We use the DeepInfra (https://deepinfra.com/) completion API with temperature 0.0, as models’ most confident predictions should help self-correction, and a fixed token budget of 1,024 to isolate the effect of test-time compute. We give further rationale for these choices and a sensitivity analysis in Appendix C, which confirms that the results are robust.

Evaluation. We use ‘gemini-2.5-flash-preview-05-20’ to compare LLMs’ completions against the ground-truth answer. We instruct the model to output in JSON format. Due to the objectivity of the task and the provision of ground truth in the prompt, we do not believe there is significant bias. The prompt is provided in Appendix D.2. We manually review 100 samples for each dataset to ensure evaluation quality.

Metrics. We evaluate whether LLMs can self-correct and arrive at the ground-truth answer given an error. In GSM8K-SC and PRM800K-SC, we measure the behavior of LLMs before they commit to an answer, as this is the more common scenario in which an LLM backtracks, although we also report results after an answer has been committed. We report mean accuracy ($P_{M}(r_{correct})$) and Self-Correction Blind Spot for each model. For statistical rigor, we report the 95% confidence interval, which is estimated by adding and subtracting 1.96 times the standard error of the mean (SEM) from the mean. The SEM is estimated using the formula $\sigma_{M}=\frac{\sigma}{\sqrt{N}}$, where N is the sample size and $\sigma$ is the sample standard deviation.
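The confidence-interval recipe above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name `mean_with_ci` is hypothetical, per-sample outcomes are assumed to be coded as 1 (correct) or 0 (incorrect), and the sample standard deviation is assumed to use the $N-1$ denominator.

```python
import math
from typing import Sequence, Tuple


def mean_with_ci(outcomes: Sequence[int], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean accuracy, half-width of the 95% confidence interval).

    The half-width is z * SEM, with SEM = sigma / sqrt(N) and sigma the
    sample standard deviation (N - 1 denominator, an assumption here).
    """
    n = len(outcomes)
    if n < 2:
        raise ValueError("need at least two samples to estimate the SEM")
    mean = sum(outcomes) / n
    variance = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    sem = math.sqrt(variance) / math.sqrt(n)
    return mean, z * sem


if __name__ == "__main__":
    # Hypothetical outcome vector: 271 correct out of 286 samples.
    outcomes = [1] * 271 + [0] * 15
    mean, half_width = mean_with_ci(outcomes)
    print(f"{mean:.3f} ± {half_width:.3f}")
```

With the hypothetical 271 of 286 correct, the result rounds to a mean of 0.948 with a half-width of 0.026, the same form as the entries reported in Table 2.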
### 5.2 Result

Table 2: Mean accuracy and 95% confidence interval of models at temperature 0.0

| Model | SCLI5 | GSM8K-SC | PRM800K-SC |
| --- | --- | --- | --- |
| Llama-4-Maverick-17B-128E-Instruct-FP8 (Meta, 2025) | 0.948 ± 0.026 | 0.416 ± 0.027 | 0.455 ± 0.046 |
| DeepSeek-V3-0324 (DeepSeek-AI et al., 2025b) | 0.825 ± 0.044 | 0.399 ± 0.026 | 0.475 ± 0.046 |
| Qwen2.5-72B-Instruct (Qwen et al., 2025) | 0.92 ± 0.032 | 0.58 ± 0.027 | 0.154 ± 0.033 |
| Llama-4-Scout-17B-16E-Instruct (Meta, 2025) | 0.976 ± 0.018 | 0.24 ± 0.023 | 0.263 ± 0.041 |
| Llama-3.3-70B-Instruct (Meta, 2024) | 0.538 ± 0.058 | 0.275 ± 0.024 | 0.246 ± 0.04 |
| Qwen3-235B-A22B† (Yang et al., 2025) | 0.563 ± 0.058 | 0.073 ± 0.014 | 0.348 ± 0.044 |
| phi-4 (Abdin et al., 2024) | 0.808 ± 0.046 | 0.076 ± 0.014 | 0.092 ± 0.027 |
| Qwen2.5-7B-Instruct (Qwen et al., 2025) | 0.559 ± 0.058 | 0.19 ± 0.021 | 0.141 ± 0.032 |
| Qwen2-7B-Instruct (Yang et al., 2024) | 0.601 ± 0.057 | 0.078 ± 0.014 | 0.058 ± 0.022 |
| Qwen3-14B† (Yang et al., 2025) | 0.004 ± 0.007 | 0.092 ± 0.016 | 0.254 ± 0.04 |
| Qwen3-30B-A3B† (Yang et al., 2025) | 0.056 ± 0.027 | 0.061 ± 0.013 | 0.194 ± 0.037 |
| Llama-3.1-8B-Instruct (Grattafiori et al., 2024) | 0.136 ± 0.04 | 0.019 ± 0.007 | 0.02 ± 0.013 |
| Qwen3-32B† (Yang et al., 2025) | 0.004 ± 0.007 | 0.05 ± 0.012 | 0.083 ± 0.026 |
| Mistral-Small-24B-Instruct-2501 (Team, 2025) | 0.042 ± 0.023 | 0.011 ± 0.006 | 0.016 ± 0.012 |

† Qwen3 series models use non-thinking mode.

Table 2 summarizes the mean accuracy and 95% confidence intervals of state-of-the-art non-reasoning LLMs. We observe notably low accuracy for SCLI5 in some models (e.g., 0.004 for Qwen3-14B and Qwen3-32B). We observe moderate to strong positive correlations between SCLI5, GSM8K-SC and PRM800K-SC (see Figure 5), suggesting that the limitation in self-correction holds across task complexities.
If LLMs cannot self-correct on either easy or hard tasks, this implies an activation problem rather than a knowledge problem. In Figure 6, we show that some models (e.g. Qwen3-32B, Llama-3.1-8B-Instruct and Mistral-Small-24B-Instruct-2501) frequently give empty responses, highlighting unawareness of the error.

We identify a statistically significant Self-Correction Blind Spot for most models (see Figure 2). The blind spot, 64.5% on average, exists across models regardless of model size. We observe moderate correlation across datasets (see Figure 3), indicating a fundamental rather than task-specific limitation. On average, once a model has committed to an answer, its blind spot to its own error is much larger, a finding similar to Zhang et al. (2024).

<details>
<summary>images/blind_spot_summary_default_non_reasoning.png Details</summary>

![5b67ea4b](/v1/image/5b67ea4b0879a8ed1989df8c30f2fd8d6b0e6ae41835b8534a1a709374e477d7)

### Visual Description

## [Bar Chart]: Blind Spot summary across datasets - 95% Confidence Intervals

### Overview

The image is a bar chart displaying **"Self-Correction Blind Spot"** values (y-axis) across 14 AI models (x-axis) for five data series drawn from three datasets. Each bar includes a 95% confidence interval (error bar) to show variability. The chart compares the blind spot across datasets, with "Before commit answer" and "After commit answer" variants for two datasets (GSM8K-SC, PRM800K-SC) and a single variant for SCLI5.

### Components/Axes

- **Title**: *"Blind Spot summary across datasets - 95% Confidence Intervals"*
- **Y-axis**: Label = *"Self-Correction Blind Spot"*; Scale = 0.0 to 1.0 (increments of 0.2).
- **X-axis**: Label = *"Models"*; 14 model names (listed below).
- **Legend** (top-left):
    - SCLI5 (blue)
    - GSM8K-SC (Before commit answer) (orange)
    - GSM8K-SC (After commit answer) (light orange)
    - PRM800K-SC (Before commit answer) (green)
    - PRM800K-SC (After commit answer) (light green)

### Detailed Analysis (Data Points)

Approximate values (mean ± 95% CI) for each model and dataset:

| Model Name | SCLI5 (blue) | GSM8K-SC (Before) | GSM8K-SC (After) | PRM800K-SC (Before) | PRM800K-SC (After) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | ~0.03 (0.01–0.05) | ~0.51 (0.48–0.54) | ~0.83 (0.80–0.86) | ~0.36 (0.33–0.39) | ~0.76 (0.73–0.79) |
| DeepSeek-V3-0324 | ~0.12 (0.09–0.15) | ~0.57 (0.54–0.60) | ~0.79 (0.76–0.82) | ~0.39 (0.36–0.42) | ~0.82 (0.79–0.85) |
| Qwen2.5-72B-Instruct | ~0.08 (0.05–0.11) | ~0.39 (0.36–0.42) | ~0.37 (0.34–0.40) | ~0.75 (0.72–0.78) | ~0.70 (0.67–0.73) |
| Llama-4-Scout-17B-16E-Instruct-FP8-dynamic | ~0.01 (0.00–0.02) | ~0.73 (0.70–0.76) | ~0.96 (0.93–0.99) | ~0.55 (0.52–0.58) | ~0.92 (0.89–0.95) |
| Llama-3.3-70B-Instruct | ~0.46 (0.43–0.49) | ~0.69 (0.66–0.72) | ~0.99 (0.96–1.02) | ~0.32 (0.29–0.35) | ~0.60 (0.57–0.63) |
| Qwen3-235B-A22B | ~0.44 (0.41–0.47) | ~0.92 (0.89–0.95) | ~0.99 (0.96–1.02) | ~0.56 (0.53–0.59) | ~0.88 (0.85–0.91) |
| Phi-4 | ~0.19 (0.16–0.22) | ~0.92 (0.89–0.95) | ~0.99 (0.96–1.02) | ~0.87 (0.84–0.90) | ~0.86 (0.83–0.89) |
| Qwen2.5-7B-Instruct | ~0.18 (0.15–0.21) | ~0.79 (0.76–0.82) | ~0.87 (0.84–0.90) | ~0.76 (0.73–0.79) | ~0.87 (0.84–0.90) |
| Qwen2-7B-Instruct | ~0.13 (0.10–0.16) | ~0.91 (0.88–0.94) | ~0.98 (0.95–1.01) | ~0.91 (0.88–0.94) | ~0.93 (0.90–0.96) |
| Qwen3-14B | ~1.00 (0.97–1.03) | ~0.89 (0.86–0.92) | ~0.99 (0.96–1.02) | ~0.64 (0.61–0.67) | ~0.81 (0.78–0.84) |
| Qwen3-30B-A3B | ~0.94 (0.91–0.97) | ~0.93 (0.90–0.96) | ~1.00 (0.97–1.03) | ~0.75 (0.72–0.78) | ~0.94 (0.91–0.97) |
| Llama-3.1-8B-Instruct | ~0.83 (0.80–0.86) | ~0.97 (0.94–1.00) | ~1.00 (0.97–1.03) | ~0.89 (0.86–0.92) | ~0.94 (0.91–0.97) |
| Qwen3-32B | ~1.00 (0.97–1.03) | ~0.94 (0.91–0.97) | ~1.00 (0.97–1.03) | ~0.89 (0.86–0.92) | ~0.96 (0.93–0.99) |
| Mistral-Small-24B-Instruct-2501 | ~0.96 (0.93–0.99) | ~0.99 (0.96–1.02) | ~1.00 (0.97–1.03) | ~0.96 (0.93–0.99) | ~0.97 (0.94–1.00) |

### Key Observations

1. **SCLI5 (blue)**: Most models have low values (≤0.46), except Qwen3-14B, Qwen3-30B-A3B, Llama-3.1-8B-Instruct, Qwen3-32B, and Mistral-Small-24B-Instruct-2501 (≥0.83).
2. **GSM8K-SC (Before/After)**: The "After commit answer" bar (light orange) is higher than the "Before" bar (orange) for most models, indicating a larger blind spot once an answer has been committed.
3. **PRM800K-SC (Before/After)**: The "After commit answer" bar (light green) often exceeds the "Before" bar (green), mirroring GSM8K-SC’s trend.
4. **Error Bars**: Narrow intervals (e.g., Qwen3-235B-A22B) suggest consistent results; wider intervals (e.g., Llama-4-Scout-17B-16E-Instruct-FP8-dynamic for PRM800K-SC Before) indicate variability.
5. **Model Performance**: Larger models (e.g., Qwen3-235B-A22B, Llama-3.3-70B-Instruct) and smaller models (e.g., Qwen3-14B, Mistral-Small) show high blind spot values, especially post-commit.

### Interpretation

The chart shows the **Self-Correction Blind Spot** (how much worse a model is at correcting its own errors than identical external errors; higher is worse) across datasets. Key insights:

- **Dataset Impact**: GSM8K-SC and PRM800K-SC show larger blind spots after an answer is committed, suggesting that finalizing a response makes self-correction harder.
- **Model Size**: Both larger models (e.g., Qwen3-235B) and smaller models (e.g., Qwen3-14B) show high blind spots, indicating that size alone does not determine self-correction.
- **SCLI5 Contrast**: SCLI5 has low values for most models, suggesting that its trivial errors are easier for those models to self-correct, while a few models show a near-total blind spot even there.

This data identifies which models and datasets exhibit the largest blind spots, informing future evaluations and improvements.
</details>

Figure 2: Self-Correction Blind Spot and 95% confidence interval across models

<details>
<summary>images/blind_spot_correlation_bca_non_reasoning.png Details</summary>

![699a0701](/v1/image/699a07010834f27ca2a5d7cc14d0aa7168793a6caeb81deb7419de1eb4968d60)

### Visual Description

## [Chart Set]: Blind Spot Correlation Analysis

### Overview

The image displays three horizontally arranged charts analyzing the correlation of "Blind Spot Scores" across three different evaluation datasets: SCLI5, GSM8K-SC (BCA), and PRM800K-SC (BCA). The analysis includes a correlation matrix heatmap and two scatter plots with regression lines, comparing pairs of these datasets. The charts appear to evaluate how consistent the blind spot of various Large Language Models (LLMs) is across the datasets.

### Components/Axes

**Chart 1 (Left): Blind Spot correlation matrix**

* **Type:** Heatmap.
* **Axes:** Both the X and Y axes are labeled with the three datasets: `SCLI5`, `GSM8K-SC (BCA)`, and `PRM800K-SC (BCA)`.
* **Legend/Color Bar:** A vertical color bar on the right side indicates the correlation coefficient scale, ranging from `-1.00` (blue) to `1.00` (dark red), with `0.00` at the midpoint (light gray/white).
* **Data Labels:** Each cell in the 3x3 matrix contains the correlation coefficient value.

**Chart 2 (Middle): Blind Spot correlation: SCLI5 vs GSM8K-SC (BCA) (r = 0.667)**

* **Type:** Scatter plot with regression line.
* **X-Axis:** `SCLI5 Blind Spot Score`. Scale ranges from `0.0` to `1.0`.
* **Y-Axis:** `GSM8K-SC (BCA) Blind Spot Score`. Scale ranges from `0.0` to `1.0`.
* **Data Series:** Blue circular data points, each labeled with an LLM model name.
* **Lines:**
    * A red dashed line represents the linear regression fit.
    * A gray dotted line represents the line of perfect correlation (y=x).
* **Legend:** Located in the bottom-right corner, labeling the gray dotted line as "Perfect correlation".
**Chart 3 (Right): Blind Spot correlation: GSM8K-SC (BCA) vs PRM800K-SC (BCA) (r = 0.619)**

* **Type:** Scatter plot with regression line.
* **X-Axis:** `GSM8K-SC (BCA) Blind Spot Score`. Scale ranges from `0.3` to `1.0`.
* **Y-Axis:** `PRM800K-SC (BCA) Blind Spot Score`. Scale ranges from `0.3` to `1.0`.
* **Data Series:** Green circular data points, each labeled with an LLM model name.
* **Lines:**
    * A red dashed line represents the linear regression fit.
    * A gray dotted line represents the line of perfect correlation (y=x).
* **Legend:** Located in the top-left corner, labeling the gray dotted line as "Perfect correlation".

### Detailed Analysis

**Chart 1: Correlation Matrix Data**

The matrix shows the pairwise Pearson correlation coefficients (r) between the three datasets.

* **SCLI5 vs. GSM8K-SC (BCA):** r = 0.67
* **SCLI5 vs. PRM800K-SC (BCA):** r = 0.41
* **GSM8K-SC (BCA) vs. PRM800K-SC (BCA):** r = 0.62
* **Self-correlation (Diagonal):** All are 1.

**Chart 2: SCLI5 vs. GSM8K-SC (BCA) Scatter Plot Data**

*Trend Verification:* The blue data points show a general upward trend from left to right, indicating a positive correlation. The red regression line slopes upward. The overall correlation is noted as r = 0.667.

*Approximate Data Points (Model: [SCLI5 Score, GSM8K-SC Score]):*

* Qwen2.5-72B-Instruct: [~0.15, ~0.38]
* Llama-4-Maverick-...: [~0.05, ~0.51]
* DeepSeek-V3-0324: [~0.12, ~0.57]
* Llama-4-Scout-17B...: [~0.02, ~0.73]
* Qwen2.5-7B-Instruct: [~0.18, ~0.80]
* Qwen2-7B-Instruct: [~0.15, ~0.91]
* Llama-3.3-70B-Ins...: [~0.45, ~0.69]
* Qwen3-235B-A22B: [~0.42, ~0.92]
* Llama-3.1...: [~0.82, ~0.97]
* Qwen3-32B: [~0.95, ~0.93]
* Qwen3-14B: [~0.98, ~0.89]
* Mistral-Small-24B...: [~0.92, ~0.95]

**Chart 3: GSM8K-SC (BCA) vs. PRM800K-SC (BCA) Scatter Plot Data**

*Trend Verification:* The green data points show a general upward trend, though with more scatter than the middle chart. The red regression line slopes upward. The overall correlation is noted as r = 0.619.

*Approximate Data Points (Model: [GSM8K-SC Score, PRM800K-SC Score]):*

* Llama-3.3-70B-Ins...: [~0.68, ~0.32]
* Llama-4-Maverick-...: [~0.51, ~0.36]
* DeepSeek-V3-0324: [~0.57, ~0.39]
* Llama-4-Scout-17B...: [~0.73, ~0.55]
* Qwen3-235B-A22B: [~0.92, ~0.56]
* Qwen3-14B: [~0.89, ~0.64]
* Qwen2.5-72B-Instruct: [~0.38, ~0.75]
* Qwen2.5-7B-Instruct: [~0.78, ~0.76]
* Qwen3-30B-A3B: [~0.93, ~0.75]
* Qwen2-7B-Instruct: [~0.91, ~0.90]
* Phi-4...: [~0.93, ~0.87]
* Qwen3-32B: [~0.95, ~0.91]
* Mistral-Small-24B...: [~0.98, ~0.98]

### Key Observations

1. **Correlation Strength:** The strongest correlation (r=0.67) is between SCLI5 and GSM8K-SC (BCA). The weakest (r=0.41) is between SCLI5 and PRM800K-SC (BCA).
2. **Model Clusters:** In the scatter plots, models tend to cluster. For example, in Chart 2, several models with large blind spots (Qwen3-32B, Mistral-Small-24B, Llama-3.1) are grouped in the top-right corner, indicating high blind spot scores on both datasets. Conversely, models like Qwen2.5-72B-Instruct and Llama-4-Maverick-... are in the lower-left, indicating lower blind spot scores on both.
3. **Notable Outliers:** In Chart 3, `Qwen2.5-72B-Instruct` is a significant outlier. It has a relatively low GSM8K-SC score (~0.38) but a high PRM800K-SC score (~0.75), placing it far above the regression line. This suggests its blind spot profile differs markedly between these two datasets.
4. **Regression vs. Perfect Correlation:** In both scatter plots, the red regression line is less steep than the gray perfect correlation line. This indicates that, on average, as the score on the X-axis dataset increases, the score on the Y-axis dataset also increases, but not at a 1:1 rate. Models with very high X-axis scores tend to have slightly lower Y-axis scores than the y=x line would predict.
### Interpretation

This set of charts investigates how consistent an LLM's Self-Correction Blind Spot is across the three datasets, where a higher score means the model is worse at correcting its own errors than identical external errors.

* **What the data suggests:** The moderate positive correlations (0.41 to 0.67) indicate that a model's tendency to have "blind spots" is somewhat consistent across datasets, but far from perfectly aligned. A model with a small (or large) blind spot on one dataset is only moderately likely to show a similar pattern on another.
* **Relationship between elements:** The heatmap provides a high-level summary of the relationships, which are then visualized in detail by the scatter plots. The scatter plots reveal the individual model behaviors that create the aggregate correlation numbers. The regression lines model the general trend, while the perfect correlation line serves as a reference for ideal consistency.
* **Why it matters:** This analysis is relevant for benchmarking and model evaluation. It suggests that:
    1. **No single benchmark is sufficient:** Since correlations are not near 1.0, evaluating a model on only one of these datasets would give an incomplete picture of its "blind spot" profile.
    2. **Benchmark design matters:** The differing correlations imply the datasets (SCLI5, GSM8K-SC, PRM800K-SC) are testing related but distinct capabilities or failure modes. The weaker link between SCLI5 and PRM800K-SC suggests they may have less in common in the skills they assess.
    3. **Model-specific insights:** Outliers like `Qwen2.5-72B-Instruct` in the third chart are particularly interesting. They highlight models with unusual blind spot signatures, which could be due to specific training data, architecture, or optimization that leaves them with a smaller blind spot on one dataset (GSM8K-SC) and a larger one on another (PRM800K-SC).
</details>

Figure 3: Left: blind spot correlation matrix. Middle: scatter plot of SCLI5 vs. GSM8K-SC. Right: scatter plot of GSM8K-SC vs. PRM800K-SC. BCA: before committing an answer.

## 6 Analysis

### 6.1 How do LLMs self-correct?

Analysis of model responses reveals that external errors trigger 179.5% and 73.6% more correction markers in GSM8K-SC and PRM800K-SC, respectively. (Correction markers include “Wait”, “But”, “However”, “No”, “Hold on”, “Hang on”, “Alternatively”, and “Hmm”.) We do not see this effect in SCLI5 because the corrections there are direct, without reasoning.

This finding leads us to perform a causal intervention. We append “Wait” after the incorrect reasoning or answer to prompt LLMs to self-correct, without finetuning. We observe significant reductions in the blind spot after appending “Wait”, and in some cases a negative blind spot (see Figure 7). Averaging across models and datasets, the reduction amounts to 89.3%, and the macro average of mean accuracy increases by 156.0% (see Figure 4).

<details> <summary>images/error_injection_model_macro_averages_non_reasoning_no_wait_vs_wait.png Details</summary>

![7f5a26e0](/v1/image/7f5a26e0d78414143124d1709a403c87cc0b1bc3fde238b46e0c7c4f3b41c506)

### Visual Description

## Bar Chart: Macro Average Accuracy Comparison (Original vs. Appended Wait)

### Overview

This is a grouped bar chart comparing the macro average accuracy of 14 different large language models under two conditions: "Original" and "Appended Wait." The chart demonstrates a consistent and significant increase in accuracy for every model when the "Wait" condition is applied.

### Components/Axes

* **Title:** "Macro average accuracy increases from original to appended Wait" (centered at the top).
* **Y-Axis:** Labeled "Macro average accuracy." The scale runs from 0.0 to 0.8 with major tick marks at 0.0, 0.2, 0.4, 0.6, and 0.8.
* **X-Axis:** Labeled "Models." It lists 14 distinct model names, rotated for readability.
* **Legend:** Located in the top-right corner. It defines two data series:
  * **Original:** Represented by red/coral-colored bars.
  * **Appended Wait:** Represented by dark brown/maroon-colored bars.
* **Data Labels:** Each bar has its exact numerical value displayed directly above it.

### Detailed Analysis

The chart presents paired data for each model. The "Appended Wait" bar is taller than the "Original" bar in every single case.

**Model-by-Model Data Extraction (Original, Appended Wait):**

| Model | Original | Appended Wait |
| :--- | :--- | :--- |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | 0.606 | 0.842 |
| DeepSeek-V3-0324 | 0.567 | 0.902 |
| Qwen2.5-72B-Instruct | 0.551 | 0.770 |
| Llama-4-Scout-17B-16E-Instruct-FP8-dynamic | 0.493 | 0.764 |
| Llama-3.3-70B-Instruct | 0.353 | 0.727 |
| Qwen3-235B-A22B | 0.328 | 0.856 |
| Phi-4 | 0.325 | 0.701 |
| Qwen2.5-7B-Instruct | 0.297 | 0.670 |
| Qwen2-7B-Instruct | 0.246 | 0.586 |
| Qwen3-14B | 0.117 | 0.868 |
| Qwen3-30B-A3B | 0.104 | 0.860 |
| Llama-3.1-8B-Instruct | 0.058 | 0.524 |
| Qwen3-32B | 0.045 | 0.793 |
| Mistral-Small-24B-Instruct-2501 | 0.023 | 0.666 |

### Key Observations

1. **Universal Improvement:** All 14 models show a higher macro average accuracy in the "Appended Wait" condition compared to the "Original" condition.
2. **Magnitude of Increase:** The improvement is substantial. The smallest absolute increase is approximately +0.219 (Qwen2.5-72B-Instruct), while the largest is approximately +0.756 (Qwen3-30B-A3B).
3. **Performance Reversal:** Models with very low original accuracy (e.g., Qwen3-14B at 0.117, Qwen3-32B at 0.045) often achieve some of the highest accuracies in the "Appended Wait" condition (0.868 and 0.793, respectively), suggesting the intervention is particularly effective for these models.
4. **Top Performers:** In the "Appended Wait" condition, the highest accuracy is achieved by DeepSeek-V3-0324 (0.902), followed closely by Qwen3-14B (0.868) and Qwen3-30B-A3B (0.860).
### Interpretation

The data provides strong, consistent evidence that the "Appended Wait" technique or condition significantly boosts the macro average accuracy of the evaluated language models. The effect is not marginal; it transforms performance, especially for models that initially performed poorly.

This suggests that the "Wait" mechanism likely addresses a fundamental limitation in the models' original inference or reasoning process. It could be implementing a form of "chain-of-thought" prompting, a deliberate pause for computation, or a correction step that allows models to self-verify or refine their outputs before finalizing them. The dramatic gains for lower-performing models indicate that this mechanism may help overcome specific failure modes or instabilities in their base architecture or training. The chart effectively argues for the "Appended Wait" method as a highly reliable and impactful enhancement for a wide variety of model architectures and sizes.

</details>

Figure 4: Macro average accuracy of each non-reasoning model increases from the original response to the response with “Wait” appended.

This evidence leads us to believe that “Wait” and similar correction markers serve as strong conditioning tokens that shift the model’s probability distribution toward self-evaluation sequences: they artificially trigger the correction pathway that external errors naturally activate. We validate multiple markers to demonstrate that they can activate self-correction across models and datasets (see Table 8). All of them work, but “Wait” outperforms other markers (“But”/“However”) because the former signals re-evaluation while the latter sometimes introduce contrasting information. After the intervention, LLMs have a higher tendency to generate these markers subsequently, and the mean accuracy increases correspondingly. We observe strong correlations between the binary term frequency of correction markers and the change in accuracy in GSM8K-SC and PRM800K-SC across models (Figure 8).
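The intervention and its headline effect can be sketched in a few lines of Python. This is an illustrative sketch, not the released experiment code: the function names `build_prefilled_messages` and `relative_increase` are hypothetical, the role/content message layout is the common chat format, and how an inference stack continues an open assistant turn is an assumption that varies by provider. The accuracy values are the per-model macro averages reported for Figure 4.

```python
from statistics import mean


def build_prefilled_messages(question, flawed_response, marker="Wait"):
    """Return chat messages whose final assistant turn is the flawed response plus a marker.

    Assumption: the serving stack lets the model continue this open assistant
    turn, so generation resumes right after the marker.
    """
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"{flawed_response} {marker}"},
    ]


def relative_increase(before, after):
    """Relative change of the macro average (mean over models)."""
    return mean(after) / mean(before) - 1.0


# Per-model macro average accuracy, in the order listed for Figure 4.
ORIGINAL = [0.606, 0.567, 0.551, 0.493, 0.353, 0.328, 0.325,
            0.297, 0.246, 0.117, 0.104, 0.058, 0.045, 0.023]
WITH_WAIT = [0.842, 0.902, 0.770, 0.764, 0.727, 0.856, 0.701,
             0.670, 0.586, 0.868, 0.860, 0.524, 0.793, 0.666]

if __name__ == "__main__":
    messages = build_prefilled_messages("What is 1 + 1?", "The answer is 3.")
    print(messages[-1]["content"])  # The answer is 3. Wait
    print(f"{relative_increase(ORIGINAL, WITH_WAIT):.1%}")  # 156.0%
```

Averaging the 14 per-model values gives roughly 0.294 before and 0.752 after the intervention, which reproduces the reported 156.0% relative increase in macro average accuracy.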
### 6.2 Reasoning models

Reasoning models exhibit a small, even negative, Self-Correction Blind Spot (Figure 9), unlike non-reasoning models. The mean accuracy is reported in Figure 10. Interestingly, for some models, appending “Wait” to the base model without finetuning almost matches the performance of the finetuned/RL-trained reasoning model (see Table 3). This helps us understand one of the gaps between non-reasoning and reasoning models: reasoning models are much better at correcting their own errors (higher $P_{M}(r_{correct}|r_{m},e)$) than non-reasoning models, leading to better performance ($P_{M}(r_{correct})$) in reasoning tasks requiring trial and error. However, correction markers can narrow the gap. Correction markers are exactly what reasoning models start with when given an internal error, before arriving at the correct response (see Table 4).

Table 3: Macro average of mean accuracy of the base model, the base model with “Wait” appended, and the reasoning model

| Base model | Reasoning model | Base model accuracy | Base model + “Wait” accuracy | Reasoning model accuracy |
| --- | --- | --- | --- | --- |
| DeepSeek-V3-0324 | DeepSeek-R1-0528 | 0.578 | 0.918 | 0.908 |
| phi-4 | phi-4-reasoning-plus | 0.325 | 0.704 | 0.707 |
| Qwen3-14B (non-thinking mode) | Qwen3-14B (thinking mode) | 0.121 | 0.884 | 0.843 |
| Qwen3-32B (non-thinking mode) | Qwen3-32B (thinking mode) | 0.046 | 0.791 | 0.894 |
| Qwen3-30B-A3B (non-thinking mode) | Qwen3-30B-A3B (thinking mode) | 0.102 | 0.869 | 0.845 |
| Qwen3-235B-A22B (non-thinking mode) | Qwen3-235B-A22B (thinking mode) | 0.335 | 0.865 | 0.876 |

Table 4: Most common first word generated by reasoning models, with its relative frequency

| Model | SCLI5 | GSM8K-SC | PRM800K-SC |
| --- | --- | --- | --- |
| QwQ-32B | (‘Wait,’, 0.377) | (‘Wait,’, 0.725) | (‘Wait,’, 0.768) |
| Qwen3-14B (thinking) | (‘In’, 1.0) | (‘Wait,’, 0.38) | (‘Therefore,’, 0.219) |
| Qwen3-32B (thinking) | (‘After’, 1.0) | (‘The’, 0.288) | (‘I’, 0.189) |
| Qwen3-30B-A3B (thinking) | (‘Wait,’, 0.312) | (‘Therefore,’, 0.25) | (‘So’, 0.195) |
| Qwen3-235B-A22B (thinking) | (‘**Step-by-step’, 0.292) | (‘Wait,’, 0.198) | (‘Therefore,’, 0.256) |
| DeepSeek-R1-0528 | (‘No,’, 0.324) | (‘But’, 0.267) | (‘But’, 0.486) |
| gemma-3-12b-it | (‘The’, 0.284) | (‘The’, 0.239) | (‘Alternatively,’, 0.205) |
| gemma-3-27b-it | (‘Here’s’, 0.31) | (‘Let’, 0.256) | (‘However,’, 0.292) |
| phi-4-reasoning-plus | (‘Wait,’, 0.861) | (‘Wait,’, 0.677) | (‘However,’, 0.217) |

It is also worth noting that although Qwen3 models fuse thinking mode and non-thinking mode by continual finetuning via a unified chat template after GRPO (Shao et al., 2024), the non-thinking mode still suffers from the blind spot, unlike the thinking mode, as the chat template conditions the model into different distributions.

### 6.3 Correction markers in post-training data

These differences in reasoning models’ behavior prompted us to investigate the root cause in post-training data composition. If correction markers can narrow the gap, and if we can make non-reasoning models predict correction markers upon seeing an internal error, we can induce self-correction in non-reasoning models; that capability is already present in the model, as shown when it evaluates external errors. Motivated by this logic, we investigate the correction marker density of open-source supervised finetuning datasets (Table 5).

Data analysis reveals the statistical foundation of this phenomenon. The 95th-percentile correction marker frequency of non-reasoning datasets (e.g., OpenAssistant, OpenHermes2.5) is 1. (For OpenAssistant, we use the highest-human-rated paths of the conversation tree provided in ‘timdettmers/openassistant-guanaco’.) In contrast, reasoning datasets generated by reasoning models (e.g., Mixture-of-Thoughts, OpenThoughts3) have median marker counts of 30 to 170, with 99% of the data containing at least 1 marker.
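A marker-density analysis of this kind can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the matching rule (case-sensitive, whole-word), the nearest-rank percentile definition, the helper names `count_markers` and `nearest_rank_percentile`, and the toy corpus are all hypothetical choices made for the example.

```python
import math
import re

# Marker list as given in Section 6.1.
CORRECTION_MARKERS = ["Wait", "But", "However", "No", "Hold on",
                      "Hang on", "Alternatively", "Hmm"]

# Assumption: case-sensitive whole-word matching, so "Not" or "Nothing"
# does not count as the marker "No".
_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in CORRECTION_MARKERS) + r")\b"
)


def count_markers(text):
    """Number of correction-marker occurrences in one training sample."""
    return len(_MARKER_RE.findall(text))


def nearest_rank_percentile(values, q):
    """q-th percentile (0 < q <= 100) using the nearest-rank definition."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical toy samples standing in for dataset responses.
TOY_CORPUS = [
    "The answer is 4.",
    "Wait, that is wrong. But let me recheck. Hmm, no.",
    "Nothing to fix. Not an error.",
    "Hold on, 2 + 2 = 4. However, the question asks for 5.",
]

if __name__ == "__main__":
    counts = [count_markers(sample) for sample in TOY_CORPUS]
    print(counts)  # [0, 3, 0, 2]
    for q in (50, 95):
        print(q, nearest_rank_percentile(counts, q))
```

Applied to a full dataset, the per-sample counts would feed the percentile columns of a table such as Table 5; a different tokenization or percentile convention would shift the exact numbers.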
With such a systematic absence or presence of correction markers in training data, it follows from basic statistical modeling principles that models will predict correction markers as next tokens in proportion to their frequency in training data; Razeghi et al. (2022) and Merullo et al. (2025) have shown that LLMs perform better when the frequency of related terms in pretraining data is higher. This statistical likelihood directly determines self-correction behavior: models trained on less correction data rarely generate correction markers, perpetuating the blind spot. This single insight ties together all of our empirical observations.

Table 5: Percentiles of correction marker counts in post-training datasets

| Dataset | 1% | 5% | 10% | 25% | 50% | 75% | 90% | 95% | 99% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAssistant (Köpf et al., 2023) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| OpenHermes2.5 (Teknium, 2023) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
| Infinity-Instruct-7M (Li et al., 2025) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
| UltraFeedback (Cui et al., 2024) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| Tulu3-sft-olmo-2-mixture (Lambert et al., 2025) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| s1K-1.1 (Muennighoff et al., 2025) | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 5 | 9 |
| Mixture-of-Thoughts (Face, 2025) | 1 | 3 | 5 | 10 | 30 | 76 | 147 | 202 | 273 |
| OpenThoughts3-1.2M (Guha et al., 2025) | 14 | 66 | 96 | 132 | 170 | 213 | 253 | 278 | 326 |

## 7 Discussion

Benefit of error and self-correction data. LLMs are known to exhibit cognitive biases (Koo et al., 2024; Echterhoff et al., 2024; Jones and Steinhardt, 2022). The Self-Correction Blind Spot resembles the human bias blind spot, where the capability for self-correction is relatively limited.
We hypothesize two root causes. First, supervised fine-tuning and reinforcement learning from human feedback rely on human demonstrations and preferences (Ouyang et al., 2022), which strongly favor polished, error-free responses over those with errors and self-correction. Second, even synthetic instruction data (Teknium, 2023; Li et al., 2025) and AI-feedback reward models (Cui et al., 2024) ultimately learn from human preferences, inheriting this artifact. Traditional machine learning emphasizes aligning training data with the production environment, but human-dominated data lack exposure to the “error-and-correct” process.

Outcome-based RL such as GRPO (Shao et al., 2024) addresses this by encouraging diverse reasoning paths, including errors and self-correction, guided by ground-truth feedback, as reflected in the high correction marker density of RL-trained models’ generations in Section 6.3. This complements error-free human demonstrations and preferences, making models more robust to errors (consistent with work on learning from mistakes (An et al., 2024) and critique finetuning (Wang et al., 2025)) and better at backtracking. An error-free response is not the only path to a correct final output; errors followed by self-correction provide a training signal as important as error-free demonstrations.

Understanding cognitive behavior via markers. Frequency analysis of correction markers is a scalable way to study cognitive behaviors present in pretraining and post-training data. We believe that such markers can serve as important heuristics for pretraining and post-training data curation.

## 8 Conclusion and limitation

In this work, we identified and systematically measured the Self-Correction Blind Spot: non-reasoning LLMs fail to correct 64.5% of errors in their own outputs while successfully correcting identical external errors. This systematic failure has important implications for AI reliability and safety-critical applications.
Our controlled error injection methodology, while not perfectly capturing natural errors, demonstrates the generality of this phenomenon across error types, from artificial (SCLI5) to realistic (PRM800K-SC) scenarios. This approach isolates self-correction capabilities from confounding factors and ensures cross-model comparability. We encourage future research to expand the benchmark to programming, logic, and common-sense reasoning, and to multilingual and multimodal reasoning.

## Reproducibility statement

Our experiments use various open-source models, closed-source models, and datasets. Self-Correction Bench is available on Hugging Face. Our code for constructing datasets, running the experiments, and building tables and graphs is released on GitHub.

## Acknowledgement

We thank the open-source community for making this research possible through shared datasets, models, and libraries. We are particularly grateful to the teams behind the datasets used in our evaluation: GSM8K, PRM800K, and the various instruction tuning datasets we analyzed. We acknowledge the model developers who have made their work publicly available, including the teams at DeepSeek, Google (Gemma), Meta (Llama), Microsoft (Phi), Mistral, and Qwen. We also thank the developers of the computational infrastructure and libraries that enabled our experiments, including the transformers and datasets libraries (Hugging Face), the DeepInfra API, the Google API, and the OpenAI API.

## References

- OpenAI et al.
(2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, 
Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774. - Anthropic (2024) Anthropic. The claude 3 model family: Opus, sonnet, haiku, Mar 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf. - Gemini Team (2025) Google Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities., June 2025. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf. - Yang et al. (2025) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388. - Meta (2025) Meta. 
The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, Apr 2025. URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/. - DeepSeek-AI et al. (2025a) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. 
Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025a. URL https://arxiv.org/abs/2501.12948. - Maynez et al. (2020) Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173/. - Huang et al. (2025) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025. ISSN 1558-2868. doi: 10.1145/3703155. URL http://dx.doi.org/10.1145/3703155. - Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 
A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. In Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi, editors, Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–718, Nusa Dua, Bali, November 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.ijcnlp-main.45. URL https://aclanthology.org/2023.ijcnlp-main.45/. - Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023. - Nezhurina et al. (2025) Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models, 2025. URL https://arxiv.org/abs/2406.02061. - Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6. - Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=S37hOerQLB. - Kim et al. (2023) Geunwoo Kim, Pierre Baldi, and Stephen Marcus McAleer. 
Language models can solve computer tasks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=M6OmjAZ4CX. - Kamoi et al. (2024a) Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, and Rui Zhang. When can LLMs actually correct their own mistakes? a critical survey of self-correction of LLMs. Transactions of the Association for Computational Linguistics, 12:1417–1440, 2024a. doi: 10.1162/tacl_a_00713. URL https://aclanthology.org/2024.tacl-1.78/. - Huang et al. (2024) Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet, 2024. URL https://arxiv.org/abs/2310.01798. - Tyen et al. (2024) Gladys Tyen, Hassan Mansoor, Victor Carbune, Peter Chen, and Tony Mak. LLMs cannot find reasoning errors, but can correct them given the error location. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 13894–13908, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.826. URL https://aclanthology.org/2024.findings-acl.826/. - Kamoi et al. (2024b) Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Haoran Ranran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, and Rui Zhang. Evaluating LLMs at detecting errors in LLM responses. In First Conference on Language Modeling, 2024b. URL https://openreview.net/forum?id=dnwRScljXr. - Kumar et al. (2025) Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. Training language models to self-correct via reinforcement learning.
In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=CjwERcAU7w. - Wei et al. (2023) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: how does llm safety training fail? In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc. - Liu et al. (2024) Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In Proceedings of the 33rd USENIX Conference on Security Symposium, SEC ’24, USA, 2024. USENIX Association. ISBN 978-1-939133-44-1. - Lanham et al. (2023) Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, and Ethan Perez. Measuring faithfulness in chain-of-thought reasoning, 2023. URL https://arxiv.org/abs/2307.13702. - Zhang et al. (2024) Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A. Smith. How language model hallucinations can snowball. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 59670–59684. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/zhang24ay.html. - Snell et al. (2025) Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. 
In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n. - Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393. - Koo et al. (2024) Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. Benchmarking cognitive biases in large language models as evaluators. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 517–545, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.29. URL https://aclanthology.org/2024.findings-acl.29/. - Echterhoff et al. (2024) Jessica Maria Echterhoff, Yao Liu, Abeer Alessa, Julian McAuley, and Zexue He. Cognitive bias in decision-making with LLMs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12640–12653, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.739. URL https://aclanthology.org/2024.findings-emnlp.739/. - Jones and Steinhardt (2022) Erik Jones and Jacob Steinhardt. Capturing failures of large language models via human cognitive biases. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=fcO9Cgn-X-R. - Pronin et al. (2002) Emily Pronin, Daniel Y. Lin, and Lee Ross. The bias blind spot: Perceptions of bias in self versus others. Personality and Social Psychology Bulletin, 28(3):369–381, 2002. doi: 10.1177/0146167202286008. URL https://doi.org/10.1177/0146167202286008. - Zheng et al. 
(2025) Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning, 2025. URL https://arxiv.org/abs/2412.06559. - Song et al. (2025) Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. Prmbench: A fine-grained and challenging benchmark for process-level reward models, 2025. URL https://arxiv.org/abs/2501.03124. - Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168. - OpenAI (2025) OpenAI. Introducing gpt-4.1 in the api, Apr 2025. URL https://openai.com/index/gpt-4-1/. - Lightman et al. (2024) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi. - Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe. - Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. 
In Qun Liu and David Schlangen, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6/. - DeepSeek-AI et al. (2025b) DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jiawei Wang, Jin Chen, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Litong Wang, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runxin Xu, Ruoyu Zhang, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Shuting Pan, T. Wang, Tao Yun, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wanjia Zhao, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaokang Zhang, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xinnan Song, Xinxia Shan, Xinyi Zhou, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. 
Zhu, Yang Zhang, Yanhong Xu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Yu, Yi Zheng, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Ying Tang, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yu Wu, Yuan Ou, Yuchen Zhu, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yukun Zha, Yunfan Xiong, Yunxian Ma, Yuting Yan, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhipeng Xu, Zhiyu Wu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Ziyi Gao, and Zizheng Pan. Deepseek-v3 technical report, 2025b. URL https://arxiv.org/abs/2412.19437. - Qwen et al. (2025) Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115. - Meta (2024) Meta. Llama 3.3, Dec 2024. URL https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/. - Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, and Yi Zhang. Phi-4 technical report, 2024. 
URL https://arxiv.org/abs/2412.08905. - Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671. - Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, Danny Wyatt, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Francisco Guzmán, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Govind Thattai, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan 
Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jack Zhang, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Karthik Prasad, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Kushal Lakhotia, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Maria Tsimpoukelli, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Ning Zhang, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohan Maheswari, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal 
Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vítor Albiero, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaofang Wang, Xiaoqing Ellen Tan, Xide Xia, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aayushi Srivastava, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Amos Teo, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Dong, Annie Franco, Anuj Goyal, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Ce Liu, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Cynthia Gao, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Eric-Tuan Le, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Filippos Kokkinos, Firat Ozgenel, Francesco Caggioni, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Grant Herman, Grigory Sizov, Guangyi, Zhang, 
Guna Lakshminarayanan, Hakan Inan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Hongyuan Zhan, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Ilias Leontiadis, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Janice Lam, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kiran Jagadeesh, Kun Huang, Kunal Chawla, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Miao Liu, Michael L. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikhil Mehta, Nikolay Pavlovich Laptev, Ning Dong, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Rangaprabhu Parthasarathy, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Russ Howes, Ruty Rinott, Sachin Mehta, Sachin Siby, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Mahajan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shishir Patil, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Summer Deng, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Koehler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaojian Wu, Xiaolan Wang, Xilun Wu, Xinbo Gao, Yaniv Kleinman, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yu Zhao, Yuchen Hao, Yundi Qian, Yunlu Li, Yuzi He, Zach 
Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, and Zhiyu Ma. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783. - Team (2025) Mistral AI Team. Mistral small 3, Jan 2025. URL https://mistral.ai/news/mistral-small-3. - Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300. - Razeghi et al. (2022) Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot numerical reasoning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.59. URL https://aclanthology.org/2022.findings-emnlp.59/. - Merullo et al. (2025) Jack Merullo, Noah A. Smith, Sarah Wiegreffe, and Yanai Elazar. On linear representations and pretraining data frequency in language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=EDoD3DgivF. - Köpf et al. (2023) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Alexandrovich Glushkov, Arnav Varma Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Julian Mattick. Openassistant conversations - democratizing large language model alignment. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=VSJotgbPHF. - Teknium (2023) Teknium. 
Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023. URL https://huggingface.co/datasets/teknium/OpenHermes-2.5. - Li et al. (2025) Jijie Li, Li Du, Hanyu Zhao, Bo wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, and Yonghua Lin. Infinity instruct: Scaling instruction selection and synthesis to enhance language models, 2025. URL https://arxiv.org/abs/2506.11116. - Cui et al. (2024) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2024. URL https://openreview.net/forum?id=pNkOx3IVWI. - Lambert et al. (2025) Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. Tulu 3: Pushing frontiers in open language model post-training. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=i1uGbfHHpH. - Face (2025) Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1. - Guha et al. (2025) Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. 
Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, and Ludwig Schmidt. Openthoughts: Data recipes for reasoning models, 2025. URL https://arxiv.org/abs/2506.04178. - Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=TG8KACxEON. - An et al. (2024) Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. Learning from mistakes makes llm better reasoner, 2024. URL https://arxiv.org/abs/2310.20689. - Wang et al. (2025) Yubo Wang, Xiang Yue, and Wenhu Chen. Critique fine-tuning: Learning to critique is more effective than learning to imitate, 2025. URL https://arxiv.org/abs/2501.17703. - Renze (2024) Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.432. URL https://aclanthology.org/2024.findings-emnlp.432/.

## Appendix

## Appendix A Dataset construction

Table 6: Task composition of SCLI5

| Task | Count | Error Type | Question and Answer |
| --- | --- | --- | --- |
| Add one | 20 | Off-by-one | Q: What is the answer of 1 + 1? A: The answer is 3. |
| Subtract one | 20 | Off-by-one | Q: What is the answer of 3 - 1? A: The answer is 1. |
| Next character | 52 | Off-by-one | Q: What letter comes after A? A: The answer is C. |
| Previous character | 52 | Off-by-one | Q: What letter comes before C? A: The answer is A. |
| Larger number | 71 | Flip | Q: Which one is smaller, 1 or 2? A: The answer is 2. |
| Smaller number | 71 | Flip | Q: Which one is larger, 2 or 5? A: The answer is 2. |

Table 7: Error composition of GSM8K-SC

| Category | Description |
| --- | --- |
| Problem Representation Errors | These errors arise when the solver misunderstands or misinterprets the problem’s requirements or given information. This can involve misreading the problem statement, confusing the relationships between quantities, or failing to grasp what is being asked. |
| Planning Errors | These occur when the solver devises an incorrect or incomplete strategy to tackle the problem. This might include choosing the wrong operations, setting up flawed equations, or overlooking key components of the problem. |
| Execution Errors | These are mistakes made while carrying out the planned steps, such as errors in calculations, misapplication of mathematical rules, or procedural slip-ups, even if the plan itself is sound. |

## Appendix B Figures

<details>
<summary>images/error_injection_model_correlation_matrix_non_reasoning.png Details</summary>

![44526bbe](/v1/image/44526bbe19fc3dafc3de44a6eb926d4cefed8596557b27b2e9d2bc1d078de9ac)

### Visual Description

## Correlation Matrix and Scatter Plots: Model Performance Analysis

### Overview

The image contains three distinct charts arranged horizontally. From left to right: a correlation matrix heatmap, a scatter plot comparing SCLI5 vs GSM8K-SC performance, and a scatter plot comparing GSM8K-SC vs PRM800K-SC performance. The overall theme is analyzing the correlation between mean accuracy scores of various language models across three different evaluation datasets.
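As a rough illustration of how a correlation matrix over per-model mean accuracies could be computed, the sketch below implements the Pearson coefficient in plain Python, assuming that is the measure behind the heatmap. The accuracy values and the helper names (`pearson`, `correlation_matrix`) are hypothetical placeholders for illustration, not the paper's data or code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Map each (dataset_a, dataset_b) pair to its Pearson r.

    `scores` maps a dataset name to a list of per-model mean accuracies,
    with the models listed in the same order for every dataset.
    """
    names = list(scores)
    return {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}


# Hypothetical per-model mean accuracies (illustrative values only).
example_scores = {
    "scli5": [0.9, 0.6, 0.1, 0.05],
    "gsm8k_sc": [0.4, 0.3, 0.1, 0.02],
    "prm800k_sc": [0.45, 0.25, 0.2, 0.03],
}

matrix = correlation_matrix(example_scores)
```

The result is symmetric with ones on the diagonal, which matches the layout of a correlation heatmap; each off-diagonal entry corresponds to the `r` value a pairwise scatter plot would report.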
### Components/Axes

**Chart 1 (Left): Correlation Matrix**

* **Title:** "Correlation matrix of mean accuracy across datasets"
* **Axes Labels (Y-axis, top to bottom):** `scli5`, `gsm8k_sc`, `prm800k_sc`
* **Axes Labels (X-axis, left to right):** `scli5`, `gsm8k_sc`, `prm800k_sc`
* **Color Bar Legend (Right side):** A vertical gradient bar ranging from blue (-1.00) to red (1.00), with tick marks at -1.00, -0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75, 1.00.

**Chart 2 (Middle): Scatter Plot**

* **Title:** "SCLI5 vs GSM8K-SC (r = 0.724)"
* **X-axis:** "SCLI5 macro average" (Scale: 0.0 to 1.0)
* **Y-axis:** "GSM8K-SC macro average" (Scale: 0.0 to 1.0)
* **Legend (Top-left):** Contains two entries: "Fitted line" (red dashed line) and "Ideal line" (gray dotted line).
* **Data Points:** Blue circles, each labeled with a model name.

**Chart 3 (Right): Scatter Plot**

* **Title:** "GSM8K-SC vs PRM800K-SC (r = 0.559)"
* **X-axis:** "GSM8K-SC macro average" (Scale: 0.0 to 0.6)
* **Y-axis:** "PRM800K-SC macro average" (Scale: 0.0 to 0.6)
* **Legend (Top-left):** Contains two entries: "Fitted line" (red dashed line) and "Ideal line" (gray dotted line).
* **Data Points:** Green circles, each labeled with a model name.

### Detailed Analysis

**Chart 1: Correlation Matrix**

The heatmap displays Pearson correlation coefficients between the mean accuracy scores on three datasets.

* **Diagonal (Self-correlation):** All values are `1` (dark red), as expected.
* **Off-diagonal Values:**
  * `scli5` vs `gsm8k_sc`: **0.72** (medium orange-red)
  * `scli5` vs `prm800k_sc`: **0.49** (light orange)
  * `gsm8k_sc` vs `prm800k_sc`: **0.56** (medium orange)
* **Interpretation:** The strongest correlation (0.72) is between SCLI5 and GSM8K-SC. The weakest correlation (0.49) is between SCLI5 and PRM800K-SC.

**Chart 2: SCLI5 vs GSM8K-SC Scatter Plot**

* **Trend:** The data points show a clear positive linear trend. The red "Fitted line" slopes upward from left to right, confirming the positive correlation (r=0.724). Most points lie below the gray "Ideal line" (y=x), indicating that models generally score higher on SCLI5 than on GSM8K-SC.
* **Data Points (Approximate Coordinates - X:SCLI5, Y:GSM8K-SC):**
  * `Qwen2.5-72B-Instruct`: (~0.95, ~0.58) - Highest on both axes.
  * `Llama-4-Maverick-...`: (~0.90, ~0.40)
  * `DeepSeek-V3-0324`: (~0.85, ~0.40)
  * `Llama-3.3-70B-Ins...`: (~0.60, ~0.28)
  * `Qwen2.5-7B-Instruct`: (~0.55, ~0.19)
  * `Llama-4-Scout-17B...`: (~0.95, ~0.24) - Notable outlier, high SCLI5 but lower GSM8K-SC.
  * `Qwen2-7B-Instruct`: (~0.60, ~0.08)
  * `Qwen3-14B`: (~0.05, ~0.09)
  * `Qwen3-30B-A3B`: (~0.15, ~0.05)
  * `Qwen3-32B`: (~0.05, ~0.05)
  * `Mistral-Small-24B...`: (~0.05, ~0.01)

**Chart 3: GSM8K-SC vs PRM800K-SC Scatter Plot**

* **Trend:** The data points show a moderate positive linear trend. The red "Fitted line" slopes upward, confirming the correlation (r=0.559). The spread of points around the fitted line is wider than in the middle chart, indicating a noisier relationship. Most points are below the "Ideal line."
* **Data Points (Approximate Coordinates - X:GSM8K-SC, Y:PRM800K-SC):**
  * `DeepSeek-V3-0324`: (~0.40, ~0.48) - Highest on both axes.
  * `Llama-4-Maverick-...`: (~0.40, ~0.46)
  * `Qwen3-235B-A22B`: (~0.08, ~0.35) - Notable outlier, very low GSM8K-SC but high PRM800K-SC.
  * `Qwen3-14B`: (~0.10, ~0.26)
  * `Llama-4-Scout-17B...`: (~0.25, ~0.26)
  * `Llama-3.3-70B-Ins...`: (~0.28, ~0.25)
  * `Qwen3-30B-A3B`: (~0.05, ~0.19)
  * `Qwen2.5-7B-Instruct`: (~0.19, ~0.14)
  * `Qwen2.5-72B-Instruct`: (~0.58, ~0.15) - Notable outlier, highest GSM8K-SC but relatively low PRM800K-SC.
  * `Qwen3-32B`: (~0.08, ~0.08)
  * `Qwen2-7B-Instruct`: (~0.10, ~0.06)
  * `Mistral-Small-24B...`: (~0.02, ~0.02)

### Key Observations

1. **Strongest Link:** Performance on SCLI5 and GSM8K-SC is most strongly correlated (r=0.724).
2. **General Underperformance:** In both scatter plots, the majority of models fall below the "Ideal line" (y=x), suggesting they achieve lower macro-average scores on the second dataset (GSM8K-SC or PRM800K-SC) compared to the first (SCLI5 or GSM8K-SC).
3. **Significant Outliers:**
   * `Llama-4-Scout-17B...` in the middle chart: High SCLI5 score but disproportionately lower GSM8K-SC score.
   * `Qwen3-235B-A22B` in the right chart: Very low GSM8K-SC score but a high PRM800K-SC score.
   * `Qwen2.5-72B-Instruct` in the right chart: The highest GSM8K-SC score but a relatively low PRM800K-SC score, breaking the general trend.
4. **Model Clustering:** Lower-performing models (e.g., `Mistral-Small-24B...`, `Qwen3-32B`) cluster near the origin (0,0) in both scatter plots.

### Interpretation

The data suggests that the evaluation datasets (SCLI5, GSM8K-SC, PRM800K-SC) measure related but distinct capabilities of language models. The strong correlation between SCLI5 and GSM8K-SC indicates these two benchmarks may be testing similar underlying skills (potentially related to mathematical or logical reasoning, given the "GSM" in the name). The weaker correlation with PRM800K-SC implies it assesses a different dimension of model performance. The consistent pattern of models scoring lower on the second dataset in each pair could indicate that GSM8K-SC and PRM800K-SC are more difficult than SCLI5 and GSM8K-SC, respectively, for this set of models. The notable outliers are crucial: they represent models with specialized strengths or weaknesses. For example, `Qwen3-235B-A22B`'s performance profile suggests it may be uniquely optimized for the tasks in PRM800K-SC while lacking in GSM8K-SC skills. Conversely, `Qwen2.5-72B-Instruct` excels at GSM8K-SC but does not transfer that advantage to PRM800K-SC to the same degree as other top models.
This analysis highlights that model evaluation is multi-faceted, and a single aggregate score can mask significant performance variations across different task types.

</details>

Figure 5: Left: mean-accuracy correlation matrix across datasets. Middle: scatter plot of SCLI5 vs. GSM8K-SC. Right: scatter plot of GSM8K-SC vs. PRM800K-SC. BCA: before committing an answer.

<details>
<summary>images/error_in_error_injection_model_macro_averages_non_reasoning.png Details</summary>

![9f59a356](/v1/image/9f59a35670b6906be46332846a0f9a01c72c4e9f825215e5b34a94d669903df7)

### Visual Description

## [Bar Chart]: Error and Non-Response by Dataset and Model

### Overview

The chart visualizes **error rates** (y-axis, labeled "Error") and **non-response rates** (black cross-hatching) for 14 models (x-axis) across three datasets: *SCLI5* (light blue), *GSM8K-SC* (peach), and *PRM800K-SC* (light green). The y-axis ranges from 0.0 to 1.0, representing the combined rate of "error" (response with error) and "non-response" (no response).

### Components/Axes

- **Title**: "Error and non-response by dataset and model"
- **Y-axis**: "Error" (scale: 0.0–1.0, increments of 0.2)
- **X-axis**: "Models" (14 models, left to right):
  1. Llama-4-Maverick-17B-128E-Instruct-FP8
  2. DeepSeek-V3-0324
  3. Qwen2.5-72B-Instruct
  4. Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
  5. Llama-3.3-70B-Instruct
  6. Qwen3-235B-A22B
  7. Phi-4
  8. Qwen2.5-7B-Instruct
  9. Qwen2-7B-Instruct
  10. Qwen3-14B
  11. Qwen3-30B-A3B
  12. Llama-3.1-8B-Instruct
  13. Qwen3-32B
  14. Mistral-Small-24B-Instruct-2501
- **Legend**:
  - *SCLI5*: Light blue bars (error: response with error)
  - *GSM8K-SC*: Peach bars (error: response with error)
  - *PRM800K-SC*: Light green bars (error: response with error)
  - *Non-Response*: Black cross-hatching (no response, overlaid on bars)

### Detailed Analysis (Model-by-Model, Dataset-by-Dataset)

For each model, we analyze three datasets (SCLI5, GSM8K-SC, PRM800K-SC) with two components: **solid color** (error: response with error) and **cross-hatching** (non-response: no response). The total height of each bar is the sum of error and non-response.

| Model | SCLI5 (Blue) | GSM8K-SC (Peach) | PRM800K-SC (Green) |
| --- | --- | --- | --- |
| **Llama-4-Maverick-17B-128E-Instruct-FP8** | Error: ~0.05; Non-Response: ~0.00 | Error: ~0.58; Non-Response: ~0.00 | Error: ~0.54; Non-Response: ~0.00 |
| **DeepSeek-V3-0324** | Error: ~0.17; Non-Response: ~0.00 | Error: ~0.60; Non-Response: ~0.00 | Error: ~0.52; Non-Response: ~0.00 |
| **Qwen2.5-72B-Instruct** | Error: ~0.08; Non-Response: ~0.00 | Error: ~0.41; Non-Response: ~0.00 | Error: ~0.47; Non-Response: ~0.00 |
| **Llama-4-Scout-17B-16E-Instruct-FP8-dynamic** | Error: ~0.02; Non-Response: ~0.00 | Error: ~0.76; Non-Response: ~0.00 | Error: ~0.73; Non-Response: ~0.00 |
| **Llama-3.3-70B-Instruct** | Error: ~0.05; Non-Response: ~0.00 | Error: ~0.72; Non-Response: ~0.00 | Error: ~0.75; Non-Response: ~0.00 |
| **Qwen3-235B-A22B** | Error: ~0.30; Non-Response: ~0.13 | Error: ~0.35; Non-Response: ~0.57 | Error: ~0.53; Non-Response: ~0.12 |
| **Phi-4** | Error: ~0.19; Non-Response: ~0.01 | Error: ~0.23; Non-Response: ~0.69 | Error: ~0.68; Non-Response: ~0.23 |
| **Qwen2.5-7B-Instruct** | Error: ~0.31; Non-Response: ~0.13 | Error: ~0.60; Non-Response: ~0.21 | Error: ~0.68; Non-Response: ~0.18 |
| **Qwen2-7B-Instruct** | Error: ~0.27; Non-Response: ~0.13 | Error: ~0.72; Non-Response: ~0.20 | Error: ~0.40; Non-Response: ~0.54 |
| **Qwen3-14B** | Error: ~0.37; Non-Response: ~0.63 | Error: ~0.61; Non-Response: ~0.30 | Error: ~0.63; Non-Response: ~0.12 |
| **Qwen3-30B-A3B** | Error: ~0.37; Non-Response: ~0.57 | Error: ~0.68; Non-Response: ~0.26 | Error: ~0.67; Non-Response: ~0.13 |
| **Llama-3.1-8B-Instruct** | Error: ~0.12; Non-Response: ~0.74 | Error: ~0.07; Non-Response: ~0.91 | Error: ~0.18; Non-Response: ~0.80 |
| **Qwen3-32B** | Error: ~0.00; Non-Response: ~1.00 | Error: ~0.48; Non-Response: ~0.47 | Error: ~0.55; Non-Response: ~0.37 |
| **Mistral-Small-24B-Instruct-2501** | Error: ~0.00; Non-Response: ~0.96 | Error: ~0.05; Non-Response: ~0.94 | Error: ~0.20; Non-Response: ~0.79 |

### Key Observations

1. **Non-Response Dominance Toward the Right**:
   - Models on the left of the x-axis have minimal non-response (cross-hatching), so error rates (solid color) are visible.
   - Models on the right have **high non-response** (cross-hatching dominates), making total "error" (error + non-response) approach 1.0.
2. **Dataset-Specific Trends**:
   - *SCLI5* (blue): Low error for the left-hand models, but non-response dominates for the right-hand models (e.g., Qwen3-14B, Qwen3-32B).
   - *GSM8K-SC* (peach): Moderate error for the left-hand models, with non-response increasing sharply for the right-hand models (e.g., Llama-3.1-8B-Instruct, Mistral-Small-24B-Instruct-2501).
   - *PRM800K-SC* (green): Moderate error for the left-hand models, with non-response increasing for the right-hand models (less than GSM8K-SC).
3. **Left-to-Right Shift**:
   - Left-hand models (e.g., Llama-4-Maverick, DeepSeek-V3) almost always respond (low non-response) but have moderate error.
   - Right-hand models (e.g., Qwen3-14B, Mistral-Small-24B-Instruct-2501) frequently return no response rather than an incorrect one, leading to high total "error."
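To make the stacking of the two failure modes concrete, here is a minimal sketch of how a bar's components could be derived from per-example outcomes. The outcome labels and the function name `failure_breakdown` are assumptions introduced for illustration; the source does not specify how the underlying counts were produced.

```python
from collections import Counter

# Hypothetical outcome labels for one model on one dataset.
CORRECTED = "corrected"        # the injected error was fixed
ERROR = "error"                # a response was given but the error was not fixed
NON_RESPONSE = "non_response"  # the model returned an empty response


def failure_breakdown(outcomes):
    """Return (error_rate, non_response_rate, total_failure_rate)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    counts = Counter(outcomes)
    unknown = set(counts) - {CORRECTED, ERROR, NON_RESPONSE}
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    n = len(outcomes)
    error_rate = counts[ERROR] / n
    non_response_rate = counts[NON_RESPONSE] / n
    # Total bar height: both failure modes stacked together.
    return error_rate, non_response_rate, error_rate + non_response_rate


# Illustrative input: one corrected case, one error, two empty responses.
example = [CORRECTED, ERROR, NON_RESPONSE, NON_RESPONSE]
error_rate, non_response_rate, total = failure_breakdown(example)
```

Keeping the two rates separate, rather than reporting only their sum, is what lets a reader distinguish a model that answers incorrectly from one that returns nothing.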
### Interpretation The chart reveals a **trade-off between error (response with error) and non-response (no response)** across models and datasets: - **Early Models**: Focus on responding (low non-response) but struggle with accuracy (moderate error), especially on *GSM8K-SC* and *PRM800K-SC* (reasoning-heavy datasets). - **Later Models**: Shift toward non-response (abstaining) to avoid incorrect answers, leading to high total "error" (since non-response is counted as a failure). This suggests a design choice (e.g., models trained to avoid wrong answers) or a limitation (e.g., models struggle with the dataset and fail to generate a response). - **Dataset Complexity**: *GSM8K-SC* (math reasoning) and *PRM800K-SC* (reasoning) are more challenging, driving both error and non-response. *SCLI5* (possibly simpler) sees low error in early models but high non-response in later ones, indicating a shift in model behavior. This analysis highlights how model performance (error + non-response) evolves with generation and dataset, with non-response becoming a critical factor in later models. </details> Figure 6: Summary of error and empty response across models <details> <summary>images/blind_spot_summary_wait_non_reasoning.png Details</summary> ![6ee8ac21](/v1/image/6ee8ac21f0e5e89cc988e17a8f71d2bed9eb2d0c6b7a8c0f6b2b9abef8ddfae4) ### Visual Description ## Bar Chart: Blind Spot Summary Across Datasets (Appending "Wait") - 95% Confidence Intervals ### Overview The chart displays the **Self-Correction Blind Spot** (y-axis) for 14 language models (x-axis) across 5 datasets (legend). Each model has 5 bars (one per dataset) with error bars representing 95% confidence intervals. The "blind spot" likely measures how much a model fails to correct errors (or performance differences before/after committing an answer, with "Wait" appended). 
### Components/Axes

- **Title**: *"Blind Spot summary across datasets (Appending 'Wait') - 95% Confidence Intervals"*
- **Y-axis**: *"Self-Correction Blind Spot"* (range: -1.5 to 0.5; grid lines at -1.5, -1.0, -0.5, 0.0, 0.5).
- **X-axis**: *"Models"* (14 models: Llama-4-Maverick-17B-128E-Instruct-FP8, DeepSeek-V3-0324, Qwen2.5-72B-Instruct, Llama-4-Scout-17B-16E-Instruct-FP8-dynamic, Llama-3.3-70B-Instruct, Qwen3-235B-A22B, Phi-4, Qwen2.5-7B-Instruct, Qwen2-7B-Instruct, Qwen3-14B, Qwen3-30B-A3B, Llama-3.1-8B-Instruct, Qwen3-32B, Mistral-Small-24B-Instruct-2501).
- **Legend** (top-left, 5 categories):
  - Blue: *SCLI5 (Wait)*
  - Orange: *GSM8K-SC (Before commit answer, Wait)*
  - Light orange: *GSM8K-SC (After commit answer, Wait)*
  - Green: *PRM800K-SC (Before commit answer, Wait)*
  - Light green: *PRM800K-SC (After commit answer, Wait)*

### Detailed Analysis (Model-by-Model, Dataset-by-Dataset)

Below are approximate bar heights (y-axis values) and error bar ranges (95% CI) for each model:

| Model | SCLI5 (Wait) | GSM8K-SC (Before) | GSM8K-SC (After) | PRM800K-SC (Before) | PRM800K-SC (After) |
|-------|--------------|-------------------|------------------|---------------------|--------------------|
| Llama-4-Maverick-17B-128E-Instruct-FP8 | ~0.0 (±0.05) | ~0.0 (±0.05) | ~0.0 (±0.05) | ~0.05 (±0.1) | ~-0.1 (±0.15) |
| DeepSeek-V3-0324 | ~0.0 (±0.05) | ~0.0 (±0.05) | ~0.0 (±0.05) | ~0.0 (±0.1) | ~-0.1 (±0.15) |
| Qwen2.5-72B-Instruct | ~0.0 (±0.05) | ~0.05 (±0.1) | ~0.05 (±0.1) | ~0.3 (±0.1) | ~0.2 (±0.15) |
| Llama-4-Scout-17B-16E-Instruct-FP8-dynamic | ~0.0 (±0.05) | ~0.15 (±0.1) | ~0.1 (±0.1) | ~0.05 (±0.1) | ~-0.2 (±0.2) |
| Llama-3.3-70B-Instruct | ~0.0 (±0.05) | ~0.15 (±0.1) | ~0.1 (±0.1) | ~-0.3 (±0.2) | ~-1.1 (±0.4) |
| Qwen3-235B-A22B | ~0.0 (±0.05) | ~0.05 (±0.1) | ~0.05 (±0.1) | ~0.1 (±0.1) | ~0.1 (±0.1) |
| Phi-4 | ~0.0 (±0.05) | ~0.15 (±0.1) | ~0.15 (±0.1) | ~0.55 (±0.1) | ~0.45 (±0.1) |
| Qwen2.5-7B-Instruct | ~-0.1 (±0.1) | ~0.1 (±0.1) | ~0.05 (±0.1) | ~0.25 (±0.1) | ~0.05 (±0.15) |
| Qwen2-7B-Instruct | ~-0.2 (±0.1) | ~0.35 (±0.1) | ~0.05 (±0.1) | ~0.5 (±0.1) | ~0.25 (±0.15) |
| Qwen3-14B | ~-0.2 (±0.1) | ~-0.05 (±0.1) | ~-0.1 (±0.1) | ~0.0 (±0.1) | ~0.05 (±0.15) |
| Qwen3-30B-A3B | ~-0.1 (±0.1) | ~0.0 (±0.1) | ~-0.05 (±0.1) | ~0.1 (±0.1) | ~0.15 (±0.15) |
| Llama-3.1-8B-Instruct | ~-0.15 (±0.1) | ~0.3 (±0.1) | ~0.55 (±0.1) | ~-0.1 (±0.1) | ~0.15 (±0.15) |
| Qwen3-32B | ~-0.05 (±0.1) | ~0.05 (±0.1) | ~0.0 (±0.1) | ~0.3 (±0.1) | ~0.25 (±0.15) |
| Mistral-Small-24B-Instruct-2501 | ~-0.05 (±0.1) | ~0.1 (±0.1) | ~0.2 (±0.1) | ~0.45 (±0.1) | ~0.4 (±0.15) |

### Key Observations

- **Outlier**: *Llama-3.3-70B-Instruct* has a drastically low (negative) PRM800K-SC (After commit answer, Wait) bar (~-1.1) with a large error bar (±0.4), indicating high uncertainty.
- **High Blind Spots**: Models like *Phi-4*, *Qwen2-7B-Instruct*, *Llama-3.1-8B-Instruct*, *Qwen3-32B*, and *Mistral-Small-24B-Instruct-2501* have tall PRM800K-SC (Before/After) bars, suggesting larger self-correction blind spots for these datasets.
- **Low Blind Spots**: Most *SCLI5 (Wait)* bars are near 0, with some negative (e.g., *Qwen2.5-7B-Instruct*, *Qwen2-7B-Instruct*), indicating smaller blind spots for this dataset.
- **Dataset Trends**: *PRM800K-SC (Before/After)* generally has higher blind spots than *GSM8K-SC (Before/After)* and *SCLI5 (Wait)*, suggesting this dataset is more challenging for self-correction.

### Interpretation

The chart quantifies how well models self-correct errors across datasets. "Blind spot" likely measures the difference in performance before/after committing an answer (with "Wait" appended).

- **Dataset Impact**: *PRM800K-SC* (Before/After) consistently shows higher blind spots, implying this dataset is more difficult for self-correction.
- **Model Performance**: Larger models (e.g., *Qwen3-235B-A22B*, *Phi-4*) or specific architectures (e.g., *Llama-3.3-70B-Instruct*) have varying blind spots, with *Llama-3.3-70B-Instruct* as an outlier in *PRM800K-SC (After)*.
- **Uncertainty**: Wide error bars (e.g., *Llama-3.3-70B-Instruct*) indicate less reliable estimates, while narrow bars (e.g., *Qwen3-235B-A22B*) suggest more consistent results.

This data helps identify models/datasets with better (lower blind spot) or worse (higher blind spot) self-correction, guiding model selection or improvement efforts.

</details>

Figure 7: Self-Correction Blind Spot and 95% confidence interval across non-reasoning models after appending “Wait”

Table 8: Mean accuracy and relative change after appending various correction markers

| Correction Markers | SCLI5 | GSM8K-SC | PRM800K-SC |
| --- | --- | --- | --- |
| Internal Error (Baseline) | 0.499 (0%) | 0.183 (0%) | 0.200 (0%) |
| External Error | 0.910 (+82.5%) | 0.881 (+382.1%) | 0.620 (+210.3%) |
| “Wait” | 0.957 (+91.9%) | 0.796 (+335.1%) | 0.504 (+152.0%) |
| “But” | 0.922 (+85.0%) | 0.611 (+234.2%) | 0.430 (+114.8%) |
| “However” | 0.897 (+79.8%) | 0.602 (+229.0%) | 0.438 (+119.3%) |

<details>
<summary>images/correlation_plots_marker_presence_vs_accuracy_after_wait.png Details</summary>

![d1acdb3a](/v1/image/d1acdb3ab330cf495fa111a65d8bdeab5ae0aae23c68ed7499d30316c36e2c42)

### Visual Description

## Scatter Plot Series: Correlation Analysis of Correction Marker Presence vs. Accuracy Change

### Overview

The image displays a series of three scatter plots arranged horizontally. The overall title is "Correlation: change in correction marker presence vs change in accuracy after appending Wait". Each individual plot represents a different dataset or experimental condition, as indicated by its subtitle.
The plots visualize the relationship between the absolute change in the presence of a "correction marker" (x-axis) and the absolute change in accuracy (y-axis) following an intervention described as "appending Wait". A red dashed trend line and a calculated correlation coefficient are provided for each dataset. ### Components/Axes **Main Title:** "Correlation: change in correction marker presence vs change in accuracy after appending Wait" **Plot 1 (Left):** * **Subtitle:** "SCLI5" * **X-axis Label:** "Absolute Change in Correction Marker Presence" * **Scale:** Linear, ranging from approximately -0.4 to 0.2. Major tick marks at -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2. * **Y-axis Label:** "Absolute Change in Accuracy" * **Scale:** Linear, ranging from 0.0 to 1.0. Major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0. * **Correlation Annotation:** A beige box in the top-left corner contains the text "Correlation: 0.493". * **Visual Elements:** * Blue circular data points (n ≈ 15). * A red dashed trend line with a positive slope. * A vertical gray reference line at x=0. * A horizontal gray reference line at y=0. * A light gray grid. **Plot 2 (Center):** * **Subtitle:** "GSM8K_SC" * **X-axis Label:** "Absolute Change in Correction Marker Presence" * **Scale:** Linear, ranging from approximately -0.050 to 0.150. Major tick marks at -0.050, -0.025, 0.000, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150. * **Y-axis Label:** "Absolute Change in Accuracy" * **Scale:** Linear, ranging from 0.0 to 0.8. Major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8. * **Correlation Annotation:** A beige box in the top-left corner contains the text "Correlation: 0.734". * **Visual Elements:** * Blue circular data points (n ≈ 15). * A red dashed trend line with a positive slope, steeper than in the SCLI5 plot. * A vertical gray reference line at x=0. * A horizontal gray reference line at y=0. * A light gray grid. 
**Plot 3 (Right):** * **Subtitle:** "PRM800K_SC" * **X-axis Label:** "Absolute Change in Correction Marker Presence" * **Scale:** Linear, ranging from 0.00 to 0.35. Major tick marks at 0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35. * **Y-axis Label:** "Absolute Change in Accuracy" * **Scale:** Linear, ranging from 0.0 to 0.5. Major tick marks at 0.0, 0.1, 0.2, 0.3, 0.4, 0.5. * **Correlation Annotation:** A beige box in the top-left corner contains the text "Correlation: 0.797". * **Visual Elements:** * Blue circular data points (n ≈ 15). * A red dashed trend line with a positive slope, the steepest of the three plots. * A vertical gray reference line at x=0. * A horizontal gray reference line at y=0. * A light gray grid. ### Detailed Analysis **SCLI5 Plot:** * **Trend:** The red dashed line shows a clear positive slope, indicating that as the absolute change in correction marker presence increases, the absolute change in accuracy also tends to increase. * **Data Distribution:** Data points are scattered widely around the trend line. Several points show a large positive change in accuracy (>0.8) with a small positive change in marker presence (~0.0 to 0.05). One notable point shows a negative change in marker presence (~-0.38) with a small positive change in accuracy (~0.18). The correlation coefficient of 0.493 suggests a moderate positive linear relationship. **GSM8K_SC Plot:** * **Trend:** The red dashed line has a steeper positive slope than the SCLI5 plot. * **Data Distribution:** Data points are more tightly clustered around the trend line compared to SCLI5. Most points fall within a narrower x-axis range (-0.05 to 0.15). The correlation coefficient of 0.734 indicates a strong positive linear relationship. **PRM800K_SC Plot:** * **Trend:** The red dashed line has the steepest positive slope of the three. * **Data Distribution:** All data points are located in the positive quadrant (x>0, y>0). 
The points show a clear upward trend with relatively less scatter than SCLI5. The correlation coefficient of 0.797 indicates the strongest positive linear relationship among the three datasets. ### Key Observations 1. **Increasing Correlation Strength:** The correlation coefficient increases progressively from left to right: SCLI5 (0.493) < GSM8K_SC (0.734) < PRM800K_SC (0.797). This suggests the relationship between the change in correction marker presence and the change in accuracy becomes more consistent and predictable across these different conditions or datasets. 2. **Differing Data Ranges:** The scales of the axes differ significantly between plots. The SCLI5 plot includes negative changes in marker presence, while the PRM800K_SC plot shows only positive changes. The magnitude of accuracy change (y-axis) is highest in SCLI5 (up to 1.0) and lowest in PRM800K_SC (up to 0.5). 3. **Consistent Positive Relationship:** All three plots show a positive trend line and a positive correlation coefficient, indicating that an increase in the correction marker's presence is generally associated with an increase in accuracy after the "Wait" intervention. 4. **Reference Lines:** All plots include gray reference lines at x=0 and y=0, which help to visually anchor the data points and show whether changes are positive or negative. ### Interpretation The data suggests a positive causal or associative link between the intervention ("appending Wait") and two outcomes: an increase in the presence of a correction marker and an improvement in accuracy. The strength of this association varies by dataset. * **SCLI5:** The moderate correlation and wide scatter imply that while the trend exists, other factors likely have a significant influence on the outcome in this condition. The presence of points with negative marker change but positive accuracy change is an anomaly that warrants further investigation. 
* **GSM8K_SC & PRM800K_SC:** The strong to very strong correlations suggest that in these conditions, the change in correction marker presence is a reliable predictor of the change in accuracy. The steeper slopes indicate that a unit increase in marker presence is associated with a larger gain in accuracy for these datasets compared to SCLI5. * **Overall Implication:** The "Wait" intervention appears to be effective in promoting both self-correction (as measured by the marker) and final accuracy. The mechanism or context (represented by the different dataset names) significantly modulates the strength of this effect. The PRM800K_SC condition shows the most consistent and pronounced benefit. This analysis could inform which types of problems or models (e.g., those represented by PRM800K_SC) are most responsive to the "Wait" strategy. </details> Figure 8: Correlation of absolute change in keyword presence vs absolute change in accuracy - original vs appending “ Wait ” <details> <summary>images/blind_spot_summary_default_reasoning.png Details</summary> ![b42a25d7](/v1/image/b42a25d729d2f04a09329a5097ee409bdc879323dcf3ae934468cde09f5848a9) ### Visual Description ## Grouped Bar Chart with 95% Confidence Intervals: Blind Spot Summary Across Datasets ### Overview This is a grouped bar chart with vertical error bars (representing 95% confidence intervals) that quantifies the "Self-Correction Blind Spot" metric for 9 distinct large language models, evaluated across 5 dataset variants. The metric measures how self-correction (after committing to an initial answer) impacts model performance: positive values indicate self-correction worsens performance (a "blind spot"), while negative values indicate self-correction improves performance. ### Components/Axes - **Title**: "Blind Spot summary across datasets - 95% Confidence Intervals" (top-center of the chart) - **Y-axis**: Labeled *Self-Correction Blind Spot*, with a linear scale ranging from -0.4 to 0.4, in increments of 0.1. 
- **X-axis**: Labeled *Models*, listing 9 models (left to right): 1. DeepSeek-R1-0528 2. QwQ-32B 3. Qwen3-235B-A22B (thinking) 4. Qwen3-30B-A3B (thinking) 5. Qwen3-14B (thinking) 6. gemma-3-27b-it 7. Qwen3-32B (thinking) 8. gemma-3-12b-it 9. Phi-4-reasoning-plus - **Legend (top-right)**: 5 dataset variants, each with a unique color: 1. SCLI5 (dark blue) 2. GSM8K-SC (Before commit answer) (orange) 3. GSM8K-SC (After commit answer) (light beige) 4. PRM800K-SC (Before commit answer) (dark green) 5. PRM800K-SC (After commit answer) (light green) - **Error Bars**: Vertical bars extending above/below each data bar, representing the 95% confidence interval for each measurement. ### Detailed Analysis Below are approximate values (with uncertainty ranges from error bars) for each model-dataset pair: 1. **DeepSeek-R1-0528**: - SCLI5: ~-0.12 (range: -0.18 to -0.06) - GSM8K-SC (Before): ~-0.04 (range: -0.07 to -0.01) - GSM8K-SC (After): ~-0.02 (range: -0.05 to 0.01) - PRM800K-SC (Before): ~0.07 (range: 0.01 to 0.13) - PRM800K-SC (After): ~-0.10 (range: -0.19 to -0.01) 2. **QwQ-32B**: - SCLI5: ~-0.05 (range: -0.09 to -0.01) - GSM8K-SC (Before): ~-0.05 (range: -0.08 to -0.02) - GSM8K-SC (After): ~-0.05 (range: -0.08 to -0.02) - PRM800K-SC (Before): ~-0.22 (range: -0.33 to -0.11) - PRM800K-SC (After): ~-0.14 (range: -0.24 to -0.04) 3. **Qwen3-235B-A22B (thinking)**: - SCLI5: ~-0.01 (range: -0.07 to 0.05) - GSM8K-SC (Before): ~-0.04 (range: -0.07 to -0.01) - GSM8K-SC (After): ~-0.04 (range: -0.07 to -0.01) - PRM800K-SC (Before): ~-0.07 (range: -0.16 to 0.02) - PRM800K-SC (After): ~-0.19 (range: -0.30 to -0.08) 4. **Qwen3-30B-A3B (thinking)**: - SCLI5: ~0.14 (range: 0.09 to 0.19) - GSM8K-SC (Before): ~-0.01 (range: -0.04 to 0.02) - GSM8K-SC (After): ~-0.02 (range: -0.05 to 0.01) - PRM800K-SC (Before): ~-0.08 (range: -0.17 to 0.01) - PRM800K-SC (After): ~-0.16 (range: -0.27 to -0.05) 5. 
**Qwen3-14B (thinking)**: - SCLI5: ~0.13 (range: 0.09 to 0.17) - GSM8K-SC (Before): ~-0.02 (range: -0.05 to 0.01) - GSM8K-SC (After): ~-0.03 (range: -0.06 to 0.00) - PRM800K-SC (Before): ~-0.05 (range: -0.14 to 0.04) - PRM800K-SC (After): ~-0.16 (range: -0.26 to -0.06) 6. **gemma-3-27b-it**: - SCLI5: ~-0.07 (range: -0.14 to 0.00) - GSM8K-SC (Before): ~0.19 (range: 0.16 to 0.22) - GSM8K-SC (After): ~0.24 (range: 0.21 to 0.27) - PRM800K-SC (Before): ~0.07 (range: 0.01 to 0.13) - PRM800K-SC (After): ~0.15 (range: 0.08 to 0.22) 7. **Qwen3-32B (thinking)**: - SCLI5: ~0.19 (range: 0.14 to 0.24) - GSM8K-SC (Before): ~0.03 (range: 0.00 to 0.06) - GSM8K-SC (After): ~0.14 (range: 0.11 to 0.17) - PRM800K-SC (Before): ~-0.04 (range: -0.14 to 0.06) - PRM800K-SC (After): ~-0.02 (range: -0.13 to 0.09) 8. **gemma-3-12b-it**: - SCLI5: ~0.15 (range: 0.09 to 0.21) - GSM8K-SC (Before): ~0.15 (range: 0.12 to 0.18) - GSM8K-SC (After): ~0.21 (range: 0.18 to 0.24) - PRM800K-SC (Before): ~0.07 (range: 0.01 to 0.13) - PRM800K-SC (After): ~0.13 (range: 0.06 to 0.20) 9. **Phi-4-reasoning-plus**: - SCLI5: ~0.05 (range: -0.05 to 0.15) - GSM8K-SC (Before): ~-0.14 (range: -0.20 to -0.08) - GSM8K-SC (After): ~0.15 (range: 0.10 to 0.20) - PRM800K-SC (Before): ~-0.20 (range: -0.33 to -0.07) - PRM800K-SC (After): ~-0.23 (range: -0.39 to -0.07) ### Key Observations 1. **Dataset-specific trends**: - SCLI5: Most models have positive blind spots, except DeepSeek-R1-0528, QwQ-32B, Qwen3-235B-A22B (thinking), and gemma-3-27b-it. The highest positive blind spot is Qwen3-32B (thinking) (~0.19), lowest is DeepSeek-R1-0528 (~-0.12). - GSM8K-SC (Before vs After): For most models, the "After commit answer" blind spot is more positive (worse performance) than "Before", with the largest increase in gemma-3-27b-it (from ~0.19 to ~0.24). - PRM800K-SC (Before vs After): For most models, "After commit answer" blind spots are more negative (better performance) than "Before", except gemma-3-27b-it and gemma-3-12b-it. 2. 
**Model-specific trends**: - gemma-3-27b-it and gemma-3-12b-it have consistently positive blind spots across all datasets, with the highest values in GSM8K-SC (After). - Phi-4-reasoning-plus has the most extreme negative blind spots in PRM800K-SC (Before/After), but a positive blind spot in GSM8K-SC (After). 3. **Statistical uncertainty**: For many model-dataset pairs, error bars overlap between "Before" and "After" variants, indicating that observed differences may not be statistically significant. ### Interpretation The "Self-Correction Blind Spot" metric reveals how self-correction (post-initial answer commitment) impacts model performance: - Positive values indicate self-correction worsens performance, suggesting the model is overconfident in initial answers or has an ineffective self-correction mechanism. The gemma models show this pattern consistently, implying their self-correction does not improve (and may harm) performance across these datasets. - Negative values indicate self-correction improves performance, meaning the model can refine its initial answer effectively. Phi-4-reasoning-plus shows this strongly in PRM800K-SC, suggesting its self-correction works well for that dataset's reasoning tasks. - The difference between "Before" and "After" variants shows dataset structure impacts self-correction effectiveness: GSM8K-SC variants lead to worse self-correction performance for most models, while PRM800K-SC variants lead to better performance. This suggests the type of reasoning required (e.g., math vs. general reasoning) and answer commitment protocols shape how well models can correct their initial answers. - Overlapping error bars caution against overinterpreting small differences; larger sample sizes would be needed to confirm statistically significant trends. 
</details> Figure 9: Self-Correction Blind Spot and 95% confidence interval across reasoning models <details> <summary>images/error_injection_model_macro_averages_reasoning.png Details</summary> ![cc2c3245](/v1/image/cc2c3245eba3bf63d12a74347fc8463b51706edbe57def54c3e6bd65f13c7f89) ### Visual Description ## Bar Chart: Mean Accuracy and Macro Average After Internal Error Injection ### Overview This is a grouped bar chart displaying the mean accuracy (with 95% confidence intervals) of nine different language models across three specific benchmarks and a macro average. The chart evaluates model performance after the injection of an internal error. The overall trend shows a general decrease in accuracy from left to right across the listed models. ### Components/Axes * **Title:** "Mean accuracy and macro average (95% confidence intervals) after injection of internal error" * **Y-Axis:** Labeled "Accuracy". Scale ranges from 0.0 to 1.0, with major gridlines at 0.2 intervals. * **X-Axis:** Labeled "Models". Lists nine distinct model names, rotated for readability. * **Legend:** Positioned in the top-right corner. Contains four categories with corresponding colors: * **SCLI5:** Light blue bars. * **GSM8K-SC:** Light orange/peach bars. * **PRM800K-SC:** Light green bars. * **Macro Average:** Red bars. * **Data Representation:** Each model has a cluster of four bars (one for each legend category). Black error bars (I-beams) are present on each bar, representing the 95% confidence interval. ### Detailed Analysis **Model-by-Model Data Extraction (Approximate Values):** 1. **DeepSeek-R1-0528** * SCLI5: ~0.99 (Very high, small error bar) * GSM8K-SC: ~0.96 * PRM800K-SC: ~0.77 * Macro Average: **0.908** (Explicitly labeled) 2. **QwQ-32B** * SCLI5: ~0.97 * GSM8K-SC: ~0.95 * PRM800K-SC: ~0.77 * Macro Average: **0.894** (Explicitly labeled) 3. 
**Qwen3-235B-A22B (thinking)** * SCLI5: ~0.94 * GSM8K-SC: ~0.94 (Nearly equal to SCLI5) * PRM800K-SC: ~0.76 * Macro Average: **0.876** (Explicitly labeled) 4. **Qwen3-30B-A3B (thinking)** * SCLI5: ~0.84 * GSM8K-SC: ~0.92 (Noticeably higher than SCLI5) * PRM800K-SC: ~0.78 * Macro Average: **0.845** (Explicitly labeled) 5. **Qwen3-14B (thinking)** * SCLI5: ~0.85 * GSM8K-SC: ~0.94 (Highest in this cluster) * PRM800K-SC: ~0.74 * Macro Average: **0.843** (Explicitly labeled) 6. **gemma-3-27b-it** * SCLI5: ~0.88 (Highest in this cluster) * GSM8K-SC: ~0.78 * PRM800K-SC: ~0.79 * Macro Average: **0.815** (Explicitly labeled) 7. **Qwen3-32B (thinking)** * SCLI5: ~0.80 * GSM8K-SC: ~0.91 (Significantly higher than SCLI5) * PRM800K-SC: ~0.70 * Macro Average: **0.804** (Explicitly labeled) 8. **gemma-3-12b-it** * SCLI5: ~0.75 * GSM8K-SC: ~0.79 * PRM800K-SC: ~0.75 * Macro Average: **0.763** (Explicitly labeled) 9. **Phi-4-reasoning-plus** * SCLI5: ~0.73 * GSM8K-SC: ~0.71 * PRM800K-SC: ~0.68 (Lowest bar in the entire chart) * Macro Average: **0.707** (Explicitly labeled) ### Key Observations * **Benchmark Performance Hierarchy:** Across nearly all models, the **GSM8K-SC** (orange) and **SCLI5** (blue) benchmarks consistently yield higher accuracy scores than the **PRM800K-SC** (green) benchmark. PRM800K-SC is the lowest-performing benchmark for every model except `gemma-3-27b-it`, where it is marginally higher than GSM8K-SC. * **Model Performance Trend:** There is a clear downward trend in the **Macro Average** (red bars) from left to right, starting at 0.908 for `DeepSeek-R1-0528` and ending at 0.707 for `Phi-4-reasoning-plus`. This suggests the models are ordered roughly from highest to lowest overall performance under the tested condition. 
* **Variability in Benchmark Lead:** While SCLI5 is often the top or near-top performer (e.g., `DeepSeek-R1-0528`, `gemma-3-27b-it`), several "thinking" models from the Qwen3 series (`Qwen3-30B-A3B`, `Qwen3-14B`, `Qwen3-32B`) show a distinct pattern where **GSM8K-SC** accuracy significantly outperforms their SCLI5 accuracy. * **Confidence Intervals:** The error bars (95% CI) are generally wider for the PRM800K-SC benchmark compared to the others, indicating greater variability or uncertainty in the results for that specific task. The Macro Average bars have moderate error bars. ### Interpretation This chart provides a comparative analysis of model robustness or capability degradation following an internal error injection. The data suggests: 1. **Task-Dependent Resilience:** Models do not fail uniformly across different types of tasks. Their performance is highly dependent on the specific benchmark (SCLI5, GSM8K-SC, PRM800K-SC), indicating that the "internal error" impacts reasoning or knowledge recall in a task-specific manner. The consistently lower scores on PRM800K-SC might point to this benchmark being more sensitive to the injected error or testing a capability that is more fragile. 2. **Model Architecture/Training Implications:** The superior performance of models like `DeepSeek-R1-0528` and `QwQ-32B` on the macro average suggests they possess greater overall stability or error-correction mechanisms under the tested conditions. The reversal where GSM8K-SC outperforms SCLI5 in several Qwen3 models hints at a potential specialization or a different failure mode in their "thinking" variants. 3. **Benchmarking Insight:** The chart underscores the importance of using multiple, diverse benchmarks (like the three shown) to evaluate AI systems. Relying on a single metric (e.g., only GSM8K-SC) would paint an incomplete and potentially misleading picture of a model's post-error behavior. 
The macro average serves as a useful, but not sufficient, single summary statistic.

In essence, the visualization moves beyond simple accuracy reporting to reveal nuanced patterns of failure and resilience across different AI models and cognitive tasks when subjected to a controlled perturbation.

</details>

Figure 10: Summary of mean accuracy across reasoning models

## Appendix C Sensitivity analysis

### C.1 Results at different temperatures

Besides yielding each model's most confident prediction, a temperature of 0.0 is used for three reasons:

- More deterministic output eliminates sampling variance as a confounding factor. (A temperature of 0.0 does not produce fully deterministic results, owing to finite precision.)
- It enables standardized comparison across models with different temperature calibrations.
- Renze [2024] suggests that different temperatures do not have a statistically significant impact on LLM performance in problem-solving tasks.

We also report results at a temperature of 0.6 below; they do not change our conclusions.
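To make the role of temperature concrete, the following is a minimal sketch of temperature-scaled token sampling, not the decoding code used in the experiments. The function name `sample_token` and the toy logits are hypothetical. It treats a temperature of 0.0 as greedy (argmax) decoding, which is the idealized behavior; as noted above, real inference stacks can still show small nondeterminism from finite precision.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits at the given temperature.

    temperature == 0.0 is treated as greedy decoding (argmax), which is
    the idealized "most confident prediction" case.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [1.0, 3.0, 2.0]  # hypothetical scores for three tokens
rng = random.Random(0)

# Greedy decoding returns the same index on every call.
greedy_draws = {sample_token(toy_logits, 0.0, rng) for _ in range(100)}
# At temperature 0.6 lower-scoring tokens can also be drawn.
sampled_draws = [sample_token(toy_logits, 0.6, rng) for _ in range(100)]
print(greedy_draws, sorted(set(sampled_draws)))
```

Under this sketch, repeated runs at temperature 0.0 differ only through the model's logits, while runs at 0.6 add sampling variance on top, which is the confounder the first reason above refers to.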
Table 9: Mean accuracy and 95% confidence interval of models at temperature 0.6

| Model | SCLI5 | GSM8K-SC | PRM800K-SC |
| --- | --- | --- | --- |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | 0.954 ± 0.024 | 0.424 ± 0.027 | 0.469 ± 0.046 |
| DeepSeek-V3-0324 | 0.874 ± 0.039 | 0.42 ± 0.027 | 0.504 ± 0.046 |
| Qwen2.5-72B-Instruct | 0.902 ± 0.035 | 0.574 ± 0.027 | 0.165 ± 0.034 |
| Llama-4-Scout-17B-16E-Instruct | 0.976 ± 0.018 | 0.248 ± 0.023 | 0.272 ± 0.041 |
| Llama-3.3-70B-Instruct | 0.496 ± 0.058 | 0.273 ± 0.024 | 0.243 ± 0.04 |
| Qwen3-235B-A22B | 0.57 ± 0.057 | 0.091 ± 0.016 | 0.4 ± 0.045 |
| phi-4 | 0.794 ± 0.047 | 0.093 ± 0.016 | 0.116 ± 0.03 |
| Qwen2.5-7B-Instruct | 0.563 ± 0.058 | 0.183 ± 0.021 | 0.127 ± 0.031 |
| Qwen2-7B-Instruct | 0.601 ± 0.057 | 0.071 ± 0.014 | 0.065 ± 0.023 |
| Qwen3-14B | 0.007 ± 0.01 | 0.101 ± 0.016 | 0.27 ± 0.041 |
| Qwen3-30B-A3B | 0.108 ± 0.036 | 0.07 ± 0.014 | 0.232 ± 0.039 |
| Qwen3-32B | 0.038 ± 0.022 | 0.068 ± 0.014 | 0.105 ± 0.028 |
| Meta-Llama-3.1-8B-Instruct | 0.182 ± 0.045 | 0.025 ± 0.008 | 0.022 ± 0.014 |
| Mistral-Small-24B-Instruct-2501 | 0.122 ± 0.038 | 0.02 ± 0.008 | 0.038 ± 0.018 |

### C.2 PRM800K-SC results with a 4,096-token budget

To ensure a fair comparison between internal and external error correction, and across models, we maintain a fixed token budget of 1,024 across all conditions. This design choice partly isolates self-correction capabilities from the effect of test-time compute, providing a more rigorous test of the blind spot phenomenon. We also report PRM800K-SC results with a fixed token budget of 4,096 below; they do not change our conclusions. We do not report SCLI5 and GSM8K-SC results at this budget because the proportion of model responses exceeding 1,024 tokens is immaterial.
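The "mean ± half-width" entries reported for the temperature-0.6 results can be read as a mean over per-question 0/1 correctness scores together with a 95% confidence interval. The sketch below shows one common way to compute such an interval, the normal-approximation (Wald) interval for a proportion. This is an assumption for illustration: the source does not state which interval method was used, and the function name `mean_with_ci` and the toy outcomes are hypothetical.

```python
import math


def mean_with_ci(outcomes, z=1.96):
    """Return (mean, half_width) for a list of 0/1 correctness outcomes.

    Uses the normal-approximation interval p +/- z * sqrt(p * (1 - p) / n);
    z = 1.96 corresponds to a 95% confidence level.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one outcome is required")
    p = sum(outcomes) / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, half_width


# Hypothetical run: 80 of 100 questions answered correctly.
toy_outcomes = [1] * 80 + [0] * 20
mean, half_width = mean_with_ci(toy_outcomes)
print(f"{mean:.3f} ± {half_width:.3f}")
```

Because the half-width shrinks with the square root of the number of questions, datasets of different sizes give intervals of different widths even at similar accuracy.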
Table 10: Mean accuracy of models on PRM800K-SC at different compute budgets (1,024 and 4,096)

| Model | External Error (1,024) | External Error (4,096) | Internal Error (1,024) | Internal Error (4,096) | Appending “Wait” (1,024) | Appending “Wait” (4,096) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | 0.71 | 0.721 | 0.455 | 0.458 | 0.67 | 0.676 |
| DeepSeek-V3-0324 | 0.775 | 0.938 | 0.475 | 0.509 | 0.772 | 0.821 |
| Qwen2.5-72B-Instruct | 0.612 | 0.614 | 0.154 | 0.161 | 0.438 | 0.449 |
| Llama-4-Scout-17B-16E-Instruct | 0.58 | 0.578 | 0.263 | 0.257 | 0.545 | 0.542 |
| Llama-3.3-70B-Instruct | 0.359 | 0.366 | 0.246 | 0.257 | 0.46 | 0.469 |
| Qwen3-235B-A22B | 0.786 | 0.806 | 0.348 | 0.368 | 0.705 | 0.732 |
| phi-4 | 0.714 | 0.719 | 0.092 | 0.092 | 0.328 | 0.337 |
| Qwen2.5-7B-Instruct | 0.576 | 0.569 | 0.141 | 0.141 | 0.442 | 0.444 |
| Qwen2-7B-Instruct | 0.658 | 0.65 | 0.058 | 0.058 | 0.324 | 0.333 |
| Qwen3-14B | 0.705 | 0.743 | 0.254 | 0.268 | 0.696 | 0.746 |
| Qwen3-30B-A3B | 0.779 | 0.817 | 0.194 | 0.19 | 0.683 | 0.712 |
| Qwen3-32B | 0.754 | 0.781 | 0.083 | 0.085 | 0.527 | 0.522 |
| Meta-Llama-3.1-8B-Instruct | 0.181 | 0.183 | 0.02 | 0.02 | 0.194 | 0.203 |
| Mistral-Small-24B-Instruct-2501 | 0.496 | 0.498 | 0.016 | 0.016 | 0.27 | 0.277 |

## Appendix D Prompt

### D.1 Generating GSM8K-SC

Output schema:

    from typing import List

    from pydantic import BaseModel


    class ReasoningWithMistake(BaseModel):
        reasoning_steps_with_one_mistake: List[str]
        mistake_step: int
        type_of_mistake: str
        description_of_mistake: str
        incorrect_answer: str

System prompt:

    You are a helpful assistant that follow instructions. Output in JSON format.

Prompt:

    <question>{question}</question>
    <reasoning_steps>{reasoning_steps}</reasoning_steps>
    <answer>{answer}</answer>
    <type_of_mistake>{error_type}: {error_description}</type_of_mistake>

    You task is to introduce one mistake in step {mistake_step} in <reasoning_steps> and arrive at an answer different from <answer>.

    You will output:
    - <reasoning_steps> with mistake
    - the step that contains the mistake
    - type of the mistake
    - description of the mistake
    - incorrect answer

Figure 11: Output schema, system prompt and prompt for generating the GSM8K-SC dataset

Output schema:

    from pydantic import BaseModel


    class Calculation(BaseModel):
        incorrect_answer: float

System prompt:

    You are a helpful assistant that follow instructions. Output in JSON format.

Prompt:

    <question>{question}</question>
    <description_of_mistake>{description_of_mistake}</description_of_mistake>
    <incorrect_reasoning>{incorrect_reasoning}</incorrect_reasoning>

    You are given a question, a description of the mistake, and the resulting incorrect reasoning. Your task is to follow the incorrect reasoning to arrive at the incorrect answer.

Figure 12: Output schema, system prompt and prompt for validating the GSM8K-SC dataset

### D.2 Automatic Evaluation

Output schema:

    from pydantic import BaseModel


    class Evaluation(BaseModel):
        is_correct_answer: bool
        is_correct_reasoning: bool
        is_backtracking: bool
        is_admitting_mistake: bool

System prompt:

    You are a helpful assistant that follow instructions. Output in JSON format.

Figure 13: Output schema and system prompt for automatic evaluation of error in model

Prompt (SCLI5):

    # FOR SCLI5
    <question>{question}</question>
    <golden_answer>{golden_answer}</golden_answer>
    <given_wrong_answer>{given_wrong_answer}</given_wrong_answer>
    <completion_from_model>{completion_from_model}</completion_from_model>

    The model was provided a wrong answer <given_wrong_answer>, and is required to self correct and arrive at the golden answer. You have to assess if <completion_from_model>:
    - has backtracked
    - has finally provided the answer that matches the <golden_answer>
    - has provided correct reasoning
    - has admitted the mistake.

Prompt (GSM8K-SC and PRM800K-SC):

    # FOR GSM8K-SC AND PRM800K-SC
    <question>{question}</question>
    <golden_answer>{golden_answer}</golden_answer>
    <given_wrong_reasoning>{given_wrong_reasoning}</given_wrong_reasoning>
    <completion_from_model>{completion_from_model}</completion_from_model>

    The model was provided a wrong step in the reasoning process <given_wrong_reasoning>, and is required to self correct and arrive at the golden answer. You have to assess if <completion_from_model>:
    - has backtracked
    - has finally provided the answer that matches the <golden_answer>
    - has provided correct reasoning
    - has admitted the mistake.

Figure 14: Prompt for automatic evaluation of error in model

Output schema:

    from pydantic import BaseModel


    class EvaluationErrorInUser(BaseModel):
        is_correct_answer: bool
        is_correct_critique: bool

System prompt:

    You are a helpful assistant that follow instructions. Output in JSON format.

Figure 15: Output schema and system prompt for automatic evaluation of error in user

Prompt (SCLI5):

    # FOR SCLI5
    <question_and_user_answer>{question_and_user_answer}</question_and_user_answer>
    <golden_answer>{golden_answer}</golden_answer>
    <response_from_model>{response_from_model}</response_from_model>

    The model was provided with <question_and_user_answer> from user. You have to assess if <response_from_model>:
    - contains correct answer that matches the <golden_answer>
    - contains correct critique of <question_and_user_answer>

Prompt (GSM8K-SC and PRM800K-SC):

    # FOR GSM8K-SC AND PRM800K-SC
    <question_and_user_reasoning>{question_and_user_reasoning}</question_and_user_reasoning>
    <golden_answer>{golden_answer}</golden_answer>
    <response_from_model>{response_from_model}</response_from_model>

    The model was provided with <question_and_user_reasoning> from user. You have to assess if <response_from_model>:
    - contains correct answer that matches the <golden_answer>
    - contains correct critique of <question_and_user_reasoning>

Figure 16: Prompt for automatic evaluation of error in user
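As an illustration of how the automatic evaluation above might be wired together, the following minimal sketch fills the GSM8K-SC / PRM800K-SC "error in model" prompt, parses a judge reply against the four boolean fields of the `Evaluation` schema, and averages `is_correct_answer` into a mean accuracy. It is not the paper's evaluation code. Several pieces are assumptions: a standard-library dataclass stands in for the pydantic model so that the sketch has no third-party dependency, the helper names `parse_evaluation` and `mean_accuracy` are hypothetical, the tag spacing and line breaks of the template are normalized, the judge-model call is omitted, and the sample question and replies are made up.

```python
import json
from dataclasses import dataclass, fields
from typing import Iterable

# Template following the GSM8K-SC / PRM800K-SC prompt of Figure 14.
# Assumption: tag spacing and line breaks are normalized for this sketch.
EVAL_PROMPT = (
    "<question>{question}</question>\n"
    "<golden_answer>{golden_answer}</golden_answer>\n"
    "<given_wrong_reasoning>{given_wrong_reasoning}</given_wrong_reasoning>\n"
    "<completion_from_model>{completion_from_model}</completion_from_model>\n"
    "The model was provided a wrong step in the reasoning process "
    "<given_wrong_reasoning>, and is required to self correct and arrive at "
    "the golden answer. You have to assess if <completion_from_model>:\n"
    "- has backtracked\n"
    "- has finally provided the answer that matches the <golden_answer>\n"
    "- has provided correct reasoning\n"
    "- has admitted the mistake."
)


@dataclass(frozen=True)
class Evaluation:
    """Standard-library stand-in (assumption) for the pydantic Evaluation schema."""

    is_correct_answer: bool
    is_correct_reasoning: bool
    is_backtracking: bool
    is_admitting_mistake: bool


def parse_evaluation(raw_json: str) -> Evaluation:
    """Parse one judge reply; reject missing, extra, or non-boolean fields."""
    data = json.loads(raw_json)
    expected = {f.name for f in fields(Evaluation)}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError(f"expected exactly the fields {sorted(expected)}")
    for name, value in data.items():
        if not isinstance(value, bool):
            raise ValueError(f"field {name!r} must be a JSON boolean")
    return Evaluation(**data)


def mean_accuracy(evaluations: Iterable[Evaluation]) -> float:
    """Fraction of evaluations whose final answer matched the golden answer."""
    items = list(evaluations)
    if not items:
        raise ValueError("no evaluations to aggregate")
    return sum(e.is_correct_answer for e in items) / len(items)


if __name__ == "__main__":
    # Hypothetical sample; sending the prompt to a judge model is omitted here.
    prompt = EVAL_PROMPT.format(
        question="What is 2 + 3?",
        golden_answer="5",
        given_wrong_reasoning="2 + 3 = 6",
        completion_from_model="Wait, 2 + 3 is 5, not 6.",
    )
    # Made-up judge replies, used only to exercise the parser.
    replies = [
        '{"is_correct_answer": true, "is_correct_reasoning": true, '
        '"is_backtracking": true, "is_admitting_mistake": true}',
        '{"is_correct_answer": false, "is_correct_reasoning": false, '
        '"is_backtracking": false, "is_admitting_mistake": false}',
    ]
    print(prompt)
    print(mean_accuracy(parse_evaluation(r) for r in replies))
```

Rejecting replies with missing or non-boolean fields, rather than coercing them, keeps a malformed judge output from being silently counted as a correct or incorrect answer. A pydantic model such as the one in Figure 13 would provide comparable validation.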
