# Finding the Cracks: Improving LLM Reasoning with Paraphrastic Probing and Consistency Verification
**Authors**: Weili Shi, Dongliang Guo, Lehan Yang, Tianlong Wang, Hanzhang Yuan, Sheng Li
## Abstract
Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within intermediate steps. Recent work has introduced the notion of critical tokens, tokens in the reasoning process that exert significant influence on subsequent steps. Prior studies suggest that replacing critical tokens can refine reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. We then identify critical tokens based on mismatches between the predicted top-1 token and the expected token in the reasoning path, and apply a criterion to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments demonstrate that PPCV substantially enhances the reasoning performance of LLMs compared to baselines.
Machine Learning, ICML
<details>
<summary>images/compare_ct_sc_gsm8k.png Details</summary>

### Visual Description
## Line Chart: pass@k(%) vs. Number of Samples (k)
### Overview
This is a line chart comparing the performance of two methods, "critical tokens" and "self-consistency," as a function of the number of samples (k). The chart plots the pass@k metric (in percentage) against k, showing how performance scales with increased sampling.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "number of sample k". The scale is linear, with major tick marks at 10, 20, 30, 40, and 50. Data points are plotted at approximate k values of 5, 8, 16, 24, 32, and 48.
* **Y-Axis (Vertical):** Labeled "pass@k(%)". The scale is linear, ranging from 70.0 to 90.0, with major tick marks every 2.5 units (70.0, 72.5, 75.0, ..., 90.0).
* **Legend:** Located in the bottom-right corner of the chart area.
* **Red line with upward-pointing triangle markers:** Labeled "critical tokens".
* **Purple (magenta) line with star markers:** Labeled "self-consistency".
* **Grid:** A dashed gray grid is present, aligned with the major tick marks on both axes.
### Detailed Analysis
**Trend Verification:**
* **"critical tokens" (Red line, triangles):** The line shows a clear upward trend, starting steeply and then gradually flattening (diminishing returns). It is consistently positioned above the "self-consistency" line.
* **"self-consistency" (Purple line, stars):** This line also shows a clear upward trend with a similar shape—steep initial rise followed by a more gradual increase. It is consistently below the "critical tokens" line.
**Data Points (Approximate Values):**
* **k ≈ 5:**
* critical tokens: ~77.0%
* self-consistency: ~70.5%
* **k ≈ 8:**
* critical tokens: ~82.5%
* self-consistency: ~77.0%
* **k ≈ 16:**
* critical tokens: ~85.2%
* self-consistency: ~80.1%
* **k ≈ 24:**
* critical tokens: ~86.7%
* self-consistency: ~82.8%
* **k ≈ 32:**
* critical tokens: ~87.7%
* self-consistency: ~83.6%
* **k ≈ 48:**
* critical tokens: ~89.5%
* self-consistency: ~84.6%
### Key Observations
1. **Performance Gap:** The "critical tokens" method maintains a consistent performance advantage over "self-consistency" across all sampled values of k. The gap is approximately 6-7 percentage points at k=5 and narrows slightly to about 5 percentage points at k=48.
2. **Diminishing Returns:** Both methods exhibit diminishing returns. The most significant performance gains occur when increasing k from 5 to 16. The rate of improvement slows considerably for k > 24.
3. **Similar Growth Pattern:** Despite the absolute performance difference, both methods follow a nearly identical logarithmic-like growth curve, suggesting they scale similarly with increased computational effort (more samples).
### Interpretation
The chart demonstrates that for the evaluated task, using a "critical tokens" strategy yields a higher pass@k success rate than a "self-consistency" strategy for any given number of samples (k). This suggests that identifying and focusing on "critical tokens" is a more effective method for improving solution accuracy than relying on the self-consistency of multiple sampled outputs.
The data implies a trade-off between performance and computational cost. While increasing k improves results for both methods, the most cost-effective gains are achieved at lower k values (e.g., moving from 5 to 16 samples). The persistent gap between the lines indicates that the "critical tokens" method is fundamentally more efficient, achieving a given performance level with fewer samples than "self-consistency." For instance, "critical tokens" at k≈8 achieves a similar pass@k rate (~82.5%) as "self-consistency" at k≈24 (~82.8%).
</details>
Figure 1: Comparison of the effects of critical tokens and Self-Consistency on the reasoning performance of LLMs, evaluated on samples from the GSM8K training data.
## 1 Introduction
The emergence of large language models (LLMs) (Brown et al., 2020; Grattafiori et al., 2024; Achiam et al., 2023; Yang et al., 2025a) has astonished the AI community with their remarkable capabilities across a wide range of reasoning tasks, including mathematical problem solving, programming, and planning. By generating intermediate reasoning steps through techniques such as chain-of-thought prompting (Wei et al., 2022; Kojima et al., 2022; Zhang et al., 2022; Sprague et al., 2025), LLMs can emulate human-like reasoning processes and achieve strong performance on diverse reasoning benchmarks. Despite their success, the problem-solving ability of LLMs often declines on complex reasoning tasks due to hallucinations and the accumulation of errors in intermediate steps (Lightman et al., 2023; Ling et al., 2023; Bubeck et al., 2023). To mitigate this issue, prior works (Wang et al., 2024; Yuan et al., 2025; Chen et al., 2024a; Chen and Li, 2024) have proposed the inference-time optimization paradigm, which evaluates and refines reasoning trajectories through self-correction by the LLM itself or with feedback from external verifiers such as process reward models (PRMs) (Uesato et al., 2022). However, existing studies (Yang et al., 2025b; Tyen et al., 2024; Stechly et al., 2025) suggest that LLMs struggle to reliably identify errors in their reasoning and often fail to correct previous outputs without external guidance. Moreover, obtaining high-quality, stepwise process supervision for training external verifiers is challenging and limits the practicality of these approaches (Feng et al., 2025).
<details>
<summary>images/para_example.png Details</summary>

### Visual Description
## Case Study on the Effects of Question Paraphrasing
### Overview
The image presents a side-by-side comparison of two mathematical word problems and their step-by-step solutions. The purpose is to demonstrate how paraphrasing a question can affect the reasoning process and final answer. The left column shows an "Original question" and an incorrect response, while the right column shows a "Paraphrased question" and a correct response. The layout is a two-column table with a dark blue header.
### Components/Axes
* **Header:** A dark blue bar at the top with white text: "Case study on the effects of question paraphrasing".
* **Layout:** Two vertical columns of equal width.
* **Left Column (Original):**
* **Label:** "Original question:" (bold)
* **Question Text:** A word problem about a card game.
* **Ground Truth:** "(ground truth answer: 8)"
* **Response Label:** "Response:" (bold)
* **Solution Steps:** A two-step solution. Key text in **red** indicates an error.
* **Final Answer:** "Therefore, Jane played 4 rounds." followed by a black circle with a white "X" (✗).
* **Right Column (Paraphrased):**
* **Label:** "Paraphrased question:" (bold)
* **Question Text:** A reworded version of the same problem, framed as a bridge competition.
* **Ground Truth:** "(ground truth answer: 8)"
* **Response Label:** "Response:" (bold)
* **Solution Steps:** A three-step solution. Key text in **purple** indicates the correct logical adjustment.
* **Final Answer:** "The number of games she won is equal to 8." followed by a black circle with a white checkmark (✓).
### Detailed Analysis
**1. Original Question & Incorrect Response:**
* **Question:** "In a card game, you are awarded 10 points if you win one round. While playing, Jane ends up with 60 points. If she lost 20 points, then how many rounds did she play? (ground truth answer: 8)"
* **Response Steps:**
* **Step 1 (Error Highlighted in Red):** "Determine the total points Jane earned **by subtracting the points she lost from her final score. Total points earned = Final score - Points lost. Total points earned = 60 – 20. Total points earned = 40.**"
* **Step 2:** "Since Jane earns 10 points for each round she wins, we can find the number of rounds she played by dividing the total points earned by the points earned per round. Number of rounds = Total points earned / Points earned per round. Number of rounds = 40 / 10. Number of rounds = 4."
* **Conclusion:** "Therefore, Jane played 4 rounds. ✗"
**2. Paraphrased Question & Correct Response:**
* **Question:** "At a bridge competition, every game won grants the player 10 points. Unfortunately, Jane started with an advantage but ended up losing 20 points and scoring only 60 points. If she had won x rounds and lost none, what would be the value of x? (ground truth answer: 8)"
* **Response Steps:**
* **Step 1:** "Understand the given information. Jane started with an advantage but ended up losing 20 points. She scored a total of 60 points. Every game won grants the player 10 points."
* **Step 2 (Correct Logic in Purple):** "Determine the total points Jane would have had if she hadn't lost any points. **Since she lost 20 points, we need to add 20 to her total score to find out how many points she would have had if she hadn't lost any points. Total points without loss = 60 + 20 = 80 points.**"
* **Step 3:** "Calculate the number of games Jane won. Since every game won grants the player 10 points, we can divide the total points without loss by 10 to find the number of games she won. Number of games won = Total points without loss / 10 = 8. The number of games she won is equal to 8. ✓"
### Key Observations
1. **Identical Core Problem:** Both questions describe the same scenario: a final score of 60 points, a loss of 20 points, and a reward of 10 points per win. The ground truth answer for both is 8.
2. **Critical Reasoning Error:** The response to the original question makes a fundamental algebraic error. It incorrectly interprets "lost 20 points" as a deduction from the final score to find "points earned," rather than understanding it as a penalty applied to a base score from wins.
3. **Paraphrasing Corrects the Error:** The paraphrased question, by explicitly stating "If she had won x rounds and lost none," forces the solver to consider a hypothetical scenario without losses. This framing leads the solver to correctly add the 20 lost points back to the final score (60 + 20 = 80) to find the total points from wins alone.
4. **Visual Cues:** The use of **red text** highlights the erroneous step in the first solution. The use of **purple text** highlights the corrective, logical step in the second solution. The ✗ and ✓ symbols provide immediate visual feedback on the correctness of the final answer.
### Interpretation
This case study demonstrates a key principle in problem-solving and AI reasoning: **the phrasing of a question can significantly influence the cognitive path taken to solve it.**
* **The Original Question's Flaw:** Its wording ("If she lost 20 points, then how many rounds did she play?") is ambiguous. It can be misinterpreted as asking for the number of rounds *played* (which could include losses), not just rounds *won*. The solver fell into the trap of treating the 20-point loss as a simple subtraction from the final tally, a common but incorrect approach to such problems.
* **The Paraphrased Question's Strength:** It removes ambiguity by specifying the goal: find `x`, the number of games won *assuming no losses*. This precise framing guides the solver toward the correct algebraic model: `10x - 20 = 60`, which solves to `x = 8`. The paraphrase acts as a "scaffold," making the underlying mathematical structure more transparent.
* **Broader Implication:** For AI systems and human learners alike, this highlights the importance of precise language in prompts and problem statements. A well-constructed paraphrase can prevent logical errors by clarifying intent and constraining the solution space to the correct interpretation. The study suggests that evaluating an AI's reasoning capability should consider not just its final answer, but its robustness to different, yet semantically equivalent, formulations of the same problem.
</details>
Figure 2: Case study illustrating that LLMs can refine intermediate incorrect reasoning segments (highlighted in red).
Recent work has introduced the concept of critical tokens (Lin et al., 2025), which play pivotal roles in intermediate reasoning steps and exert strong influence on subsequent reasoning and final outcomes. We evaluate the criticality of a token based on the average accuracy of sampling rollouts generated from the partial reasoning trajectories truncated at the token’s position. A token is identified as critical if it lies on an incorrect trajectory and satisfies the following conditions:
- The correctness score of rollouts initiated from the token is close to zero (e.g., below 5%);
- The correctness scores of rollouts initiated from subsequent tokens remain below a threshold (e.g., 10%).
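The two conditions above amount to a simple scan over per-position rollout correctness scores. The sketch below illustrates this criterion; the function name and the exact thresholds (5% and 10%) are taken from the examples in the text, and `scores[i]` is assumed to be the fraction of correct answers among rollouts continued from position `i` of an incorrect trajectory:

```python
def find_critical_token(scores, low=0.05, high=0.10):
    """Return the index of the first position whose rollout correctness
    score is near zero (< low) and whose subsequent positions all stay
    below `high`; return None if no position qualifies."""
    for i in range(len(scores)):
        if scores[i] < low and all(s < high for s in scores[i:]):
            return i
    return None

# Toy example: correctness collapses at position 3 and never recovers.
scores = [0.60, 0.40, 0.25, 0.02, 0.04, 0.08]
print(find_critical_token(scores))  # -> 3
```

Positions before the critical token still admit recoverable rollouts; from the critical token onward, almost every continuation fails, which is what makes that token a pivot point.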
Prior studies suggest that critical tokens often diverge from human-annotated error tokens, yet they induce more sustained degradation in reasoning quality than other tokens. Moreover, as illustrated in Figure 3, replacing critical tokens in an incorrect reasoning trajectory with suitable candidate tokens can correct subsequent steps and lead to the right answer in new rollouts. To quantitatively assess the effectiveness of critical tokens, we conduct an empirical study using LLMs such as Llama-3.1-8B-Instruct (Grattafiori et al., 2024) on reasoning tasks. Specifically, we randomly sample 100 instances with incorrect reasoning steps from the GSM8K (Cobbe et al., 2021) training data. Following the criterion above, we locate critical tokens through exhaustive search. We then truncate the reasoning path at the critical token, substitute it with alternative tokens, and roll out new reasoning paths. For example, as shown in Figure 3, the token “woman” is replaced with “remaining”. We evaluate performance using Pass@k and compare against Self-Consistency (Wang et al., 2023), which also samples multiple reasoning paths. As shown in Figure 1, critical-token replacement provides a clear advantage in correcting errors compared to plain sampling. Nonetheless, reliably identifying and leveraging critical tokens for reasoning remains a nontrivial challenge.
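The Pass@k numbers in Figure 1 can be computed with the standard unbiased estimator (the paper does not state which estimator it uses, so this is an assumption): given `n` rollouts of which `c` are correct, Pass@k is the probability that at least one of `k` rollouts drawn without replacement is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn without replacement from n rollouts is correct,
    given that c of the n rollouts are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct rollout
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=50, c=10, k=5), 3))  # -> 0.689
```

Averaging this quantity over the 100 sampled GSM8K instances yields curves of the shape shown in Figure 1.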
<details>
<summary>images/critical_token_example.png Details</summary>

### Visual Description
## Case Study: Critical Token Replacement in AI Problem-Solving
### Overview
The image presents a technical case study titled "Case study on critical tokens." It demonstrates how the replacement of a single, critical word in an AI's reasoning process can correct a mathematical error and lead to the accurate solution of a word problem. The content is structured as a comparison between an initial, incorrect response and a subsequent, corrected response.
### Components
The image is a text-based document with the following structural components:
1. **Header:** A blue banner at the top containing the title "Case study on critical tokens".
2. **Question Block:** Presents the original math word problem and the ground truth answer.
3. **First Response Block:** Shows an initial, flawed reasoning process leading to an incorrect answer. It includes a visual annotation (red text) and a failure icon (❌).
4. **Second Response Block:** Shows a corrected reasoning process, labeled as "Response(replacement of critical tokens):". It includes a visual annotation (purple text) and a success icon (✅).
### Detailed Analysis / Content Details
**Question:**
* **Text:** "There were 10000 spectators at a basketball match. 7000 of them were men. Of the remaining spectators, there were 5 times as many children as women. How many children were there? (ground truth answer: 2500)"
**First Response (Incorrect):**
* **Reasoning:** The response attempts to find the number of women first, then multiply by 5.
* **Step 1:** "Find the number of women. Total spectators = 10000. Men = 7000. Women = Total spectators – Men. Women = 10000 – 7000. Women = 3000."
* **Step 2:** "Find the number of children. There are 5 times as many children as women. Children = 5 * Women. Children = 5 * 3000. Children = 15000."
* **Conclusion:** "So, the number of children is 15000. ❌"
* **Annotation:** The word "**women**" in the introductory sentence is highlighted in **red**.
**Second Response (Correct - "replacement of critical tokens"):**
* **Reasoning:** The response correctly identifies the need to work with the "remaining spectators" first.
* **Step 1:** "Find the number of remaining spectators (excluding men). Total spectators = 10000. Men = 7000. Remaining spectators = Total spectators – Men = 10000 - 7000= 3000"
* **Step 2:** "Let's assume the number of women is 'x'. Since there are 5 times as many children as women, the number of children will be 5x."
* **Step 3:** "The total number of remaining spectators is the sum of women and children. We can write this as an equation: x + 5x = 3000. Combine like terms:
6x = 3000. Divide both sides by 6: x = 3000 / 6. x = 500."
* **Step 4:** "Now that we have the number of women, we can find the number of children. Number of children = 5x= 5 * 500= 2500."
* **Conclusion:** "Therefore, the number of children is 2500. ✅"
* **Annotation:** The word "**remaining**" in the introductory sentence is highlighted in **purple**.
### Key Observations
1. **Critical Error Source:** The first response's fundamental error was misinterpreting the phrase "Of the remaining spectators." It incorrectly assumed the 3000 non-men were *all women*, then calculated children as 5 times that total (15000), which is logically impossible as it exceeds the total spectator count.
2. **Token Replacement:** The correction is explicitly framed as a "replacement of critical tokens." The key change is substituting the focus from "**women**" (red) to "**remaining**" (purple). This shifts the reasoning to correctly define the subset (remaining spectators = 3000) before applying the ratio.
3. **Methodological Shift:** The incorrect response uses a direct, sequential calculation. The correct response employs algebraic reasoning (setting up an equation with a variable 'x') to solve for the unknown quantities within the constrained total.
### Interpretation
This case study is a pedagogical tool illustrating a specific failure mode and correction strategy in AI reasoning, particularly for multi-step word problems.
* **What it demonstrates:** It shows that an AI's reasoning path is highly sensitive to the precise interpretation of relational phrases ("of the remaining"). A misinterpretation at an early step propagates and amplifies into a grossly incorrect final answer.
* **The "Critical Token" Concept:** The highlighted words ("women" vs. "remaining") are presented as "critical tokens." This suggests that in the model's internal processing, certain words carry disproportionate weight in determining the logical structure of the solution. Replacing or correctly attending to these tokens can redirect the entire reasoning chain.
* **Why it matters:** For AI development and debugging, this highlights the importance of:
* **Precision in Language Understanding:** Models must accurately parse nested conditions and subset relationships.
* **Error Analysis:** Tracing errors back to specific misinterpretations of key terms is crucial for improvement.
* **Reasoning Robustness:** The shift to an algebraic method in the correct response is more robust, as it explicitly defines variables and constraints, reducing the chance of logical leaps.
The study implies that enhancing an AI's performance may involve not just more data, but targeted interventions on how it processes and weights specific, semantically critical words within a problem statement.
</details>
Figure 3: An example demonstrating how substitution of a critical token (red) with a candidate token (purple) modifies the reasoning path and produces the correct answer.
Recent studies (Zhou et al., 2024; Chen et al., 2024b) on surface form, that is, the way questions, assumptions, and constraints are phrased, have revealed its subtle influence on the trajectory of intermediate reasoning steps. As shown in Figure 2, LLMs can adjust their intermediate steps under a paraphrased form of the question. This motivates us to explore the role of paraphrasing in the extraction and utilization of critical tokens for reasoning tasks. To this end, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework, a two-stage approach designed to leverage critical tokens to enhance the reasoning ability of LLMs. In the first stage, we probe critical tokens using paraphrased questions. Specifically, we first roll out the initial reasoning path from the original question, then concatenate paraphrased questions with this reasoning path. The resulting synthetic inputs are fed into the LLM to obtain token-level logits for each position in the reasoning path. Positions where the predicted top-1 token diverges from the expected token are flagged as potential pivotal points, as these positions are sensitive to paraphrased inputs and can trigger a pivot in the reasoning trajectory. Next, an empirical criterion is applied to determine the final critical token. In contrast to prior work (Lin et al., 2025), which depends on external models for identifying critical tokens with ambiguous criteria, our method introduces a self-contained mechanism that pinpoints critical tokens.
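At its core, the probing step is a teacher-forced comparison: feed the concatenation [paraphrased question; initial reasoning path] through the model and, at each path position, compare the top-1 prediction with the token actually present. The minimal sketch below assumes the per-position top-1 predictions have already been extracted from the logits (obtaining them requires a model forward pass, which is omitted here), and uses word-level "tokens" for readability:

```python
def flag_pivot_positions(path_tokens, probed_top1):
    """Return the positions in the initial reasoning path where the
    model, conditioned on a paraphrased question, predicts a different
    top-1 token than the one actually in the path."""
    return [i for i, (expected, predicted)
            in enumerate(zip(path_tokens, probed_top1))
            if expected != predicted]

# Toy example: under the paraphrase, the model prefers "remaining"
# where the original path contains "women" (cf. Figure 3).
path   = ["the", "number", "of", "women", "is"]
probed = ["the", "number", "of", "remaining", "is"]
print(flag_pivot_positions(path, probed))  # -> [3]
```

The flagged positions are only candidates; the empirical criterion described above is then applied to select the final critical token among them.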
In the second stage, we leverage the extracted critical tokens to refine the initial reasoning path. Specifically, we select the top-K tokens (including the critical token itself) at the critical token position and roll out new reasoning paths for both the original and paraphrased questions. Based on the empirical observation that trajectories leading to correct answers are robust to paraphrastic perturbations, we propose a paraphrase consistency mechanism. In contrast to Self-Consistency (Wang et al., 2023), which relies on majority voting, our method selects the final answer by comparing outputs from paraphrased and original questions and choosing the one with the most consistent matches. To address potential ties across multiple answers, we further introduce similarity-weighted paraphrase consistency, which incorporates similarity scores between paraphrased and original questions when computing consistency.
Compared with self-correction (Wu et al., 2024; Miao et al., 2024) and PRM-based methods (Wang et al., 2024; Yuan et al., 2025), our framework exploits critical tokens to refine reasoning trajectories without requiring step-level error detection by the LLM itself or auxiliary models. We evaluate our method on mainstream LLMs across mathematical and commonsense reasoning benchmarks, demonstrating consistent improvements in reasoning performance. The contributions of this paper are summarized as follows:
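One plausible reading of the paraphrase consistency mechanism is sketched below; the exact matching and weighting scheme is not fully specified in this section, so the cross-counting of answer agreements and the optional per-paraphrase similarity weights `sims` are assumptions for illustration:

```python
from collections import defaultdict

def paraphrase_consistency(orig_answers, para_answers, sims=None):
    """Select the final answer by scoring, for each candidate answer,
    how often rollouts from the original question agree with rollouts
    from the paraphrased questions. If `sims` is given, sims[j] weights
    the j-th paraphrase rollout by its similarity to the original
    question (the similarity-weighted tie-breaking variant)."""
    scores = defaultdict(float)
    for a in orig_answers:
        for j, b in enumerate(para_answers):
            if a == b:
                scores[a] += sims[j] if sims else 1.0
    return max(scores, key=scores.get) if scores else None

orig = ["10", "10", "5"]
para = ["10", "5", "10"]
print(paraphrase_consistency(orig, para))                    # -> 10
print(paraphrase_consistency(orig, para, [0.9, 0.8, 0.95]))  # -> 10
```

Unlike plain majority voting over a single pool of samples, agreement here must hold across the original/paraphrased boundary, which operationalizes the observation that correct trajectories are robust to paraphrastic perturbation.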
- We propose a novel two-stage framework, Paraphrastic Probing and Consistency Verification (PPCV), designed to extract and leverage critical tokens to enhance the reasoning performance of LLMs.
- We show that critical tokens can more effectively correct erroneous reasoning trajectories than traditional sampling methods like Self-Consistency. Furthermore, our approach successfully extracts these tokens through paraphrastic probing, achieving improved final results via paraphrase consistency.
- We evaluate our method on mainstream LLMs across various reasoning tasks, including math and logic reasoning. Experimental results show significant performance improvements over baseline methods.
## 2 Related Work
Inference-Time Optimization for LLM reasoning. With the advent of chain-of-thought (CoT) prompting, LLMs have demonstrated strong reasoning capabilities by producing intermediate steps during inference. This success has motivated a growing body of work (Wu et al., 2025; Snell et al., 2024) on augmenting reasoning trajectories at test time to further improve performance. Existing approaches can be broadly categorized into search-based methods (Bi et al., 2025; Yao et al., 2023; Hao et al., 2023; Xie et al., 2023; Besta et al., 2024), such as Tree-of-Thoughts (Yao et al., 2023), and sampling-based methods (Wang et al., 2023; Xu et al., 2025; Wan et al., 2025; Ma et al., 2025), such as Self-Consistency (Wang et al., 2023). However, due to intrinsic hallucinations (Bubeck et al., 2023), LLMs often generate erroneous intermediate steps, which can ultimately lead to incorrect answers, especially on complex problems. This limitation highlights the need for inference-time optimization of reasoning processes.
To address this issue, one line of research (Yin et al., 2024; Chen et al., 2024a; Ling et al., 2023; Wu et al., 2024; Miao et al., 2024; Madaan et al., 2023) designs instructional prompts that guide LLMs to detect and refine their own mistakes. Despite its appeal, prior work has shown that the effectiveness of self-correction is limited in practice. Another line of work (Wang et al., 2024; Yuan et al., 2025; He et al., 2024; Havrilla et al., 2024) introduces external verifiers, such as process reward models (Snell et al., 2024), to identify and filter out error-prone reasoning steps. These methods typically require high-quality training data for the verifier, with data scarcity often mitigated through strategies such as Monte Carlo Tree Search (Guan et al., 2025; Qi et al., 2025; Li, 2025; Zhang et al., 2024). In addition, a recent line of decoding-based approaches (Xu et al., 2025; Ma et al., 2025) seeks to improve reasoning by dynamically adjusting the next-token prediction based on future trajectory probing. In contrast, our method refines reasoning by directly leveraging critical tokens, without relying on stepwise verification or external verifiers. This design underscores both the utility and universality of our approach.
Paraphrasing for LLMs. A growing body of work (Zhou et al., 2024; Chen et al., 2024b) has examined the impact of a problem’s surface form on the reasoning ability of LLMs. Findings (Zhou et al., 2024; Chen et al., 2024b; Huang et al., 2025) suggest that even subtle modifications in phrasing can substantially affect both the reasoning process and the final outcome. Building on this observation, several methods (Yadav et al., 2024; Chen et al., 2024b) leverage beneficial paraphrasing to enhance LLM performance in tasks such as reasoning and intent classification. In addition, paraphrasing has been employed to assess model uncertainty (Gao et al., 2024; Tanneru et al., 2024), thereby enhancing the reliability and trustworthiness of LLM applications. In our work, we utilize paraphrasing as a principled tool to extract critical tokens and to aggregate answers.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Two-Phase Paraphrasing and Consistency Verification Process for AI Answer Validation
### Overview
The image is a technical flowchart illustrating a two-phase method for verifying the correctness of an AI-generated answer to a math word problem. The process involves paraphrasing the original question, generating an initial response, identifying a "critical token" within that response, and then using consistency checks across multiple generated trajectories to determine a final, verified answer. The diagram uses a combination of text boxes, arrows, icons, and color-coding (red for incorrect, green for correct) to depict the workflow.
### Components/Axes
The diagram is divided into two main vertical sections:
1. **Left Section: Phase I: Paraphrastic Probing**
* Contains Steps 1 through 4.
* Uses a light blue background for the phase header.
* Includes example text boxes with a math problem about a bakery.
* Features a red "X" icon indicating an incorrect answer in Step 2.
* Includes a mathematical notation box in Step 3.
* Features a red warning triangle icon in Step 4.
2. **Right Section: Phase II: Consistency Verification**
* Contains Steps 1 through 4.
* Uses a light blue background for the phase header.
* Includes multiple example text boxes showing different answer trajectories.
* Features a green checkmark icon indicating the correct final answer in Step 4.
### Detailed Analysis
**Phase I: Paraphrastic Probing**
* **Step 1: paraphrase the original question.**
* Original Question: "A bakery produces 60 loaves of bread each day... How many loaves of bread are sold in the afternoon?"
* Paraphrased Question: "In a baking bakery, daily production meets the demand for 60 freshly baked loaves. What is the number of loaves sold in the afternoon?"
* **Step 2: generate the initial response.**
* The AI's initial response to the original question is shown: "To solve this problem, we will break it down into steps. Step 1: Calculate the number of loaves sold in the morning. Therefore, the number of loaves of bread sold in the afternoon is 5."
* This response is marked with a red **X**, indicating it is incorrect.
* **Step 3: Concatenate the paraphrased question with the initial answer, and compute the probabilities for top-k tokens and expected tokens at each position.**
* The concatenated input is shown: "[Paraphrased Question]... [Initial Answer]..."
* A box displays token probability calculations:
* `Top-k tokens: p(morning)=0.85, p(5)=0.32, p(10)=0.71, p(number)=0.25, p(sold)=0.85, p(in)=0.22`
* `Expected tokens: p(morning)=0.85, p(5)=0.32, p(10)=0.71, p(number)=0.25, p(sold)=0.85, p(in)=0.22`
* **Step 4: Identify the critical token with the verifier Δ.**
* A calculation is shown: `Δ(morning)=|p(morning)-p'(morning)|=0.85-0.25=0.60, Δ(5)=0.32-0.22=0.10, Δ(10)=0.71-0.48=0.23`
* The conclusion is stated: "The token 'in' is the chosen critical token."
**Phase II: Consistency Verification**
* **Step 1: Obtain the candidate tokens at the critical token position.**
* The system looks at the position of the critical token ("in") in the initial response.
* **Step 2: Truncate the initial answer and replace the critical token with the candidate tokens.**
* Two new prompt versions are created by truncating the initial answer before the critical token and inserting candidate tokens.
* Prompt A ends with "...sold in **during** the afternoon."
* Prompt B ends with "...sold in **the** afternoon."
* **Step 3: Generate new trajectories from partial answers to both the original and paraphrased questions, and the same procedure is also applied to the initial answer.**
* Four example trajectories are shown in a 2x2 grid:
* **Top-Left (Original Question, Prompt A):** Generates an answer ending with "...sold in the afternoon is 5."
* **Top-Right (Paraphrased Question, Prompt A):** Generates an answer ending with "...sold in the afternoon is 5."
* **Bottom-Left (Original Question, Prompt B):** Generates an answer ending with "...sold in the afternoon is 10."
* **Bottom-Right (Paraphrased Question, Prompt B):** Generates an answer ending with "...sold in the afternoon is 10."
* **Step 4: Determine the final answer with consistency mechanism.**
* The final box states: "The answers derived from the second input are more consistent than those from the first input. Thus, the final answer is 10."
* This conclusion is marked with a green **checkmark**.
### Key Observations
1. **Process Flow:** The diagram clearly shows a sequential, two-phase pipeline. Phase I identifies a point of uncertainty (the critical token "in") in an initial, incorrect answer. Phase II uses that token to probe for more consistent answers.
2. **Color Coding:** Red is used exclusively to mark the incorrect initial answer (Step 2, Phase I). Green is used to mark the correct final answer (Step 4, Phase II).
3. **Spatial Layout:** The two phases are presented side-by-side for comparison. Within Phase II, the four generated trajectories are arranged in a 2x2 grid, grouping results by the question type (original vs. paraphrased) and the inserted critical token ("during" vs. "the").
4. **Mathematical Notation:** Step 3 of Phase I includes explicit probability calculations, showing the model's confidence in different tokens at a specific position.
5. **Consistency as a Metric:** The core insight is that the answer "10" is chosen not because it is generated once, but because it appears consistently across multiple generated paths (both question types) when the critical token is replaced with "the".
### Interpretation
This diagram illustrates a sophisticated method for improving the reliability of AI-generated answers, particularly for multi-step reasoning tasks like math word problems. The process acknowledges that an initial AI response may be flawed. Instead of accepting it at face value, the system:
1. **Probes for Uncertainty:** It paraphrases the question and analyzes the model's token probabilities to find a "critical token"—a point where the model's confidence is low or where a small change could alter the answer's meaning (e.g., changing "in" to "during" or "the").
2. **Tests for Consistency:** It then uses this critical token as a lever to generate multiple alternative answer paths. The key hypothesis is that a *correct* answer will be robust and appear consistently across these varied paths, while an *incorrect* answer will be fragile and lead to inconsistent results.
3. **Makes a Data-Driven Decision:** The final answer is selected based on which candidate ("5" or "10") demonstrates greater consistency across the generated trajectories. In this case, "10" is the consistent outcome.
The underlying principle is that **truth is consistent, while error is often arbitrary**. By systematically testing how small perturbations at a point of uncertainty affect the final output, the system can distinguish between a reliably correct answer and a coincidentally generated incorrect one. This method moves beyond simple single-pass generation towards a form of internal verification and consensus-building within the model's own outputs.
</details>
Figure 4: Illustration of the proposed paraphrastic probing and consistency verification (PPCV) framework. The framework comprises two stages: (i) probing critical tokens through paraphrased forms, and (ii) rolling out new reasoning steps with alternative tokens and selecting the final answer using the paraphrase consistency verification mechanism.
<details>
<summary>images/probe_illustration.png Details</summary>

### Visual Description
## [Document Type]: Technical Case Study on Critical Token Identification
### Overview
The image displays a technical case study titled "Case study on the identification and effects of critical tokens." It presents a mathematical word problem in two forms (original and paraphrased) and compares two AI-generated responses. The left response contains errors and is marked with a cross (❌), while the right response is corrected and marked with a checkmark (✔). The core focus is on how identifying and manipulating "critical tokens" (key words or phrases) in the reasoning process can lead to a correct solution.
### Components/Axes
The document is structured into two main vertical columns below a header.
- **Header**: A dark blue banner with white text: "Case study on the identification and effects of critical tokens".
- **Left Column**: Contains the "Original question," "Paraphrased question," and a "Response with highlighted candidate critical tokens." This response is marked as incorrect.
- **Right Column**: Contains a "New responses initialized from the critical token position." This response is marked as correct.
- **Text Formatting**: Key phrases ("critical tokens") are highlighted in **red** and **purple** within the left-column response. The right-column response uses **purple** highlighting for its key terms.
- **Symbols**: A black circle with a white cross (❌) concludes the left response. A black circle with a white checkmark (✔) concludes the right response.
### Detailed Analysis / Content Details
**1. Problem Statements:**
- **Original question:** "In a card game, you are awarded 10 points if you win one round. While playing, Jane ends up with 60 points. If she lost 20 points, then how many rounds did she play? (ground truth answer: 8)"
- **Paraphrased question:** "At a bridge competition, every game won grants the player 10 points. Unfortunately, Jane started with an advantage but ended up losing 20 points and scoring only 60 points. If she had won x rounds and lost none, what would be the value of x? (ground truth answer: 8)"
**2. Left Response (Incorrect):**
- **Title:** "Response with highlighted candidate critical tokens"
- **Content:** The response outlines a two-step solution.
- **Step 1:** "Determine [Understand] the total points Jane earned [scored] by [.] subtracting [winning] the points she lost from her [the] final score. Total points earned = Final score - [+] Points lost. Total points earned = 60 – 20. Total points earned = 40."
- **Step 2:** "Since Jane earns 10 points for [per] each round she [won] wins, we can find the number of rounds she played by dividing the total points earned by the points earned per round. Number of rounds = Total points earned / Points earned per round. Number of rounds = 40 / 10. Number of rounds = 4."
- **Conclusion:** "Therefore, Jane played 4 rounds. ❌"
- **Highlighted Tokens (Red/Purple):** follow, find, Determine, Understand, earned, scored, by, subtracting, winning, her, the, +, for, per, she, won.
**3. Right Response (Correct):**
- **Title:** "New responses initialized from the critical token position"
- **Content:** The response outlines a three-step solution.
- **Step 1:** "Determine the total points Jane earned by **winning** games. Since Jane scored 60 points and lost 20 points, we can set up an equation to represent the situation. Total points = Points earned from winning + Points lost We know that Jane lost 20 points, so we can subtract this from the total points to find the points earned from winning. 60 = Points earned from winning – 20."
- **Step 2:** "Solve for the points earned from winning. Add 20 to both sides of the equation to isolate the points earned from winning. 60 + 20 = Points earned from winning. 80 = Points earned from winning."
- **Step 3:** "Determine the number of games won. Since Jane earns 10 points for each game won, we can divide the points earned from winning by 10 to find the number of games won. Number of games won = Points earned from winning / 10 = 8. ✔"
- **Highlighted Token (Purple):** winning.
### Key Observations
1. **Critical Error in Left Response:** The left response makes a fundamental logical error in Step 1. It incorrectly calculates "Total points earned" as `60 - 20 = 40`. This misinterprets the problem, treating the 20 lost points as a deduction from the final score to find points earned, rather than as a separate component of the total score equation.
2. **Correct Logic in Right Response:** The right response correctly sets up the equation `Total Score = Points from Winning + Points Lost`, leading to `60 = Points from Winning - 20`. Solving this yields `Points from Winning = 80`, which is then divided by 10 points per win to get the correct answer of 8 rounds.
3. **Role of Critical Tokens:** The case study visually demonstrates that the incorrect reasoning path (left) is associated with a cluster of highlighted "critical tokens" (e.g., "subtracting," "for," "per"). The corrected reasoning (right) is initialized from a different critical token position, focusing on the token "**winning**," which leads to the proper equation setup.
4. **Spatial Layout:** The incorrect and correct solutions are presented side-by-side for direct comparison, emphasizing the divergent reasoning paths stemming from different initial token focuses.
### Interpretation
This case study is an investigation into the interpretability and controllability of AI reasoning. It suggests that:
- **Underlying Mechanism:** The "critical tokens" likely represent pivotal points in the model's internal reasoning chain where a specific word or phrase can steer the entire solution trajectory toward correctness or error.
- **Error Source:** The incorrect answer (4) stems from a misinterpretation of the problem's arithmetic structure, specifically the relationship between final score, points won, and points lost. The model incorrectly performs subtraction where it should perform addition.
- **Corrective Action:** By identifying and re-initializing the reasoning process from a different critical token (e.g., "winning" instead of "subtracting"), the model can be guided to construct the correct algebraic relationship (`60 = W - 20`), leading to the right answer.
- **Broader Implication:** The image provides a concrete example of how analyzing and manipulating specific tokens within an AI's generated text can be used to debug errors and improve the reliability of its problem-solving outputs. It moves beyond just stating an answer is wrong to diagnosing *where* and *why* the reasoning diverged from the correct path.
</details>
Figure 5: Case study illustrating the identification and effects of critical tokens identified by our method. Tokens highlighted in red indicate candidate critical tokens, whereas tokens highlighted in purple correspond to alternative tokens generated when conditioning on paraphrased questions.
## 3 Methodology
In this section, we present the two components of our framework in detail: paraphrastic probing and consistency verification. A detailed illustration of our framework is shown in Figure 4. We then discuss the proposed method and provide the complete algorithm.
### 3.1 Paraphrastic Probing
Previous findings (Zhou et al., 2024; Chen et al., 2024b; Huang et al., 2025) on the impact of a problem’s surface form suggest that the quality of intermediate reasoning steps is influenced not only by the underlying mathematical relationships and logic, but also by how the problem is expressed. Notably, LLMs are sometimes able to solve a paraphrased version of a problem that they fail to solve in its original form, highlighting the potential of paraphrasing to uncover pivotal tokens that are critical for successful reasoning. Motivated by this observation, we introduce paraphrastic probing to efficiently identify critical tokens. Given the original question $q_0$, we first prompt the LLM to generate multiple paraphrased forms, denoted $q_1, q_2, \ldots, q_N$, where $N$ is the number of paraphrased questions. We adopt Automatic Prompt Engineering (APE) (Zhou et al., 2022) to derive paraphrasing instructions that preserve semantic integrity and all numerical values, mathematical relationships, and core logical structures of the problem, while maximizing linguistic and contextual diversity. Additional details can be found in Appendix B. We then obtain the initial reasoning path $r^{q_0}_0$ for the original question using greedy decoding. This reasoning path is subsequently concatenated with each paraphrased question, and the resulting synthetic inputs are fed into the LLM to compute the token probability distribution at each position in $r^{q_0}_0$. Specifically, the token probability distribution at the $i$-th position conditioned on the paraphrased question $q_n$ is expressed as
$$
P_i^{q_n}=\mathrm{LLM}\big(\tilde{a}_i \mid I, q_n, r^{q_0}_{0,<i}\big), \tag{1}
$$
where $I$ denotes the instruction prefix and $\tilde{a}_i$ represents the sampled token at the $i$-th position. The token $\tilde{a}_i$ is regarded as a candidate critical token if the predicted top-1 token does not match the expected token at the same position in $r^{q_0}_0$, i.e.,
$$
\operatorname*{arg\,max}_{\tilde{a}_i} P_i^{q_n} \neq a_i, \tag{2}
$$
where $a_i$ denotes the token at the $i$-th position in $r^{q_0}_0$.
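The candidate-selection rule of Eqs. (1)–(2) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the hand-written dictionaries stand in for the per-position distributions $P_i^{q_n}$ that a real LLM would produce, and the token strings are invented for the example.

```python
# Minimal sketch of candidate critical-token detection (Eqs. 1-2).
# Toy per-position distributions replace real LLM next-token probabilities.

def find_candidate_critical_tokens(reasoning_tokens, paraphrase_dists):
    """Flag positions where the top-1 token under a paraphrased question
    disagrees with the token actually present in the initial path.

    reasoning_tokens: list[str], tokens of the initial reasoning path.
    paraphrase_dists: list[dict[str, float]], P_i^{q_n} for each position i.
    Returns (position, expected_token, top1_token) triples.
    """
    candidates = []
    for i, (a_i, dist) in enumerate(zip(reasoning_tokens, paraphrase_dists)):
        top1 = max(dist, key=dist.get)   # arg max of P_i^{q_n}
        if top1 != a_i:                  # mismatch => candidate critical token
            candidates.append((i, a_i, top1))
    return candidates


# Toy example: the paraphrase prefers "winning" over "subtracting" at position 1.
tokens = ["by", "subtracting", "the", "points"]
dists = [
    {"by": 0.9, "via": 0.1},
    {"subtracting": 0.3, "winning": 0.6, "adding": 0.1},
    {"the": 0.8, "her": 0.2},
    {"points": 0.95, "score": 0.05},
]
print(find_candidate_critical_tokens(tokens, dists))  # [(1, 'subtracting', 'winning')]
```

In practice the distributions would come from a forward pass over the concatenated paraphrased question and initial reasoning path; the mismatch test itself is unchanged.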
<details>
<summary>images/condition_one_gsm8k.png Details</summary>

### Visual Description
## Bar Chart: Token Type Distribution by Average Accuracy
### Overview
The image is a grouped bar chart with error bars, comparing the fractional percentage of "critical tokens" versus "random tokens" across two categories of average accuracy. The chart visually demonstrates an inverse relationship between token type and accuracy range.
### Components/Axes
* **Y-Axis:** Labeled "Fraction(%)". Scale ranges from 0 to 70, with major tick marks at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70).
* **X-Axis:** Labeled "Average accuracy(%)". Contains two categorical groups:
1. `≤ 5%` (Less than or equal to 5 percent)
2. `> 5%` (Greater than 5 percent)
* **Legend:** Positioned in the top-right corner of the plot area.
* Darker teal bar: `critical tokens`
* Lighter teal bar: `random tokens`
* **Data Representation:** Each category on the x-axis has two adjacent bars (one for each token type). Each bar is topped with a black error bar (I-beam style), indicating variability or confidence intervals.
### Detailed Analysis
**1. Category: ≤ 5% Average Accuracy**
* **Critical Tokens (Darker Teal):** The bar is tall, reaching approximately **69-70%** on the y-axis. The error bar extends from roughly **67% to 72%**.
* **Random Tokens (Lighter Teal):** The bar is significantly shorter, reaching approximately **32%**. The error bar extends from roughly **27% to 37%**.
* **Trend:** In the low-accuracy range (≤ 5%), the fraction of critical tokens is more than double that of random tokens.
**2. Category: > 5% Average Accuracy**
* **Critical Tokens (Darker Teal):** The bar is short, reaching approximately **30-31%**. The error bar extends from roughly **28% to 33%**.
* **Random Tokens (Lighter Teal):** The bar is tall, reaching approximately **68%**. The error bar extends from roughly **63% to 73%**.
* **Trend:** In the higher-accuracy range (> 5%), the pattern reverses. The fraction of random tokens is more than double that of critical tokens.
### Key Observations
* **Inverse Relationship:** There is a clear crossover effect. The token type that dominates in the low-accuracy category (`critical tokens`) becomes the minority in the high-accuracy category, and vice-versa.
* **Symmetry of Magnitude:** The dominant fraction in each category is similar in magnitude (~68-70%), as is the subordinate fraction (~30-32%).
* **Error Bar Consistency:** The error bars for all data points are of similar relative size, suggesting consistent variability across measurements. The error bars for the dominant bars in each category do not overlap with their subordinate counterparts, indicating the differences are likely statistically significant.
### Interpretation
This chart suggests a strong correlation between token type and model performance accuracy. The data implies that:
1. **Critical tokens are strongly associated with lower accuracy.** When a model's average accuracy is very low (≤ 5%), a high proportion of the tokens involved are classified as "critical."
2. **Random tokens are strongly associated with higher accuracy.** Conversely, when the model achieves higher accuracy (> 5%), a high proportion of the tokens are "random."
This could indicate that "critical tokens" are those pivotal to a task's outcome; their mismanagement leads to failure (low accuracy). "Random tokens" might represent less consequential or correctly handled elements that proliferate in successful (high accuracy) outcomes. The chart effectively visualizes a potential diagnostic metric: a high fraction of critical tokens may be a signal of a model operating in a low-accuracy regime.
</details>
(a)
<details>
<summary>images/condition_two_gsm8k.png Details</summary>

### Visual Description
## Grouped Bar Chart: Token Type Fraction vs. Average Accuracy
### Overview
The image displays a grouped bar chart comparing the fractional percentage of two token types ("critical tokens" and "random tokens") across two categories of average accuracy. The chart includes error bars for each data point, indicating variability or confidence intervals.
### Components/Axes
* **Chart Type:** Grouped bar chart with error bars.
* **Y-Axis:**
* **Label:** `Fraction(%)`
* **Scale:** Linear, ranging from 0 to 60, with major tick marks at intervals of 10 (0, 10, 20, 30, 40, 50, 60).
* **X-Axis:**
* **Label:** `Average accuracy(%)`
* **Categories:** Two discrete categories are plotted:
1. `≤ 10%` (Less than or equal to 10 percent)
2. `> 10%` (Greater than 10 percent)
* **Legend:**
* **Position:** Top-center of the plot area.
* **Series:**
1. `critical tokens` - Represented by a teal-colored bar.
2. `random tokens` - Represented by a light green-colored bar.
* **Data Series & Spatial Grounding:**
* For each x-axis category, two bars are placed side-by-side. The left bar in each pair corresponds to "critical tokens" (teal), and the right bar corresponds to "random tokens" (light green).
### Detailed Analysis
**Category 1: Average accuracy ≤ 10%**
* **Critical Tokens (Teal Bar, Left):** The bar height indicates a fraction of approximately **62%**. An error bar extends from roughly 60% to 64%.
* **Random Tokens (Light Green Bar, Right):** The bar height indicates a fraction of approximately **40%**. An error bar extends from roughly 38% to 42%.
**Category 2: Average accuracy > 10%**
* **Critical Tokens (Teal Bar, Left):** The bar height indicates a fraction of approximately **37%**. An error bar extends from roughly 35% to 39%.
* **Random Tokens (Light Green Bar, Right):** The bar height indicates a fraction of approximately **60%**. An error bar extends from roughly 58% to 62%.
**Trend Verification:**
* The fraction of **critical tokens** shows a clear **downward trend** as average accuracy increases, dropping from ~62% in the low-accuracy group to ~37% in the high-accuracy group.
* The fraction of **random tokens** shows a clear **upward trend** as average accuracy increases, rising from ~40% in the low-accuracy group to ~60% in the high-accuracy group.
### Key Observations
1. **Inverse Relationship:** There is a strong inverse relationship between the two token types across the accuracy categories. When one is high, the other is low.
2. **Dominant Token Type Flips:** In the low-accuracy (`≤ 10%`) scenario, critical tokens are the dominant fraction (~62% vs. ~40%). In the high-accuracy (`> 10%`) scenario, random tokens become the dominant fraction (~60% vs. ~37%).
3. **Magnitude of Change:** The change in fraction for both token types between the two accuracy categories is substantial, on the order of 20-25 percentage points.
4. **Error Bars:** The error bars are relatively small compared to the differences between the bars, suggesting the observed differences between token types and across categories are likely statistically meaningful.
### Interpretation
This chart suggests a fundamental shift in the composition of tokens based on model performance (average accuracy).
* **Low Accuracy (≤ 10%):** The high fraction of "critical tokens" implies that when a model performs poorly, its outputs or internal states are disproportionately composed of tokens deemed "critical." This could mean the model is struggling with or over-representing key, high-stakes, or error-prone components of the task.
* **High Accuracy (> 10%):** The reversal, where "random tokens" dominate, suggests that as model accuracy improves, the proportion of these "critical" tokens decreases significantly. The model's operation becomes characterized more by "random" tokens. This could indicate that high performance is associated with a more balanced, less error-focused, or more fluent distribution of tokens, where the specific "critical" tokens are no longer the primary driver of the output.
**Underlying Question:** The data prompts an investigation into the definitions of "critical" and "random" tokens. The chart demonstrates that these categories are not static; their prevalence is a strong function of model accuracy. This relationship is crucial for understanding model behavior, diagnosing failure modes (where critical tokens dominate), and characterizing successful operation (where random tokens are more prevalent).
</details>
(b)
<details>
<summary>images/compare_ct_rnd_gsm8k.png Details</summary>

### Visual Description
## Line Chart: pass@k(%) vs. number of sample k
### Overview
The image is a line chart comparing the performance of two methods, "critical tokens" and "random tokens," across different sample sizes (k). The performance metric is "pass@k(%)" which likely represents the percentage of problems solved correctly when given k samples. The chart includes error bars for each data point, indicating variability or confidence intervals.
### Components/Axes
* **Chart Type:** Line chart with error bars.
* **X-Axis:**
* **Label:** "number of sample k"
* **Scale:** Linear scale from approximately 0 to 50.
* **Major Tick Marks:** 10, 20, 30, 40.
* **Y-Axis:**
* **Label:** "pass@k(%)"
* **Scale:** Linear scale from 50 to approximately 88.
* **Major Tick Marks:** 50, 55, 60, 65, 70, 75, 80, 85.
* **Legend:**
* **Location:** Bottom-right corner of the plot area.
* **Series 1:** Red line with upward-pointing triangle markers, labeled "critical tokens".
* **Series 2:** Purple (magenta) line with star markers, labeled "random tokens".
* **Grid:** Dashed gray grid lines are present for both major x and y ticks.
### Detailed Analysis
**Data Series: critical tokens (Red line, triangle markers)**
* **Trend:** The line shows a clear upward trend, with a steep initial increase that gradually flattens as k increases.
* **Data Points (Approximate):**
* k ≈ 5: pass@k ≈ 71% (Error bar range: ~70% to ~73%)
* k ≈ 10: pass@k ≈ 78% (Error bar range: ~76% to ~80%)
* k ≈ 15: pass@k ≈ 82% (Error bar range: ~80.5% to ~83.5%)
* k ≈ 20: pass@k ≈ 84% (Error bar range: ~82.5% to ~85.5%)
* k ≈ 30: pass@k ≈ 85% (Error bar range: ~83.5% to ~86.5%)
* k ≈ 45: pass@k ≈ 86% (Error bar range: ~85% to ~87.5%)
**Data Series: random tokens (Purple line, star markers)**
* **Trend:** The line also shows an upward trend, but it is consistently lower than the "critical tokens" series. The slope is more gradual throughout.
* **Data Points (Approximate):**
* k ≈ 5: pass@k ≈ 51% (Error bar range: ~48% to ~54%)
* k ≈ 10: pass@k ≈ 57% (Error bar range: ~53.5% to ~60%)
* k ≈ 15: pass@k ≈ 60% (Error bar range: ~57% to ~63%)
* k ≈ 20: pass@k ≈ 61.5% (Error bar range: ~58.5% to ~64.5%)
* k ≈ 30: pass@k ≈ 62.5% (Error bar range: ~59.5% to ~65.5%)
* k ≈ 45: pass@k ≈ 64% (Error bar range: ~60.5% to ~67.5%)
### Key Observations
1. **Performance Gap:** There is a significant and consistent performance gap between the two methods. "Critical tokens" outperforms "random tokens" at every measured value of k.
2. **Diminishing Returns:** Both curves show diminishing returns. The largest performance gains for both methods occur when increasing k from 5 to 15. The improvement per additional sample becomes smaller as k grows larger.
3. **Error Bar Comparison:** The error bars for the "random tokens" series appear visually larger (especially at k=5 and k=45) than those for the "critical tokens" series. This suggests that the performance of the random method is more variable or less certain.
4. **Convergence:** The two lines do not appear to be converging. The absolute difference in pass@k between the two methods remains roughly constant (around 20-22 percentage points) across the range of k shown.
### Interpretation
This chart demonstrates the superior effectiveness of a "critical tokens" strategy over a "random tokens" strategy for the task measured by pass@k. The data suggests that intelligently selecting or focusing on "critical" tokens leads to a much higher success rate than using randomly selected tokens, regardless of the number of samples (k) considered.
The consistent gap indicates that the advantage of the critical token approach is fundamental and not merely a function of sample size. The larger error bars for the random method imply it is less reliable. The diminishing returns for both methods are typical in sampling-based approaches, but the critical token method reaches a high level of performance (over 80%) with a relatively small number of samples (k=15), making it potentially more efficient. The chart provides strong empirical evidence for prioritizing the identification and use of critical tokens in this context.
</details>
(c)
<details>
<summary>images/density_gsm8k.png Details</summary>

### Visual Description
## Histogram: Consistency Score Density by Answer Correctness
### Overview
The image displays a histogram comparing the density distribution of "consistency scores" for two groups: one associated with incorrect answers and one with correct answers. The chart uses overlapping, semi-transparent bars to show the frequency (density) of scores ranging from 0 to 5.
### Components/Axes
* **Chart Type:** Histogram (overlapping bar chart).
* **X-Axis:**
* **Label:** `consistency score`
* **Scale:** Linear, with integer markers at `0`, `1`, `2`, `3`, `4`, `5`.
* **Y-Axis:**
* **Label:** `density(%)`
* **Scale:** Linear, ranging from `0.0` to `0.7` with increments of `0.1`.
* **Legend:**
* **Position:** Top-right corner of the chart area.
* **Entry 1:** A light teal (mint green) rectangle labeled `w incorrect answers`.
* **Entry 2:** A light pink (salmon) rectangle labeled `w correct answers`.
* **Data Series:**
1. **`w incorrect answers` (Teal Bars):** Represents the density distribution of consistency scores for instances where answers were incorrect.
2. **`w correct answers` (Pink Bars):** Represents the density distribution of consistency scores for instances where answers were correct.
### Detailed Analysis
**Trend Verification & Data Points (Approximate Values):**
* **`w incorrect answers` (Teal):**
* **Trend:** The distribution is heavily right-skewed, with a very high peak at the lowest score and a sharp decline as the score increases.
* **Data Points:**
* Score 0: Density ≈ `0.68` (68%)
* Score 1: Density ≈ `0.18` (18%)
* Score 2: Density ≈ `0.07` (7%)
* Score 3: Density ≈ `0.04` (4%)
* Score 4: Density ≈ `0.03` (3%)
* Score 5: Density ≈ `0.03` (3%)
* **`w correct answers` (Pink):**
* **Trend:** The distribution is left-skewed, with low density at low scores, rising to a peak at score 4 before dropping sharply at score 5.
* **Data Points:**
* Score 0: Density ≈ `0.05` (5%)
* Score 1: Density ≈ `0.05` (5%)
* Score 2: Density ≈ `0.10` (10%)
* Score 3: Density ≈ `0.15` (15%)
* Score 4: Density ≈ `0.63` (63%)
* Score 5: Density ≈ `0.03` (3%)
**Spatial Grounding & Component Isolation:**
* The legend is positioned in the **top-right** quadrant, partially overlapping the chart's plot area.
* The teal bars for `w incorrect answers` dominate the **left side** of the chart (scores 0-1).
* The pink bars for `w correct answers` dominate the **right-center** of the chart (score 4).
* The area of overlap (appearing as a muted brownish-pink) is most visible at scores 0, 1, and 2, indicating where both distributions have non-zero density.
### Key Observations
1. **Polar Opposite Peaks:** The two distributions have nearly inverse shapes. Incorrect answers are overwhelmingly concentrated at the lowest consistency score (0), while correct answers are overwhelmingly concentrated at a high consistency score (4).
2. **Minimal Overlap at Extremes:** There is very little overlap in the high-density regions. The peak for incorrect answers (score 0) has very low density for correct answers, and the peak for correct answers (score 4) has very low density for incorrect answers.
3. **Score 5 Anomaly:** Both distributions show a very low and similar density at the maximum score of 5, suggesting that achieving perfect consistency (score 5) is rare for both correct and incorrect answers in this dataset.
4. **Clear Separation:** The chart visually demonstrates a strong separation between the consistency profiles of correct and incorrect answers.
### Interpretation
This histogram suggests a powerful correlation between the **consistency score** and the **correctness of an answer**.
* **Low Consistency (Scores 0-1)** is a strong indicator of an **incorrect answer**. The data implies that when an answer is inconsistent (scoring 0 or 1), it is highly likely to be wrong.
* **High Consistency (Score 4)** is a strong indicator of a **correct answer**. The peak at score 4 for correct answers suggests that a high level of consistency is a key characteristic of correct responses in this context.
* **The "Sweet Spot":** The optimal consistency score for correctness appears to be **4**, not the maximum of 5. This could indicate that perfect consistency (score 5) might be associated with overfitting, rigidity, or a different type of error, while score 4 represents a robust and reliable level of consistency.
* **Practical Implication:** If this consistency score is a metric from a model or evaluation system, it can be used as a highly effective filter or confidence indicator. Answers with a consistency score of 0 or 1 can be flagged as likely incorrect with high confidence, while answers with a score of 4 can be treated as highly reliable. The clear bimodal separation makes this a particularly useful diagnostic tool.
</details>
(d)
Figure 6: (a) Comparison of the fraction of tokens whose associated rollout correctness falls below or above 5%, for tokens extracted by our method versus randomly selected tokens. (b) Comparison of the fraction of the subsequent tokens whose associated rollout correctness falls below or above 10%, for tokens extracted by our method versus randomly selected tokens. (c) Comparison of the impact of extracted tokens by our method versus random tokens on LLM reasoning performance. (d) Comparison of the density distributions of consistency scores for rollouts with correct and incorrect answers.
To validate the effectiveness of our method in identifying critical tokens and to demonstrate the pivotal role these tokens play when extracted via paraphrastic probing, we conduct a case study illustrated in Figure 5. Because large language models can adjust their reasoning trajectories under the paraphrased form of the question, certain tokens in the original reasoning sequence produce different subsequent tokens when conditioned on paraphrased inputs. In this example, most candidate critical tokens are located within incorrect reasoning segments. Notably, when we identify “subtracting” as a critical token and replace it with an alternative token (i.e., “winning”), the resulting rollout corrects the erroneous reasoning steps and yields the correct final answer. This example highlights the impact of critical tokens and underscores the effectiveness of our method in identifying tokens that are pivotal to reasoning outcomes.
In addition, we conduct a quantitative analysis to examine the validity and impact of tokens extracted via paraphrastic probing, comparing them against randomly selected tokens. Specifically, we sample 100 instances with incorrect reasoning trajectories from the GSM8K (Cobbe et al., 2021) training set. Following the paraphrastic probing pipeline, we identify candidate critical tokens in the early reasoning steps. In each run, we randomly sample 40% of the candidate critical tokens for evaluation and repeat the experiment independently 10 times. For comparison, we apply the same procedure to randomly selected tokens. All evaluations are conducted using Llama-3.1-8B-Instruct (Grattafiori et al., 2024). We first compute the average accuracy of rollouts generated from partial trajectories truncated at the positions of the extracted tokens, and compare these results with those obtained from random-token truncation. As shown in Figure 6(a), a large proportion of the extracted tokens exhibit near-zero correctness, consistent with the first criterion of critical tokens. We further evaluate the average accuracy of rollouts initiated from subsequent positions (5–10 tokens later). The results in Figure 6(b) indicate that errors persist beyond the identified positions, supporting the second criterion that critical tokens induce sustained degradation in downstream reasoning. Finally, we substitute the extracted critical tokens with alternative tokens and roll out multiple new reasoning trajectories. As shown in Figure 6(c), replacing critical tokens yields a significantly larger improvement in reasoning accuracy than substituting random tokens. These results further validate both the pivotal role of the identified critical tokens and the effectiveness of our paraphrastic probing method. Additional results can be found in Appendix A.
We introduce a heuristic verifier to select the final critical token from multiple candidates. For a candidate token $a_i$ and paraphrased question $q_n$ , the verification score is defined as
$$
\Delta_{q_n}(a_i)=\max P_i^{q_n}-P_i^{q_n}(\tilde{a}_i=a_i), \tag{3}
$$
where $P_i^{q_n}$ denotes the predictive distribution at position $i$ on question $q_n$. Intuitively, $\Delta$ measures how much the predicted top-1 token deviates from the expected token, indicating the token’s potential impact on the reasoning trajectory. For each extracted token $a_i$, we take the maximum score across paraphrases,
$$
\Delta(a_i)=\max_{q_n}\Delta_{q_n}(a_i), \tag{4}
$$
and select the final critical token as
$$
a_c=\operatorname*{arg\,max}_i\,\Delta(a_i). \tag{5}
$$
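The selection rule in Eqs. (3)–(5) can be sketched in a few lines. This is an illustrative reconstruction: `dists_per_candidate[i][n]`, the model's predictive distribution over the vocabulary at the position of candidate $i$ under paraphrase $q_n$, is assumed to come from a forward pass of the LLM.

```python
import numpy as np

def verification_score(dist, expected_id):
    # Eq. (3): gap between the top-1 probability and the probability
    # assigned to the token that actually appears in the trajectory.
    return float(dist.max() - dist[expected_id])

def select_critical_token(dists_per_candidate, expected_ids):
    # Eq. (4): take the maximum score across paraphrases per candidate;
    # Eq. (5): pick the candidate with the largest score as a_c.
    deltas = [
        max(verification_score(d, tok) for d in dists)
        for dists, tok in zip(dists_per_candidate, expected_ids)
    ]
    return int(np.argmax(deltas)), deltas
```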
### 3.2 Consistency Verification
After identifying the final critical token $a_c$, we refine the original reasoning path with alternative tokens and obtain the final answer through a paraphrase consistency mechanism. Specifically, we generate a set of alternative tokens $a_c^0, a_c^1, a_c^2, \ldots, a_c^{K-1}$ using the LLM conditioned on the original question $q_0$, where $a_c^0$ is the original token in $r_0^{q_0}$ and the remaining tokens are sampled via top-K sampling. The initial reasoning path is truncated at the position of the critical token, and each alternative token is appended to form synthetic inputs $\tilde{r}_c^0, \tilde{r}_c^1, \tilde{r}_c^2, \ldots, \tilde{r}_c^{K-1}$. We then roll out new reasoning trajectories for each synthetic input with respect to both the original and paraphrased questions using greedy decoding, denoted as $r_k^{q_0}, r_k^{q_1}, \ldots, r_k^{q_N}$ for $k=0,1,2,\ldots,K-1$. Next, for the rollout with the $k$th alternative token, we compare the answers obtained from the paraphrased forms with that from the original form and compute a consistency score $c_k=\sum_{n=1}^{N}\mathbb{I}(\Phi(r_k^{q_0})=\Phi(r_k^{q_n}))$, where $\Phi(\cdot)$ denotes the function that extracts the final answer from a reasoning trajectory and $\mathbb{I}(\cdot)$ denotes the indicator function. The answer associated with the highest consistency score is then selected as the final prediction
$$
\mathrm{ans}_f=\Phi(r_{k^*}^{q_0}),\quad\text{where }k^*=\operatorname*{arg\,max}_k c_k. \tag{6}
$$
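The vote in Eq. (6) reduces to an argmax over agreement counts. In the sketch below, `extract_answer` stands in for $\Phi(\cdot)$ and `rollouts[k][n]` for $r_k^{q_n}$; both names are illustrative rather than from the paper's code.

```python
def consistency_select(rollouts, extract_answer):
    # rollouts[k][n]: trajectory for the k-th alternative token under
    # question q_n (n = 0 is the original question).
    best_k, best_c = 0, -1
    for k, trajs in enumerate(rollouts):
        ans0 = extract_answer(trajs[0])
        # Consistency score c_k: how many paraphrases agree with the
        # answer obtained under the original question.
        c_k = sum(extract_answer(t) == ans0 for t in trajs[1:])
        if c_k > best_c:
            best_k, best_c = k, c_k
    return extract_answer(rollouts[best_k][0])
```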
To justify our paraphrase consistency mechanism, we investigate the impact of paraphrased forms on LLM reasoning. We sample instances from the GSM8K (Cobbe et al., 2021) and follow our pipeline to extract critical tokens. From each truncated reasoning trajectory, we roll out multiple reasoning paths by concatenating alternative tokens. For each original question, we generate five paraphrased variants and compute the consistency score for resulting rollouts. The evaluation is conducted on Llama-3.1-8B-Instruct (Grattafiori et al., 2024). We then analyze the distribution of consistency scores for rollouts that yield correct versus incorrect answers. As shown in Figure 6(d), more than 90% of rollouts with correct answers achieve a consistency score of at least 1, whereas this proportion drops to around 30% for rollouts with incorrect answers. This sharp contrast indicates that correct rollouts are more robust across paraphrased variants, motivating the design of our paraphrase consistency mechanism to exploit this property for improved final predictions.
To address potential collisions when multiple answers attain the same maximum consistency score, we introduce similarity-weighted consistency verification. Inspired by weighted majority voting (Dogan and Birant, 2019), this approach adjusts the influence of each paraphrased form on the consistency score according to its similarity to the original form. Intuitively, paraphrased forms with lower similarity should receive greater weight, as agreement under them provides stronger evidence of robustness, whereas those closely resembling the original form contribute less. Concretely, we first extract embeddings for both the original and paraphrased questions and compute their similarity scores as $s_n=\mathrm{sim}(q_0,q_n)$, where $\mathrm{sim}(\cdot)$ denotes a similarity measure such as cosine similarity. We then derive weights via a softmax function $w_n=\frac{\exp(-\lambda s_n)}{\sum_{n'=1}^{N}\exp(-\lambda s_{n'})}$, where $\lambda$ is the temperature scaling coefficient. Finally, the similarity-weighted consistency score is defined as $\tilde{c}_k=\sum_{n=1}^{N}w_n\,\mathbb{I}(\Phi(r_k^{q_0})=\Phi(r_k^{q_n}))$. This ensures that agreement with more diverse paraphrases contributes more strongly to the final decision.
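A minimal sketch of the similarity-weighted variant, assuming precomputed similarities `sims` between the original question's embedding and each paraphrase's (the function names are illustrative):

```python
import numpy as np

def similarity_weights(sims, lam=2.0):
    # w_n ∝ exp(-lam * s_n): less-similar paraphrases receive larger weight.
    logits = -lam * np.asarray(sims, dtype=float)
    exp = np.exp(logits - logits.max())   # subtract max for numerical stability
    return exp / exp.sum()

def weighted_consistency(orig_answers, para_answers, weights):
    # c~_k = sum_n w_n * 1[answer under q_0 == answer under q_n], per k.
    return [
        sum(w for w, a in zip(weights, para_k) if a == a0)
        for a0, para_k in zip(orig_answers, para_answers)
    ]
```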
Table 1: Comparison of our method with baseline approaches on Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.2.
| Model | Method | GSM8K | GSM-Hard | Math500 | SVAMP | ARC |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1 | Chain-of-Thought | 77.40 | 28.00 | 31.00 | 83.00 | 58.91 |
| | Self-Consistency | 80.60 | 31.80 | 37.80 | 85.10 | 60.75 |
| | Tree-of-Thought | 75.74 | 33.28 | 31.60 | 81.20 | 80.72 |
| | Guided Decoding | 75.51 | 32.45 | 31.20 | 81.70 | 81.74 |
| | Predictive Decoding | 81.43 | 40.26 | 34.00 | 85.90 | 84.56 |
| | Phi-Decoding | 86.58 | 39.88 | 38.20 | 84.50 | 85.41 |
| | PPCV (Ours) | 88.24 | 49.73 | 50.00 | 89.60 | 88.31 |
| Mistral-7B | Chain-of-Thought | 46.45 | 26.91 | 12.20 | 62.40 | 41.42 |
| | Self-Consistency | 50.38 | 28.65 | 14.20 | 66.70 | 44.54 |
| | Tree-of-Thought | 50.49 | 25.78 | 11.40 | 60.60 | 41.04 |
| | Guided Decoding | 50.79 | 27.07 | 14.00 | 62.90 | 39.51 |
| | Predictive Decoding | 55.67 | 27.07 | 14.40 | 62.10 | 47.87 |
| | Phi-Decoding | 56.60 | 28.43 | 13.40 | 63.20 | 60.24 |
| | PPCV (Ours) | 56.58 | 31.08 | 14.60 | 69.30 | 69.88 |
Table 2: Comparison of our method with baseline approaches on Qwen3-32B (non-thinking mode).
| Method | AIME2024 | AIME2025 | BRUMO2025 | HMMT2025 |
| --- | --- | --- | --- | --- |
| Chain-of-Thought | 30.00 | 23.67 | 30.00 | 9.67 |
| Guided Decoding | 26.67 | 22.67 | 28.67 | 7.33 |
| Predictive Decoding | 32.67 | 24.00 | 33.33 | 10.33 |
| Phi-Decoding | 33.60 | 24.33 | 36.67 | 10.67 |
| PPCV (Ours) | 40.00 | 26.00 | 43.33 | 13.33 |
### 3.3 Discussion
Our technical contributions differ from prior works in three distinct ways. First, prior works (Zhou et al., 2024; Chen et al., 2024b; Yadav et al., 2024) typically use paraphrasing merely to expand the solution space. In contrast, we introduce Paraphrastic Probing, a mechanism that uses paraphrasing to test the model’s internal confidence. By analyzing the discrepancy in token-level logits of the initial trajectory between the original and paraphrased questions, we can rigorously pinpoint the critical tokens that may lead to errors in subsequent steps. This transforms paraphrasing from a generation tool into a precise, token-level diagnostic tool. Second, prior works (Zhou et al., 2024; Chen et al., 2024b) typically rely on simple majority voting across multiple solutions. Our paraphrase consistency mechanism is technically distinct: it validates answers based on their robustness across semantic variations of the problem statement. We further introduce a similarity-weighted consistency metric that weights answers by the linguistic diversity of the paraphrases, offering a more nuanced selection criterion than simple frequency counts. Finally, a major technical limitation in current reasoning research is the reliance on external models or human-annotated error steps. Our method contributes a fully self-contained pipeline that identifies and corrects errors using the model’s own sensitivity to surface-form perturbations. More discussion on the impact of critical tokens on correct trajectories can be found in Appendix D.
In addition, although we select only the top candidate for the primary experiments to maintain computational efficiency, the framework naturally extends to the multi-critical-token setting. With multiple critical tokens, we can generate alternative tokens for each identified position and apply paraphrase consistency across the new rollouts, allowing the model to refine multiple segments of its intermediate reasoning rather than only one. The details of the algorithm can be found in Appendix C.
Table 3: Comparison of model performance when using critical tokens versus random tokens.
| Method | GSM8K | GSM-Hard | Math500 | SVAMP | ARC |
| --- | --- | --- | --- | --- | --- |
| Chain-of-Thought | 77.40 | 28.00 | 31.00 | 83.00 | 58.91 |
| random tokens | 82.08 | 40.29 | 42.12 | 84.77 | 75.68 |
| critical tokens (Ours) | 88.24 | 49.73 | 50.00 | 89.60 | 88.31 |
Table 4: Comparison of our proposed paraphrase consistency against the majority voting.
| Method | GSM8K | GSM-Hard | Math500 | SVAMP | ARC |
| --- | --- | --- | --- | --- | --- |
| Chain-of-Thought | 77.40 | 28.00 | 31.00 | 83.00 | 58.91 |
| majority voting | 87.20 | 47.36 | 48.19 | 88.80 | 86.16 |
| paraphrase consistency (Ours) | 88.24 | 49.73 | 50.00 | 89.60 | 88.31 |
## 4 Experiments
In this section, we first describe the experimental setup, followed by the main results of our proposed method compared to the baselines. We also perform an ablation study and a computational cost analysis.
### 4.1 Setup
Datasets. To comprehensively assess our method, we evaluate it on nine benchmarks. Eight focus on mathematical reasoning: GSM8K (Cobbe et al., 2021), GSM-Hard (Gao et al., 2023), SVAMP (Patel et al., 2021), Math500 (Hendrycks et al., 2021), and the more challenging competition-level datasets AIME2024, AIME2025, BRUMO2025, and HMMT2025 (Balunović et al., 2025). In addition, we use ARC-Challenge (Clark et al., 2018) to evaluate the knowledge reasoning ability of LLMs.
Baselines. In our experiments, we use Chain-of-Thought (CoT) (Wei et al., 2022), Self-Consistency (Wang et al., 2023), Tree-of-Thought (ToT) (Yao et al., 2023), Guided Decoding (Xie et al., 2023), Predictive Decoding (Ma et al., 2025), and Phi-Decoding (Xu et al., 2025) as baseline methods.
Metric. Following prior work, we adopt pass@k (k=1,4) as the primary evaluation metric.
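Pass@k is commonly computed with the unbiased estimator from the code-generation literature: given $n$ sampled solutions of which $c$ are correct, pass@k $= 1 - \binom{n-c}{k}/\binom{n}{k}$. Whether the paper uses this exact estimator is our assumption; the sketch below illustrates it.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of pass@k from n samples of which c are correct:
    # 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```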
Implementation Details. In our experiments, we adopt Llama-3.1-8B-Instruct (Grattafiori et al., 2024), Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), Qwen-3-32B (Yang et al., 2025a), and DeepSeek-R1-Distill-Llama-70B as the target models. We employ the non-thinking mode for Qwen-3-32B. Throughout our method, we employ the same model for generating paraphrased problems, identifying critical tokens, and producing new rollouts. In the first stage, we generate 4 paraphrased variants for each problem in the math benchmarks and 3 variants for each problem in the ARC dataset. In the second stage, we select the top 10 tokens for new rollouts, with the temperature scaling coefficient $\lambda$ set to 2. For fair comparison, we ensure a comparable inference budget across methods. Specifically, we roll out 48 samples for Self-Consistency (Wang et al., 2023). For Predictive Decoding (Ma et al., 2025) and Phi-Decoding (Xu et al., 2025), we roll out 4–8 samples per foresight step, and each problem typically involves 5–8 foresight steps. We also adopt a zero-shot CoT prompt to elicit the new rollouts. For the baselines, we strictly follow their original settings, including temperature values, sampling strategies, and the number of few-shot examples. All experiments are conducted on NVIDIA A100 GPUs.
<details>
<summary>images/topk.png Details</summary>

### Visual Description
## Line Chart: Performance (pass@1%) vs. Number of Alternative Tokens
### Overview
This is a line chart comparing the performance of two models or methods, labeled "GSM8K" and "SVAMP," on a metric called "pass@1(%)" as a function of the "number of alternative tokens." The chart shows that performance for both series generally increases as the number of alternative tokens increases, though the rate of improvement differs.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "number of alternative tokens"
* **Scale:** Linear, with major tick marks at integers 3, 4, 5, 6, 7, 8, 9, 10.
* **Range:** 3 to 10.
* **Y-Axis:**
* **Label:** "pass@1(%)"
* **Scale:** Linear, with major tick marks every 2 units from 80 to 94.
* **Range:** 80 to 94.
* **Legend:**
* **Position:** Top-right corner of the plot area.
* **Series 1:** "GSM8K" - Represented by a yellow-green line with square markers.
* **Series 2:** "SVAMP" - Represented by a cyan/teal line with hexagonal markers.
* **Grid:** A dashed gray grid is present for both major x and y ticks.
* **Background:** White.
### Detailed Analysis
**Data Series: GSM8K (Yellow-Green Line, Square Markers)**
* **Trend:** The line slopes upward from left to right, indicating a positive correlation between the number of alternative tokens and pass@1 performance. The slope is steeper between x=3 and x=7, after which it flattens significantly.
* **Data Points (Approximate):**
* At x=3: y ≈ 84.8%
* At x=5: y ≈ 86.7%
* At x=7: y ≈ 88.1%
* At x=10: y ≈ 88.2%
**Data Series: SVAMP (Cyan Line, Hexagonal Markers)**
* **Trend:** The line also slopes upward, showing a consistent positive trend. The slope appears more constant across the entire range compared to GSM8K.
* **Data Points (Approximate):**
* At x=3: y ≈ 87.0%
* At x=5: y ≈ 87.4%
* At x=7: y ≈ 88.2%
* At x=10: y ≈ 89.6%
### Key Observations
1. **Performance Gap:** At the lowest measured point (3 alternative tokens), SVAMP starts with a higher performance (~87.0%) than GSM8K (~84.8%).
2. **Convergence and Divergence:** The two lines converge near x=7, where their performance is nearly identical (~88.1% vs. ~88.2%). After this point, they diverge again, with SVAMP continuing to improve while GSM8K plateaus.
3. **Saturation Point:** The GSM8K series shows clear performance saturation. Increasing the number of alternative tokens from 7 to 10 yields a negligible gain of approximately 0.1 percentage points.
4. **Consistent Improvement:** The SVAMP series does not show a similar plateau within the measured range, suggesting its performance may continue to improve with more than 10 alternative tokens.
### Interpretation
The chart demonstrates the impact of increasing the "number of alternative tokens" on model performance for two different benchmarks or tasks (GSM8K and SVAMP). The "pass@1(%)" metric likely measures the accuracy or success rate of a model in a single attempt.
The data suggests that:
* Providing more alternative tokens is generally beneficial for both tasks, improving the model's chance of selecting a correct token.
* The benefit of additional tokens is task-dependent. The SVAMP task appears to benefit more consistently from a larger set of alternatives, as its performance curve does not flatten within the observed range.
* The GSM8K task reaches a point of diminishing returns around 7 alternative tokens. This could indicate that for this specific task, the model's ability to identify the correct token is largely saturated with a moderate number of options, and further expansion of the candidate set provides little additional value.
* The initial performance gap and the different saturation behaviors imply that the underlying difficulty or nature of the two tasks (GSM8K vs. SVAMP) is different, requiring different strategies or amounts of "alternative" information for optimal performance.
</details>
(a)
<details>
<summary>images/time.png Details</summary>

### Visual Description
## Grouped Bar Chart: Latency Comparison of Decoding Methods Across Datasets
### Overview
The image is a grouped bar chart comparing the latency (in seconds) of seven different decoding methods across five distinct datasets. The chart visually demonstrates the performance trade-offs between a baseline method (Chain-of-Thought), two existing methods (Predictive Decoding, Phi-Decoding), and four variants of a proposed method labeled "PPCV" (Ours).
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **X-Axis (Categorical):** Represents five different evaluation datasets. From left to right: `GSM8K`, `GSMHard`, `Math500`, `SVAMP`, `ARC`.
* **Y-Axis (Numerical):** Labeled `Latency (s)`. The scale runs from 0 to approximately 43 seconds, with major tick marks at intervals of 5 seconds (0, 5, 10, 15, 20, 25, 30, 35, 40).
* **Legend:** Located in the top-right corner of the plot area. It defines seven data series by color:
1. `Chain-of-Thought` (Teal)
2. `Predictive Decoding` (Light Seafoam Green)
3. `Phi-Decoding` (Light Beige)
4. `PPCV-T₁ (Ours)` (Light Pink)
5. `PPCV-T₂ (Ours)` (Orange)
6. `PPCV-T₃ (Ours)` (Yellow-Green)
7. `PPCV-T₄ (Ours)` (Salmon Pink)
### Detailed Analysis
The latency values for each method across the datasets are approximate, derived from visual inspection of the bar heights against the y-axis.
**1. GSM8K Dataset:**
* Chain-of-Thought: ~2.0 s
* Predictive Decoding: ~15.5 s
* Phi-Decoding: ~13.0 s
* PPCV-T₁ (Ours): ~2.5 s
* PPCV-T₂ (Ours): ~4.8 s (stacked on T₁)
* PPCV-T₃ (Ours): ~5.0 s (stacked on T₂)
* PPCV-T₄ (Ours): ~17.8 s (stacked on T₃)
**2. GSMHard Dataset:**
* Chain-of-Thought: ~3.0 s
* Predictive Decoding: ~26.2 s
* Phi-Decoding: ~23.2 s
* PPCV-T₁ (Ours): ~3.0 s
* PPCV-T₂ (Ours): ~6.0 s (stacked on T₁)
* PPCV-T₃ (Ours): ~6.2 s (stacked on T₂)
* PPCV-T₄ (Ours): ~23.0 s (stacked on T₃)
**3. Math500 Dataset:**
* Chain-of-Thought: ~6.2 s
* Predictive Decoding: ~42.5 s (highest bar in the chart)
* Phi-Decoding: ~38.0 s
* PPCV-T₁ (Ours): ~2.5 s
* PPCV-T₂ (Ours): ~10.0 s (stacked on T₁)
* PPCV-T₃ (Ours): ~10.5 s (stacked on T₂)
* PPCV-T₄ (Ours): ~37.5 s (stacked on T₃)
**4. SVAMP Dataset:**
* Chain-of-Thought: ~1.8 s
* Predictive Decoding: ~14.2 s
* Phi-Decoding: ~11.2 s
* PPCV-T₁ (Ours): ~2.0 s
* PPCV-T₂ (Ours): ~4.0 s (stacked on T₁)
* PPCV-T₃ (Ours): ~4.2 s (stacked on T₂)
* PPCV-T₄ (Ours): ~16.8 s (stacked on T₃)
**5. ARC Dataset:**
* Chain-of-Thought: ~2.2 s
* Predictive Decoding: ~15.6 s
* Phi-Decoding: ~15.4 s
* PPCV-T₁ (Ours): ~1.2 s
* PPCV-T₂ (Ours): ~3.8 s (stacked on T₁)
* PPCV-T₃ (Ours): ~3.9 s (stacked on T₂)
* PPCV-T₄ (Ours): ~12.5 s (stacked on T₃)
**Note on PPCV Bars:** The bars for the four PPCV variants are **stacked** on top of each other for each dataset, forming a single composite bar. The total height of this composite bar represents the cumulative latency of the T₁ through T₄ components. The individual segment heights are listed above.
### Key Observations
1. **Consistent Baseline:** `Chain-of-Thought` consistently exhibits the lowest latency across all five datasets, ranging from ~1.8s to ~6.2s.
2. **High-Latency Methods:** `Predictive Decoding` and `Phi-Decoding` show significantly higher latency than Chain-of-Thought, with Predictive Decoding often being the slowest method (peaking at ~42.5s on Math500).
3. **PPCV Variant Performance:** The latency of the proposed `PPCV` method varies dramatically by its configuration (T₁ to T₄).
* `PPCV-T₁` is very fast, comparable to Chain-of-Thought.
* `PPCV-T₂` and `PPCV-T₃` add moderate latency.
* `PPCV-T₄` contributes the vast majority of the latency in the composite PPCV bar, making the total latency for the full PPCV stack often comparable to or exceeding Phi-Decoding and Predictive Decoding (e.g., on GSM8K, GSMHard, SVAMP).
4. **Dataset Difficulty:** The `Math500` dataset elicits the highest latency from all methods except Chain-of-Thought, suggesting it is the most computationally demanding task among those tested.
### Interpretation
This chart is a performance analysis from a research paper, likely evaluating a new decoding method called **PPCV**. The key takeaway is a **latency-accuracy trade-off**. The authors are demonstrating that their method can be configured for different operating points:
* **Low-Latency Mode (`PPCV-T₁`):** Achieves speed comparable to the simple Chain-of-Thought baseline.
* **High-Latency/High-Accuracy Mode (Full `PPCV-T₄` stack):** Incurs latency similar to or greater than existing complex methods like Predictive Decoding and Phi-Decoding. The implication is that this higher latency configuration likely yields better accuracy or reasoning quality, which would be shown in a separate accuracy chart.
The stacking of the PPCV bars visually emphasizes that the latency cost is additive across its components (T₁ through T₄). The chart effectively argues that the PPCV framework is flexible, allowing users to choose a configuration that balances speed against the desired level of performance (presumably accuracy). The outlier performance on `Math500` highlights that the computational cost of advanced decoding is highly dependent on the complexity of the problem domain.
</details>
(b)
<details>
<summary>images/throughput.png Details</summary>

### Visual Description
## Bar Chart: Throughput Comparison of Decoding Methods Across Datasets
### Overview
The image is a grouped bar chart comparing the throughput performance (in tokens per second) of four different decoding or reasoning methods across five distinct datasets. The chart clearly demonstrates the performance advantage of the method labeled "PPCV (Ours)" over the other three baseline methods.
### Components/Axes
* **Chart Type:** Grouped Bar Chart.
* **Y-Axis:**
* **Label:** "Throughput (tokens/sec)".
* **Scale:** Linear scale from 0 to 2000, with major tick marks at intervals of 250 (0, 250, 500, 750, 1000, 1250, 1500, 1750, 2000).
* **X-Axis:**
* **Categories (Datasets):** Five distinct datasets are listed from left to right: `GSM8K`, `GSMHard`, `Math500`, `SVAMP`, `ARC`.
* **Legend:**
* **Position:** Top-right corner of the chart area.
* **Items (from top to bottom):**
1. `Chain-of-Thought` (Teal color)
2. `Predictive Decoding` (Light green/mint color)
3. `Phi-Decoding` (Light beige/peach color)
4. `PPCV (Ours)` (Pink/salmon color)
### Detailed Analysis
The chart presents throughput data for each method on each dataset. Values are approximate visual estimates from the bar heights.
**1. GSM8K Dataset:**
* **Chain-of-Thought (Teal):** ~120 tokens/sec
* **Predictive Decoding (Light Green):** ~700 tokens/sec
* **Phi-Decoding (Beige):** ~500 tokens/sec
* **PPCV (Ours) (Pink):** ~1300 tokens/sec
**2. GSMHard Dataset:**
* **Chain-of-Thought (Teal):** ~125 tokens/sec
* **Predictive Decoding (Light Green):** ~600 tokens/sec
* **Phi-Decoding (Beige):** ~450 tokens/sec
* **PPCV (Ours) (Pink):** ~1700 tokens/sec
**3. Math500 Dataset:**
* **Chain-of-Thought (Teal):** ~130 tokens/sec
* **Predictive Decoding (Light Green):** ~790 tokens/sec
* **Phi-Decoding (Beige):** ~570 tokens/sec
* **PPCV (Ours) (Pink):** ~2000 tokens/sec (This is the highest value on the entire chart).
**4. SVAMP Dataset:**
* **Chain-of-Thought (Teal):** ~110 tokens/sec
* **Predictive Decoding (Light Green):** ~540 tokens/sec
* **Phi-Decoding (Beige):** ~400 tokens/sec
* **PPCV (Ours) (Pink):** ~1520 tokens/sec
**5. ARC Dataset:**
* **Chain-of-Thought (Teal):** ~125 tokens/sec
* **Predictive Decoding (Light Green):** ~760 tokens/sec
* **Phi-Decoding (Beige):** ~590 tokens/sec
* **PPCV (Ours) (Pink):** ~1500 tokens/sec
### Key Observations
1. **Dominant Performance:** The `PPCV (Ours)` method (pink bars) exhibits significantly higher throughput than all other methods across every single dataset. Its bars are consistently the tallest in each group.
2. **Performance Hierarchy:** A clear and consistent performance order is visible across all datasets: `PPCV (Ours)` > `Predictive Decoding` > `Phi-Decoding` > `Chain-of-Thought`.
3. **Baseline Performance:** `Chain-of-Thought` (teal bars) consistently shows the lowest throughput, hovering around 110-130 tokens/sec for all tasks.
4. **Peak Performance:** The highest recorded throughput is for `PPCV (Ours)` on the `Math500` dataset, reaching approximately 2000 tokens/sec.
5. **Relative Gains:** The performance gap between `PPCV (Ours)` and the next best method (`Predictive Decoding`) is substantial, often exceeding a 2x difference (e.g., on GSMHard: ~1700 vs. ~600).
### Interpretation
This chart is a performance benchmark likely from a research paper introducing the "PPCV" method. The data strongly suggests that PPCV is a highly efficient decoding or reasoning technique that dramatically increases token generation throughput compared to established methods like Chain-of-Thought, Predictive Decoding, and Phi-Decoding.
The consistent superiority across diverse datasets (GSM8K, GSMHard, Math500, SVAMP, ARC—which are common benchmarks for mathematical and reasoning tasks) indicates that PPCV's performance advantage is robust and not specific to a single type of problem. The dramatic increase in throughput, especially on the `Math500` dataset, implies that PPCV may be particularly well-suited for complex mathematical reasoning tasks where generating many tokens efficiently is crucial.
The chart's primary message is one of significant efficiency gain. By showing such a large and consistent margin of improvement, the authors are making a compelling case for the practical utility and superiority of their proposed method (PPCV) in scenarios where processing speed (throughput) is a critical metric.
</details>
(c)
Figure 7: (a) The impact of the number of sampled alternative tokens on performance. (b) Latency comparison between the baselines and our method, measured as the average inference time per question (in seconds). $T_1$, $T_2$, $T_3$, $T_4$ denote the time for paraphrased question generation, initial answer generation, the forward pass, and new rollouts from truncated trajectories, respectively. (c) Throughput comparison between the baselines and our method, measured in output tokens per second.
### 4.2 Main Results
The main results are summarized in Table 1 and Table 2. The results indicate that Self-Consistency effectively improves the reasoning performance of LLMs compared to vanilla Chain-of-Thought prompting. For example, Llama-3.1-8B-Instruct (Grattafiori et al., 2024) achieves about 3% higher accuracy with Self-Consistency than with CoT. These findings suggest that augmenting reasoning during inference through sampling is an effective way to refine reasoning trajectories. Recent decoding-based methods, such as Predictive Decoding (Ma et al., 2025) and Phi-Decoding (Xu et al., 2025), also achieve strong results. Unlike prior works that rely on carefully designed prompts to self-correct errors in intermediate steps, these two methods modify the current step by probing future steps with pre-defined reward signals. Furthermore, our experimental results demonstrate that the proposed method consistently outperforms the baselines across most tasks, spanning both mathematical and knowledge reasoning, highlighting its generalization across different reasoning settings. Notably, our method surpasses even the latest approaches such as Predictive Decoding (Ma et al., 2025) and Phi-Decoding (Xu et al., 2025). In particular, it achieves 50.00% accuracy on the Math500 dataset (Hendrycks et al., 2021), exceeding these baselines considerably. The results on competition-level datasets further demonstrate the effectiveness of our method in enhancing the reasoning ability of LLMs. These results indicate that our method can effectively extract critical tokens that play a pivotal role in the final outcome and correct the reasoning trajectory by leveraging alternative tokens. Additional results can be found in Appendix E.
### 4.3 Ablation Study
In this section, we analyze the contribution of each stage individually. All the evaluations are conducted on Llama-3.1-8B-Instruct (Grattafiori et al., 2024).
Effectiveness of extracted critical tokens. To demonstrate the effectiveness of our extracted critical tokens, we conduct an evaluation in which the critical tokens are replaced with random tokens in the first stage, while keeping the second stage unchanged. This evaluation is performed across multiple benchmark datasets, with pass@1 as the metric. The results, shown in Table 3, reveal a substantial decline in performance. These findings highlight the pivotal role of critical tokens and indicate that our method can effectively identify and extract them. More ablation study on comparison with Paraphrased Majority Voting (PMV) can be found in Appendix F.
Effectiveness of paraphrase consistency. We also evaluate the effectiveness of our proposed paraphrase consistency and compare it with traditional majority voting. While keeping the first stage unchanged, instead of using paraphrased forms to generate new reasoning steps, we simply sample multiple new steps from alternative tokens conditioned on the original question and use majority voting to determine the final answer. The results, shown in Table 4, reveal a noticeable decline in performance, highlighting the importance of paraphrased forms in improving the intermediate reasoning steps.
Impact of number of sampled alternative tokens. We investigate the influence of the number of sampled alternative tokens in the second stage by selecting values of 3, 5, 7, and 10. The results, shown in Figure 7(a), demonstrate that performance improves as the number of alternative tokens increases. This suggests that exploring more reasoning steps with additional alternative tokens during inference can be beneficial for reasoning tasks.
## 5 Computational Cost Analysis
In this section we examine the composition of the latency in our method. The latency arises from four components: Paraphrased question generation ( $T_1$ ); initial answer generation ( $T_2$ ), equivalent to vanilla CoT; a forward pass for identifying critical tokens ( $T_3$ ), which does not generate new tokens and is computationally lightweight; rollouts of truncated trajectories using alternative tokens under both the original and paraphrased questions ( $T_4$ ), which constitutes the main source of overhead.
We evaluate all components on Llama-3.1-8B-Instruct using vLLM on NVIDIA A100 GPUs, with a maximum output length of 4096 tokens per question. For our method, we use 4 paraphrased questions on the math datasets and 3 on ARC, and select the top-10 candidate tokens as alternatives. The average latency results are reported in Figure 7(b). As expected, $T_1$ scales with the number of paraphrases, $T_3$ remains minimal, and $T_4$ dominates the total cost. Specifically, $T_4$ depends on the number of top-k alternative tokens, the number of paraphrased questions, and the position of the critical token in the trajectory. Since the new rollouts from truncated trajectories for different alternative tokens and paraphrased questions are independent, $T_4$ can take advantage of vLLM’s parallelism: these rollouts can be processed concurrently, improving overall efficiency. This is reflected in the higher throughput (tokens/sec) shown in Figure 7(c). As a result, our method achieves latency comparable to baseline methods, even on challenging benchmarks such as Math500 and GSM-Hard, where the critical token tends to occur in later reasoning steps. On the GSM8K and SVAMP benchmarks, our method, like baselines such as Predictive Decoding, incurs an approximately $6$–$8\times$ latency overhead compared to vanilla Chain-of-Thought. More analysis on the trade-off between latency and performance can be found in Appendix G.
## 6 Conclusion
In this study, inspired by the beneficial impact of paraphrased forms on reasoning, we investigate the pivotal role of critical tokens in shaping the reasoning trajectory. To leverage these two factors, we propose the Paraphrastic Probing and Consistency Verification framework. Our framework consists of two stages: Paraphrastic Probing, which identifies and extracts critical tokens, and Consistency Verification, which uses paraphrased forms to generate new reasoning trajectories with alternative tokens and reach the final answer. We evaluate our framework with different LLMs, and extensive evaluations across multiple benchmarks demonstrate its promising performance.
## Impact Statement
This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
## References
- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint. Cited by: §1.
- M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, and M. Vechev (2025) MathArena: evaluating llms on uncontaminated math competitions. SRI Lab, ETH Zurich. External Links: Link Cited by: §4.1.
- M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. (2024) Graph of thoughts: solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, Vol. 38, pp. 17682–17690. Cited by: §2.
- Z. Bi, K. Han, C. Liu, Y. Tang, and Y. Wang (2025) Forest-of-thought: scaling test-time compute for enhancing llm reasoning. In Forty-second International Conference on Machine Learning, Cited by: §2.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901. Cited by: §1.
- S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al. (2023) Sparks of artificial general intelligence: early experiments with gpt-4. arXiv preprint. Cited by: §1, §2.
- S. Chen, B. Li, and D. Niu (2024a) Boosting of thoughts: trial-and-error problem solving with large language models. In The Twelfth International Conference on Learning Representations, Cited by: §1, §2.
- S. Chen and B. Li (2024) Toward adaptive reasoning in large language models with thought rollback. In Forty-first International Conference on Machine Learning, Cited by: §1.
- W. Chen, W. Wang, Z. Chu, K. Ren, Z. Zheng, and Z. Lu (2024b) Self-para-consistency: improving reasoning tasks at low cost for large language models. In Findings of the Association for Computational Linguistics ACL 2024, pp. 14162–14167. Cited by: §1, §2, §3.1, §3.3.
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: §4.1.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint. Cited by: §1, §3.1, §3.2, §4.1.
- A. Dogan and D. Birant (2019) A weighted majority voting ensemble approach for classification. In 2019 4th international conference on computer science and engineering (UBMK), pp. 1–6. Cited by: §3.2.
- S. Feng, X. Kong, S. Ma, A. Zhang, D. Yin, C. Wang, R. Pang, and Y. Yang (2025) Step-by-step reasoning for math problems via twisted sequential monte carlo. In The Thirteenth International Conference on Learning Representations, Cited by: §1.
- L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig (2023) Pal: program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. Cited by: §4.1.
- X. Gao, J. Zhang, L. Mouatadid, and K. Das (2024) SPUQ: perturbation-based uncertainty quantification for large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2336–2346. Cited by: §2.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint. Cited by: Appendix A, Appendix A, §1, §1, §3.1, §3.2, §4.1, §4.2, §4.3.
- X. Guan, L. L. Zhang, Y. Liu, N. Shang, Y. Sun, Y. Zhu, F. Yang, and M. Yang (2025) RStar-math: small llms can master math reasoning with self-evolved deep thinking. In Forty-second International Conference on Machine Learning, Cited by: §2.
- S. Hao, Y. Gu, H. Ma, J. Hong, Z. Wang, D. Z. Wang, and Z. Hu (2023) Reasoning with language model is planning with world model. In NeurIPS 2023 Workshop on Generalization in Planning, Cited by: §2.
- A. Havrilla, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, M. Zhuravinskyi, E. Hambro, and R. Raileanu (2024) GLoRe: when, where, and how to improve llm reasoning via global and local refinements. In Forty-first International Conference on Machine Learning, Cited by: §2.
- M. He, Y. Shen, W. Zhang, Z. Tan, and W. Lu (2024) Advancing process verification for large language models via tree-based preference learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 2086–2099. Cited by: §2.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Cited by: Appendix A, Appendix A, §4.1, §4.2.
- K. Huang, J. Guo, Z. Li, X. Ji, J. Ge, W. Li, Y. Guo, T. Cai, H. Yuan, R. Wang, et al. (2025) MATH-perturb: benchmarking llms’ math reasoning abilities against hard perturbations. In Forty-second International Conference on Machine Learning, Cited by: §2, §3.1.
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, and W. E. Sayed (2023) Mistral 7b. arXiv preprint arXiv:2310.06825. Cited by: §4.1.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. Advances in neural information processing systems 35, pp. 22199–22213. Cited by: §1.
- Y. Li (2025) Policy guided tree search for enhanced llm reasoning. In Forty-second International Conference on Machine Learning, Cited by: §2.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In The Twelfth International Conference on Learning Representations, Cited by: §1.
- Z. Lin, T. Liang, J. Xu, Q. Liu, X. Wang, R. Luo, C. Shi, S. Li, Y. Yang, and Z. Tu (2025) Critical tokens matter: token-level contrastive estimation enhances llm’s reasoning capability. In Forty-second International Conference on Machine Learning, Cited by: §1, §1.
- Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memisevic, and H. Su (2023) Deductive verification of chain-of-thought reasoning. Advances in Neural Information Processing Systems 36, pp. 36407–36433. Cited by: §1, §2.
- C. Ma, H. Zhao, J. Zhang, J. He, and L. Kong (2025) Non-myopic generation of language models for reasoning and planning. In The Thirteenth International Conference on Learning Representations, Cited by: §2, §2, §4.1, §4.1, §4.2.
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023) Self-refine: iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36, pp. 46534–46594. Cited by: §2.
- N. Miao, Y. W. Teh, and T. Rainforth (2024) SelfCheck: using llms to zero-shot check their own step-by-step reasoning. In The Twelfth International Conference on Learning Representations, Cited by: §1, §2.
- A. Patel, S. Bhattamishra, and N. Goyal (2021) Are nlp models really able to solve simple math word problems?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2080–2094. Cited by: §4.1.
- Z. Qi, M. Mingyuan, J. Xu, L. L. Zhang, F. Yang, and M. Yang (2025) Mutual reasoning makes smaller llms stronger problem-solver. In The Thirteenth International Conference on Learning Representations, Cited by: §2.
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint. Cited by: §2, §2.
- Z. R. Sprague, F. Yin, J. D. Rodriguez, D. Jiang, M. Wadhwa, P. Singhal, X. Zhao, X. Ye, K. Mahowald, and G. Durrett (2025) To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning. In The Thirteenth International Conference on Learning Representations, Cited by: §1.
- K. Stechly, K. Valmeekam, and S. Kambhampati (2025) On the self-verification limitations of large language models on reasoning and planning tasks. In The Thirteenth International Conference on Learning Representations, Cited by: §1.
- S. H. Tanneru, C. Agarwal, and H. Lakkaraju (2024) Quantifying uncertainty in natural language explanations of large language models. In International Conference on Artificial Intelligence and Statistics, pp. 1072–1080. Cited by: §2.
- G. Tyen, H. Mansoor, V. Cărbune, Y. P. Chen, and T. Mak (2024) LLMs cannot find reasoning errors, but can correct them given the error location. In Findings of the Association for Computational Linguistics ACL 2024, pp. 13894–13908. Cited by: §1.
- J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins (2022) Solving math word problems with process-and outcome-based feedback. arXiv preprint. Cited by: §1.
- G. Wan, Y. Wu, J. Chen, and S. Li (2025) Reasoning aware self-consistency: leveraging reasoning paths for efficient llm sampling. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3613–3635. Cited by: §2.
- P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui (2024) Math-shepherd: verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439. Cited by: §1, §1, §2.
- X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, Cited by: §1, §1, §2, §4.1, §4.1.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, pp. 24824–24837. Cited by: Appendix D, §1, §4.1.
- Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang (2025) Inference scaling laws: an empirical analysis of compute-optimal inference for llm problem-solving. In The Thirteenth International Conference on Learning Representations, Cited by: §2.
- Z. Wu, Q. Zeng, Z. Zhang, Z. Tan, C. Shen, and M. Jiang (2024) Large language models can self-correct with key condition verification. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 12846–12867. Cited by: §1, §2.
- Y. Xie, K. Kawaguchi, Y. Zhao, J. X. Zhao, M. Kan, J. He, and M. Xie (2023) Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems 36, pp. 41618–41650. Cited by: §2, §4.1.
- F. Xu, H. Yan, C. Ma, H. Zhao, J. Liu, Q. Lin, and Z. Wu (2025) Phi-decoding: adaptive foresight sampling for balanced inference-time exploration and exploitation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13214–13227. Cited by: §2, §2, §4.1, §4.1, §4.2.
- V. Yadav, Z. Tang, and V. Srinivasan (2024) Pag-llm: paraphrase and aggregate with large language models for minimizing intent classification errors. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pp. 2569–2573. Cited by: §2, §3.3.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025a) Qwen3 technical report. arXiv preprint. Cited by: §1, §4.1.
- L. Yang, Z. Yu, T. Zhang, M. Xu, J. E. Gonzalez, B. CUI, and S. YAN (2025b) SuperCorrect: advancing small llm reasoning with thought template distillation and self-correction. In The Thirteenth International Conference on Learning Representations, Cited by: §1.
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. Advances in neural information processing systems 36, pp. 11809–11822. Cited by: §2, §4.1.
- Z. Yin, Q. Sun, Q. Guo, Z. Zeng, X. Li, J. Dai, Q. Cheng, X. Huang, and X. Qiu (2024) Reasoning in flux: enhancing large language models reasoning through uncertainty-aware adaptive guidance. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2401–2416. Cited by: §2.
- L. Yuan, W. Li, H. Chen, G. Cui, N. Ding, K. Zhang, B. Zhou, Z. Liu, and H. Peng (2025) Free process rewards without process labels. In Forty-second International Conference on Machine Learning, Cited by: §1, §1, §2.
- D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. Dong, and J. Tang (2024) Rest-mcts*: llm self-training via process reward guided tree search. Advances in Neural Information Processing Systems 37, pp. 64735–64772. Cited by: §2.
- Z. Zhang, A. Zhang, M. Li, and A. Smola (2022) Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations, Cited by: §1.
- Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba (2022) Large language models are human-level prompt engineers. In The eleventh international conference on learning representations, Cited by: Appendix B, §3.1.
- Y. Zhou, Y. Zhu, D. Antognini, Y. Kim, and Y. Zhang (2024) Paraphrase and solve: exploring and exploiting the impact of surface form on mathematical reasoning in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2793–2804. Cited by: §1, §2, §3.1, §3.3.
## Appendix A Empirical Study
<details>
<summary>images/condition_one_math.png Details</summary>

### Visual Description
## Bar Chart: Token Type vs. Average Accuracy Fraction
### Overview
This is a grouped bar chart with error bars, comparing the fraction (percentage) of two token types—"critical tokens" and "random tokens"—across two categories of average accuracy. The chart visually demonstrates a strong inverse relationship between token type and accuracy category.
### Components/Axes
* **Chart Type:** Grouped bar chart with error bars.
* **X-Axis:** Labeled **"Average accuracy(%)"**. It contains two categorical groups:
1. **"≤ 5%"** (Less than or equal to 5 percent)
2. **"> 5%"** (Greater than 5 percent)
* **Y-Axis:** Labeled **"Fraction(%)"**. The scale runs from 0 to 80, with major tick marks at 0, 20, 40, 60, and 80.
* **Legend:** Positioned at the top-center of the chart area. It defines two data series:
* **"critical tokens"** (represented by a teal/darker cyan bar)
* **"random tokens"** (represented by a light teal/pale cyan bar)
* **Error Bars:** Each bar has a black, vertical error bar (I-beam style) indicating variability or confidence intervals around the measured fraction.
### Detailed Analysis
**1. Category: ≤ 5% Average Accuracy**
* **Critical Tokens (Teal Bar):** This is the tallest bar in the chart. Its top aligns approximately with the 85% mark on the y-axis. The error bar extends from roughly 82% to 88%.
* **Random Tokens (Light Teal Bar):** This bar is significantly shorter. Its top is near the 45% mark. The error bar spans approximately from 42% to 48%.
**2. Category: > 5% Average Accuracy**
* **Critical Tokens (Teal Bar):** This is the shortest bar. Its top is near the 12% mark. The error bar ranges from about 9% to 15%.
* **Random Tokens (Light Teal Bar):** This bar is taller than its counterpart in the first category. Its top is near the 55% mark. The error bar extends from approximately 52% to 58%.
### Key Observations
* **Dominant Inversion:** The relationship between token type and accuracy category is inverted. "Critical tokens" have a very high fraction (~85%) in the low-accuracy (≤5%) group but a very low fraction (~12%) in the higher-accuracy (>5%) group.
* **Opposite Trend for Random Tokens:** "Random tokens" show the opposite pattern, with a moderate fraction (~45%) in the low-accuracy group and a higher fraction (~55%) in the higher-accuracy group.
* **Magnitude of Difference:** The disparity between token types is most extreme in the "≤ 5%" category, where critical tokens are nearly double the fraction of random tokens. In the "> 5%" category, random tokens are more than four times the fraction of critical tokens.
* **Error Bar Consistency:** The error bars are relatively small compared to the differences between the bars, suggesting the observed trends are statistically meaningful.
### Interpretation
The data suggests a strong correlation between the type of token present and the resulting model accuracy. **Critical tokens are disproportionately associated with low-accuracy outcomes (≤5%).** Conversely, **random tokens are more frequently associated with higher-accuracy outcomes (>5%).**
This pattern implies that the presence or influence of "critical tokens" may be a key factor in model failure or low-performance scenarios. They could represent problematic, ambiguous, or highly influential tokens that, when present, significantly increase the chance of an incorrect prediction. The higher fraction of random tokens in the >5% accuracy group might indicate that when the model's output is more accurate, it is less reliant on these specific "critical" tokens, or that random tokens are a feature of more stable, correct predictions.
From a diagnostic or improvement perspective, this chart highlights "critical tokens" as a primary target for investigation. Understanding why these tokens correlate so strongly with low accuracy could lead to strategies for improving model robustness, such as targeted data augmentation, specialized preprocessing, or architectural adjustments to better handle these token types.
</details>
(a)
<details>
<summary>images/condition_two_math.png Details</summary>

### Visual Description
## Bar Chart: Token Type Distribution by Average Accuracy
### Overview
This is a grouped bar chart with error bars, comparing the fractional distribution of two token types—"critical tokens" and "random tokens"—across two categories of average accuracy. The chart visually demonstrates how the prevalence of each token type changes depending on whether the average accuracy is low (≤ 10%) or high (> 10%).
### Components/Axes
* **Chart Type:** Grouped bar chart with error bars.
* **X-Axis (Horizontal):** Labeled **"Average accuracy(%)"**. It has two categorical groups:
1. **"≤ 10%"** (Left group)
2. **"> 10%"** (Right group)
* **Y-Axis (Vertical):** Labeled **"Fraction(%)"**. The scale runs from 0 to 80, with major tick marks at 0, 20, 40, 60, and 80.
* **Legend:** Located in the top-center area of the chart.
* **Teal square:** Labeled **"critical tokens"**.
* **Light teal square:** Labeled **"random tokens"**.
* **Data Series:** Two series are plotted for each x-axis category.
* **Series 1 (Teal):** Represents "critical tokens".
* **Series 2 (Light Teal):** Represents "random tokens".
* **Error Bars:** Each bar has a black, vertical error bar (I-beam style) extending from its top, indicating variability or confidence intervals around the measured fraction.
### Detailed Analysis
**Category 1: Average accuracy ≤ 10%**
* **Critical Tokens (Teal Bar):** This is the tallest bar in the chart. Its top aligns with approximately **88%** on the y-axis. The error bar extends roughly from 86% to 90%.
* **Random Tokens (Light Teal Bar):** This bar is significantly shorter. Its top aligns with approximately **56%** on the y-axis. The error bar extends roughly from 54% to 58%.
**Category 2: Average accuracy > 10%**
* **Critical Tokens (Teal Bar):** This is the shortest bar in the chart. Its top aligns with approximately **11%** on the y-axis. The error bar extends roughly from 9% to 13%.
* **Random Tokens (Light Teal Bar):** This bar is taller than its counterpart in the same group. Its top aligns with approximately **44%** on the y-axis. The error bar extends roughly from 42% to 46%.
### Key Observations
1. **Inverse Relationship:** There is a stark inverse relationship between the two token types across the accuracy categories. "Critical tokens" dominate the low-accuracy (≤ 10%) group but are minimal in the high-accuracy (> 10%) group. The opposite is true for "random tokens."
2. **Magnitude of Difference:** The gap between critical and random tokens is comparable in size in both groups (~32 percentage points in the low-accuracy group, ~33 in the high-accuracy group) but runs in opposite directions.
3. **Error Bar Consistency:** The error bars are relatively small compared to the bar heights, suggesting the measured fractions have moderate precision. The error ranges for the two token types within each category do not overlap, indicating the differences are likely statistically significant.
### Interpretation
The data suggests a strong correlation between token type and model accuracy performance. "Critical tokens" are overwhelmingly associated with scenarios where the average accuracy is very low (≤ 10%). This could imply that when a model is performing poorly, its outputs or internal states are heavily influenced by these "critical" tokens. Conversely, in higher accuracy regimes (> 10%), the presence of "critical tokens" drops dramatically, while "random tokens" become more prevalent.
This pattern might indicate that "critical tokens" are markers of failure modes or specific types of errors, whereas "random tokens" are more characteristic of a model's baseline or successful operation. The chart effectively argues that analyzing the composition of tokens (critical vs. random) can serve as a diagnostic tool for understanding model accuracy. The clear separation between the groups underscores that these two token categories behave fundamentally differently across the accuracy spectrum.
</details>
(b)
<details>
<summary>images/compare_ct_rnd_math.png Details</summary>

### Visual Description
## Line Chart: pass@k(%) vs. number of sample k
### Overview
The image is a line chart with error bars, plotting the performance metric "pass@k(%)" against the "number of sample k". It compares two distinct methods or conditions: "critical tokens" and "random tokens". The chart demonstrates how the pass rate changes as the number of samples (k) increases for each condition.
### Components/Axes
* **Chart Type:** Line chart with error bars.
* **X-Axis:**
* **Label:** `number of sample k`
* **Scale:** Linear scale from approximately 0 to 50.
* **Major Tick Marks:** 10, 20, 30, 40.
* **Y-Axis:**
* **Label:** `pass@k(%)`
* **Scale:** Linear scale from approximately 25 to 95.
* **Major Tick Marks:** 30, 40, 50, 60, 70, 80, 90.
* **Legend:**
* **Position:** Centered within the plot area, slightly to the right.
* **Entry 1:** Red line with upward-pointing triangle markers, labeled `critical tokens`.
* **Entry 2:** Purple (magenta) line with star markers, labeled `random tokens`.
* **Grid:** Dashed gray grid lines are present for both major x and y ticks.
### Detailed Analysis
**Data Series 1: critical tokens (Red line, triangle markers)**
* **Trend:** The line shows a steep, concave-down increase. The rate of improvement in pass@k is highest for small k and gradually diminishes as k increases.
* **Data Points (Approximate with Error Bar Ranges):**
* k ≈ 5: pass@k ≈ 55% (Error bar range: ~53% to ~57%)
* k ≈ 8: pass@k ≈ 67% (Error bar range: ~65% to ~70%)
* k ≈ 15: pass@k ≈ 75% (Error bar range: ~72% to ~78%)
* k ≈ 23: pass@k ≈ 79% (Error bar range: ~76% to ~83%)
* k ≈ 31: pass@k ≈ 83% (Error bar range: ~80% to ~87%)
* k ≈ 47: pass@k ≈ 87% (Error bar range: ~83% to ~91%)
**Data Series 2: random tokens (Purple line, star markers)**
* **Trend:** The line shows a steady, nearly linear increase. The slope is positive but significantly shallower than the "critical tokens" series.
* **Data Points (Approximate with Error Bar Ranges):**
* k ≈ 5: pass@k ≈ 29% (Error bar range: ~26% to ~32%)
* k ≈ 8: pass@k ≈ 34% (Error bar range: ~30% to ~38%)
* k ≈ 15: pass@k ≈ 39% (Error bar range: ~35% to ~43%)
* k ≈ 23: pass@k ≈ 42% (Error bar range: ~38% to ~47%)
* k ≈ 31: pass@k ≈ 44% (Error bar range: ~39% to ~49%)
* k ≈ 47: pass@k ≈ 47% (Error bar range: ~41% to ~53%)
### Key Observations
1. **Performance Gap:** There is a substantial and consistent performance gap between the two methods. "Critical tokens" achieves a pass@k rate approximately 25-40 percentage points higher than "random tokens" across all measured values of k.
2. **Diminishing Returns:** The "critical tokens" series exhibits clear diminishing returns. The gain from k=5 to k=8 (~12%) is much larger than the gain from k=31 to k=47 (~4%). The "random tokens" series shows more constant, linear returns.
3. **Error Bar Magnitude:** The error bars (representing uncertainty or variance) for both series appear to increase slightly in absolute terms as k increases. The relative uncertainty (error bar size compared to the mean value) seems more stable.
4. **Convergence:** The two lines are not converging. The absolute difference in pass@k between them remains large even at the highest k value shown (k≈47).
### Interpretation
This chart likely comes from a machine learning or natural language processing context, evaluating a model's ability to generate correct outputs (pass@k) given a certain number of sampling attempts (k). The "critical tokens" method appears to be a targeted or informed strategy for guiding generation, while "random tokens" represents a baseline or unguided approach.
The data strongly suggests that **strategically selecting or influencing "critical tokens" is a far more effective strategy for improving model performance than relying on random sampling.** The steep initial rise for critical tokens indicates that even a small number of guided samples yields a high probability of success. The persistent gap shows that the advantage of the guided method does not diminish with more attempts; it maintains a superior efficiency. The linear trend for random tokens is characteristic of a baseline where each additional sample provides a roughly equal, independent chance of success. The chart effectively argues for the value of the "critical tokens" intervention.
</details>
(c)
<details>
<summary>images/density_math.png Details</summary>

### Visual Description
## Histogram: Consistency Score Density by Answer Correctness
### Overview
The image displays a histogram comparing the density distribution of "consistency scores" for two groups: one with incorrect answers and one with correct answers. The chart uses overlapping, semi-transparent bars to show the frequency (density) of scores across a range from 0 to 5.
### Components/Axes
* **Chart Type:** Histogram (overlapping bar chart).
* **X-Axis:**
* **Label:** `consistency score`
* **Scale:** Linear, ranging from 0 to 5.
* **Markers:** Major ticks at integer values: 0, 1, 2, 3, 4, 5.
* **Y-Axis:**
* **Label:** `density(%)`
* **Scale:** Linear, ranging from 0.0 to 0.7 (representing 0% to 70%).
* **Markers:** Major ticks at 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7.
* **Legend:**
* **Position:** Top-right corner, inside the plot area.
* **Entries:**
1. `w incorrect answers` (Light teal/cyan color)
2. `w correct answers` (Light pink/salmon color)
### Detailed Analysis
The histogram presents two distinct distributions:
**1. Distribution for "w incorrect answers" (Teal Bars):**
* **Trend:** Shows a steep, monotonic decline from left to right.
* **Data Points (Approximate Density):**
* Consistency Score 0: ~0.73 (73%)
* Consistency Score 1: ~0.16 (16%)
* Consistency Score 2: ~0.05 (5%)
* Consistency Score 3: ~0.03 (3%)
* Consistency Score 4: ~0.02 (2%)
* Consistency Score 5: ~0.02 (2%)
* **Observation:** The vast majority of instances with incorrect answers have a very low consistency score (0), with density dropping off sharply as the score increases.
**2. Distribution for "w correct answers" (Pink Bars):**
* **Trend:** Shows a steady, monotonic increase from left to right.
* **Data Points (Approximate Density):**
* Consistency Score 0: ~0.10 (10%)
* Consistency Score 1: ~0.12 (12%)
* Consistency Score 2: ~0.13 (13%)
* Consistency Score 3: ~0.15 (15%)
* Consistency Score 4: ~0.50 (50%)
* Consistency Score 5: ~0.50 (50%)
* **Observation:** Instances with correct answers are concentrated at higher consistency scores, with the highest density at scores 4 and 5.
**Overlap Region:**
* The bars overlap significantly at lower scores (0-2), where both distributions have non-zero density. The combined color in these regions is a muted brownish-pink.
### Key Observations
1. **Inverse Relationship:** The two distributions are nearly mirror images. High density for incorrect answers corresponds to low consistency scores, while high density for correct answers corresponds to high consistency scores.
2. **Polarization:** The data is highly polarized. The "incorrect" group is overwhelmingly clustered at score 0, and the "correct" group is heavily clustered at scores 4 and 5.
3. **Minimal Middle Ground:** There is relatively low density for both groups in the middle range of consistency scores (2 and 3).
### Interpretation
This histogram demonstrates a strong, positive correlation between a higher "consistency score" and the likelihood of providing a correct answer. The data suggests that the "consistency score" is a highly effective metric for distinguishing between correct and incorrect responses in this context.
* **What it means:** A low consistency score (especially 0) is a very strong indicator of an incorrect answer. Conversely, a high consistency score (4 or 5) is a strong indicator of a correct answer.
* **Why it matters:** This relationship validates the "consistency score" as a meaningful measure. It could be used as a proxy for confidence, understanding, or reliability in the system being measured. The stark separation between the groups implies the scoring mechanism is well-calibrated for this task.
* **Underlying Pattern:** The pattern suggests a binary-like outcome: responses tend to be either highly consistent (and correct) or highly inconsistent (and incorrect), with few ambiguous cases in the middle. This could indicate a task where understanding is clear-cut, or where the consistency metric captures a fundamental aspect of the correct solution process.
</details>
(d)
Figure 8: (a) Comparison of the fraction of tokens whose associated rollout correctness falls below or above 5%, for tokens extracted by our method versus randomly selected tokens. (b) Comparison of the fraction of the subsequent tokens whose associated rollout correctness falls below or above 10%, for tokens extracted by our method versus randomly selected tokens. (c) Comparison of the impact of extracted tokens by our method versus random tokens on LLM reasoning performance. (d) Comparison of the density distributions of consistency scores for rollouts with correct and incorrect answers. All the evaluations are conducted on Math500 dataset.
We follow the previous pipeline to conduct a quantitative analysis examining the authenticity and impact of tokens extracted via paraphrastic probing, comparing them against randomly selected tokens. Specifically, we sample 100 instances with incorrect reasoning trajectories from the Math500 (Hendrycks et al., 2021) training set. Following the paraphrastic probing pipeline, we identify candidate critical tokens in the early reasoning steps. In each run, we randomly sample 40% of the candidate critical tokens for evaluation and repeat the experiment independently 10 times. For comparison, we apply the same procedure to randomly selected tokens. All evaluations use Llama-3.1-8B-Instruct (Grattafiori et al., 2024). We first compute the average accuracy of rollouts generated from partial trajectories truncated at the positions of the extracted tokens, and compare these results with those obtained from random-token truncation. As shown in Figure 8(a), we again observe that a large proportion of the extracted tokens exhibit near-zero correctness. We further evaluate the average accuracy of rollouts initiated from subsequent tokens (5–10 tokens later). The results in Figure 8(b) likewise indicate that errors persist beyond the identified positions. Finally, we substitute the extracted critical tokens with alternative tokens and roll out multiple new reasoning trajectories. As shown in Figure 8(c), replacing critical tokens leads to a significantly larger improvement in reasoning accuracy than substituting random tokens.
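The truncation-and-rollout evaluation above can be sketched as follows. The helper names and `toy_rollout` are illustrative stand-ins for real LLM rollouts, not the paper's implementation; the toy deterministically fails to recover at or after an injected error position, mimicking the near-zero correctness seen for critical tokens.

```python
def average_rollout_accuracy(truncate_positions, rollout_fn, n_rollouts=16):
    """For each candidate token position, truncate the trajectory there,
    sample n_rollouts continuations, and record the fraction that reach
    the correct final answer (rollout_fn returns 1 if correct, else 0)."""
    return {
        pos: sum(rollout_fn(pos) for _ in range(n_rollouts)) / n_rollouts
        for pos in truncate_positions
    }

def toy_rollout(pos, error_pos=12):
    # Deterministic stand-in: continuations that begin at or after the
    # erroneous token never recover the correct answer.
    return 0 if pos >= error_pos else 1

acc = average_rollout_accuracy([5, 12, 15], toy_rollout)
# acc == {5: 1.0, 12: 0.0, 15: 0.0}
```

A token whose rollout accuracy stays near zero both at its own position and several tokens downstream is exactly the pattern Figures 8(a) and 8(b) report for extracted critical tokens.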
To justify our paraphrase consistency mechanism, we follow the same procedure on the Math500 dataset (Hendrycks et al., 2021). The evaluation is conducted with Llama-3.1-8B-Instruct (Grattafiori et al., 2024). We then analyze the distribution of consistency scores for rollouts that yield correct versus incorrect answers. As shown in Figure 8(d), the majority of rollouts with correct answers achieve a consistency score of at least 1, whereas this proportion drops sharply for rollouts with incorrect answers.
## Appendix B Automatic Prompt Engineering for Paraphrasing
The quality of paraphrased questions is crucial for our framework, as it directly affects both the identification of critical tokens and the stability of paraphrase-based consistency verification. To ensure high-quality paraphrasing, we adopt Automatic Prompt Engineering (APE) (Zhou et al., 2022) to optimize the paraphrasing instruction. This allows us to control the quality of paraphrases systematically rather than relying solely on manually written prompts. The APE procedure we employ consists of four steps:
- We create a small set of original questions paired with valid paraphrased variants. Using this exemplar set, the model generates multiple candidate instructions that could produce the paraphrased outputs from the original questions.
- Feed each candidate instruction to the language model to generate paraphrases of the original problems, and measure the mean change in solve rate before and after paraphrasing.
- Choose the instruction that maximizes the mean solve rate change.
- Repeat the previous steps multiple times.
In our experiments, we initialize the paraphrasing instruction and iteratively refine it using APE with samples drawn from the GSM8K and Math500 training sets. With this approach, the refined paraphrasing prompt helps us reliably produce high-quality paraphrased questions.
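The four APE steps above can be sketched as a simple search loop. In this sketch, `propose`, `paraphrase`, and `solve_rate` are hypothetical callables standing in for the LLM components described in the text; the selection criterion (maximizing the mean solve-rate change) follows the steps above.

```python
def ape_refine(propose, paraphrase, solve_rate, questions,
               n_rounds=3, n_candidates=4):
    """Sketch of the APE loop described above.

    propose(k)           -> list of k candidate paraphrasing instructions
    paraphrase(instr, q) -> paraphrased question under instruction `instr`
    solve_rate(q)        -> downstream model's solve rate on question `q`
    """
    best_instr, best_delta = None, float("-inf")
    for _ in range(n_rounds):
        for instr in propose(n_candidates):
            # Mean change in solve rate before vs. after paraphrasing.
            delta = sum(solve_rate(paraphrase(instr, q)) - solve_rate(q)
                        for q in questions) / len(questions)
            if delta > best_delta:
                best_instr, best_delta = instr, delta
    return best_instr, best_delta
```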
Algorithm 1 PPCV Framework
Input: LLM; original question $q_0$; number of paraphrased questions $N$; number of sampled alternative tokens $K$; temperature-scaling coefficient $\lambda$.
Output: Final answer $ans_f$.
(Paraphrasing) Generate paraphrased variants $q_1,q_2,\dots,q_N$ from $q_0$ using the LLM.
(Initial reasoning) Obtain the initial reasoning path $r_0^{q_0}$ for $q_0$.
(Token influence analysis) For each paraphrased question $q_n$, compute the token distribution $P^{q_n}$ by concatenating $q_n$ with $r_0^{q_0}$ (Eq. 1).
(Candidate selection) Identify candidate critical-token positions based on the mismatch between the top-1 predicted tokens and the expected tokens in $q_0$ (Eq. 2).
(Critical token verification) Select the final critical token $a_c$ using the criteria in Eqs. 3–5.
(Alternative token sampling) Sample $K$ alternative tokens $a_c^0,a_c^1,\dots,a_c^{K-1}$ (including $a_c^0=a_c$) using top-$K$ sampling on $q_0$.
(Truncated rollouts) Truncate the initial reasoning path $r_0^{q_0}$ at the position of $a_c$ and form synthetic inputs $\tilde{r}_c^0,\tilde{r}_c^1,\dots,\tilde{r}_c^{K-1}$ by appending each alternative token.
for each synthetic input $\tilde{r}_c^k$, $k=0,\dots,K-1$ do
Generate rollouts $r_k^{q_0},r_k^{q_1},\dots,r_k^{q_N}$ for the original and paraphrased questions using the LLM.
end for
(Consistency verification) Compute the final answer $ans_f$ using (similarity-weighted) paraphrase consistency (Eq. 6).
## Appendix C Algorithm
The complete algorithm of our proposed PPCV framework is illustrated in Algo. 1.
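The final consistency-verification step of the algorithm can be sketched as follows. This is a toy version in which `rollout_answers[k][n]` is the final answer extracted from the rollout for question variant $n$ ($n=0$ is the original) under alternative token $k$; the optional `similarity` weights stand in for the similarity weighting of Eq. 6, and default to uniform weights.

```python
from collections import defaultdict

def consistency_verify(rollout_answers, similarity=None):
    """Sketch of consistency verification: score each distinct answer by the
    (optionally weighted) number of rollouts across question variants that
    produce it, and return the highest-scoring answer."""
    scores = defaultdict(float)
    for answers in rollout_answers:
        weights = similarity or [1.0] * len(answers)
        for w, ans in zip(weights, answers):
            scores[ans] += w
    return max(scores, key=scores.get)
```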
## Appendix D Discussion
<details>
<summary>images/critical_token_on_correct.png Details</summary>

### Visual Description
## Textual Comparison: Case Study on Identification of Critical Tokens via Paraphrastic Probing
### Overview
The image displays a side-by-side textual comparison used as a case study. It demonstrates how a mathematical word problem is solved by an AI system, with the core difference being the paraphrasing of a key term ("meters" vs. "distances") in the initial response. The layout is structured in two parallel columns under a common header, showing identical questions and final answers but with differently phrased reasoning steps.
### Components
1. **Header Bar**: A teal-colored bar at the top spanning the full width. It contains the title text in white.
2. **Left Column**: Contains the original problem and solution path.
3. **Right Column**: Contains the same problem but with a paraphrased term in the response.
4. **Text Styling**: Key terms are highlighted in color within the "Response" lines. A checkmark symbol (✓) appears at the end of each solution.
### Content Details
**Header Text:**
`Case study on identification of critical tokens via Paraphrastic Probing`
**Left Column Content:**
* **Question:** `James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week? (ground truth answer: 540)`
* **Response:** `To find the total **meters** James runs in a week, we need to calculate the total meters he runs in one sprint session and then multiply it by the number of sessions he has in a week.` (The word "meters" is highlighted in **red**).
* **Step 1:** `Calculate the total meters James runs in one sprint session. James runs 3 sprints of 60 meters each. Total meters in one sprint session = 3 * 60 = 180 meters.`
* **Step 2:** `Calculate the total meters...` (The text is truncated with an ellipsis).
* **Conclusion:** `Therefore, James runs a total of 540 meters a week`
* **Final Answer:** `The answer is 540. ✓`
**Right Column Content:**
* **Question:** `James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week? (ground truth answer: 540)` (Identical to the left column).
* **Response:** `To find the total **distances** James runs in a week, we need to calculate the total distance he runs in one sprint session and then multiply it by the number of sessions he has in a week.` (The word "distances" is highlighted in **purple**).
* **Step 1:** `Calculate the total distance James runs in one sprint session. Distance per sprint = 60 meters. Number of sprints per session = 3. Total distance per session = Distance per sprint * Number of sprints per session. Total distance per session = 60 * 3. Total distance per session = 180 meters.`
* **Step 2:** `Calculate the total distance James runs in a week.`
* **Conclusion:** `Therefore, James runs a total of 540 meters in a week. ✓`
### Key Observations
1. **Identical Problem & Answer:** Both columns present the exact same word problem and arrive at the identical, correct numerical answer of 540 meters.
2. **Paraphrased Core Token:** The critical difference is in the first sentence of the "Response." The left column uses the term "**meters**" (highlighted in red), which directly matches a unit in the problem. The right column paraphrases this to "**distances**" (highlighted in purple), a more general term.
3. **Divergent Reasoning Paths:** While the final calculation is the same (3 sprints * 60 meters * 3 sessions), the explanatory steps are phrased differently. The right column's reasoning is more verbose and explicitly breaks down the formula (`Distance per sprint * Number of sprints per session`).
4. **Layout for Comparison:** The side-by-side presentation is designed to isolate and highlight the effect of substituting a single, semantically related token ("meters" -> "distances") on the model's generated reasoning process.
### Interpretation
This case study visually demonstrates the concept of **Paraphrastic Probing** in the context of AI model analysis. The goal is to identify which tokens (words) in a prompt are "critical" for the model's reasoning process.
* **What it demonstrates:** By showing that the model can correctly solve the problem even when a key term ("meters") is paraphrased to a synonym or related concept ("distances"), the study suggests the model's understanding is not rigidly tied to specific lexical tokens. It can map different surface forms to the same underlying concept needed for calculation.
* **Why it matters:** This technique is used to test the robustness and generalization capabilities of language models. If a model fails when a non-critical word is changed but succeeds when a paraphrased critical token is used, it indicates a deeper, more semantic understanding of the problem structure. The highlighted colors (red/purple) serve to draw the viewer's attention directly to the manipulated variable in this experiment.
* **Underlying Pattern:** The image is not presenting raw data but a curated example of an experimental method. The "data" here is the qualitative comparison of two text outputs. The key finding is the **invariance of the correct solution** to the paraphrasing of a core problem-solving token, which is the central point of the case study.
</details>
Figure 9: Effect of the critical token on an initial reasoning path that yields the correct answer. Critical tokens are highlighted in bold red, and alternative tokens in bold purple.
We have shown that replacing critical tokens can correct intermediate reasoning paths and lead to the correct answer. In this section, we examine how our method influences reasoning paths that are already correct. First, we conduct a case study on an instance with a correct answer, where we follow our pipeline to identify the critical token and roll out new reasoning paths using alternative tokens. As illustrated in Figure 9, the new rollouts also yield the correct answer. Second, our pipeline incorporates both the initial reasoning path $r_0^{q_0}$ and its paraphrased variants $r_0^{q_n}$ for evaluation. The robustness of correct rollouts across paraphrased forms ensures high consistency scores, allowing them to stand out as the final answer. These findings suggest that our pipeline preserves the performance of LLMs on problems that can already be solved correctly by CoT (Wei et al., 2022).
Table 5: Comparison between our method and baselines with DeepSeek-R1-Distill-Llama-70B model on mathematical reasoning benchmarks using pass@1 accuracy.
| Method |  |  |  |  |
| --- | --- | --- | --- | --- |
| Chain-of-Thought | 56.67 | 38.00 | 43.33 | 30.00 |
| Predictive Decoding | 60.00 | 40.66 | 44.66 | 30.66 |
| Phi-Decoding | 64.00 | 46.67 | 48.00 | 31.33 |
| PPCV (Ours) | 70.00 | 56.66 | 56.66 | 33.33 |
Table 6: Comparison of our method and the baseline approach on Llama-3.1-8B-Instruct, evaluated using pass@k (k=4).
| Method |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- |
| Phi-Decoding | 92.15 | 53.57 | 52.60 | 91.19 | 90.73 |
| PPCV (Ours) | 93.83 | 61.41 | 59.39 | 94.48 | 94.24 |
Table 7: Comparison of our method and the baseline approach on Qwen3-32B, evaluated using pass@k (k=4).
| Method |  |  |  |  |
| --- | --- | --- | --- | --- |
| Phi-Decoding | 41.61 | 30.19 | 45.09 | 13.11 |
| PPCV (Ours) | 49.71 | 34.28 | 51.42 | 19.28 |
Table 8: Performance comparison between Paraphrased Majority Voting (PMV) and our proposed PPCV using pass@1 accuracy.
| Method |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- |
| Chain-of-Thought | 77.40 | 28.00 | 31.00 | 83.00 | 58.91 |
| PMV | 83.55 | 33.16 | 40.60 | 87.70 | 61.63 |
| PPCV (Ours) | 88.24 | 49.73 | 50.00 | 89.60 | 88.31 |
## Appendix E Main Results
To further assess the effectiveness of our method, we evaluate it on a larger reasoning model, DeepSeek-R1-Distill-Llama-70B. We apply our full pipeline and compare against all baselines across multiple mathematical reasoning benchmarks. The results, presented in Table 5, show that our method consistently outperforms the baselines. These improvements demonstrate that our approach remains effective for stronger reasoning models and generalizes well beyond smaller models.
We also conduct additional experiments to report pass@k performance for both our method and the baselines. In these experiments, we use Phi-Decoding as the representative baseline and evaluate two models: Llama-3.1-8B-Instruct and Qwen3-32B. We set k = 4 and generate 12 samples per question to obtain stable estimates of pass@4. The results across multiple benchmarks are presented in Table 6 and Table 7, respectively. Consistent with our main findings, our method achieves higher pass@k scores than the baseline, indicating that paraphrastic critical-token refinement continues to provide benefits in a multi-sample setting. These results further validate the robustness of our approach under stochastic sampling and demonstrate improved hit rates when multiple outputs are available.
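For reference, pass@k from n samples per question (here n = 12, k = 4) is typically computed with the standard unbiased estimator of Chen et al. (2021); the text does not spell out the estimator, so the sketch below assumes that widely used formulation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k draws
    (without replacement) from n samples, c of which are correct, is correct."""
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-benchmark score is then the mean of `pass_at_k` over all questions.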
## Appendix F Ablation Study
A comparison against a Paraphrased Majority Voting (PMV) strategy is essential to demonstrate that the performance gains of our method derive from critical-token correction rather than simply from having multiple paraphrased samples. We therefore adopt PMV as a baseline. For a fair comparison, PMV uses the same number of paraphrased questions and the same total number of trajectories as PPCV, and reaches the final answer by standard majority voting over the rollouts from the paraphrased questions. Specifically, we use four variants of each question, with 10 rollouts per variant. The results in Table 8 show that PMV performs significantly worse than our proposed PPCV method, confirming that simply rolling out trajectories from paraphrased questions and aggregating their answers is insufficient. The substantial performance gap between PMV and PPCV highlights the necessity and effectiveness of our extracted critical tokens and the subsequent corrective reasoning step.
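The PMV baseline described above amounts to pooling all rollout answers across the paraphrased variants and taking a plain majority vote, which can be sketched as:

```python
from collections import Counter

def paraphrased_majority_vote(answers_by_variant):
    """PMV baseline: pool the final answers of all rollouts from all
    paraphrased question variants and return the most frequent one."""
    pooled = [ans for answers in answers_by_variant for ans in answers]
    return Counter(pooled).most_common(1)[0][0]
```

Unlike PPCV, this aggregation never inspects or replaces critical tokens, which is exactly the difference the ablation isolates.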
<details>
<summary>images/latency_trade_off.png Details</summary>

### Visual Description
## Dual-Axis Bar and Line Chart: Latency vs. Accuracy for Top-N Selection
### Overview
This image displays a dual-axis combination chart comparing two performance metrics—Latency (in seconds) and Accuracy (in percentage)—across four different "Top-N" selection categories (Top-3, Top-5, Top-7, Top-10). The chart uses light blue bars to represent latency and a red line with markers to represent accuracy. Both data series include error bars, indicating variability or confidence intervals.
### Components/Axes
* **X-Axis (Categorical):** Labeled with four categories: "Top-3", "Top-5", "Top-7", "Top-10". These are positioned along the bottom of the chart.
* **Primary Y-Axis (Left, for Latency):** Labeled "Latency (s)". The scale runs from 0.0 to approximately 18.0, with major tick marks at 0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0, and 17.5.
* **Secondary Y-Axis (Right, for Accuracy):** Labeled "Accuracy (%)". The scale runs from 70 to 100, with major tick marks at 70, 75, 80, 85, 90, 95, and 100.
* **Data Series 1 (Bars):** Light blue vertical bars represent the Latency value for each category. Each bar has a black error bar extending vertically from its top.
* **Data Series 2 (Line):** A solid red line connects circular red markers, representing the Accuracy value for each category. Each marker has a black error bar extending vertically through it.
* **Legend:** There is no explicit legend box present in the chart. The association of the blue bars with the left "Latency (s)" axis and the red line with the right "Accuracy (%)" axis is inferred from the axis labels and standard charting conventions.
### Detailed Analysis
**Trend Verification:**
* **Latency (Blue Bars):** The height of the bars shows a clear, consistent upward trend from left to right. Latency increases as the "N" in "Top-N" increases.
* **Accuracy (Red Line):** The red line also shows a consistent upward trend from left to right, though the slope appears slightly less steep than the latency increase. Accuracy improves as "N" increases.
**Data Point Extraction (Approximate Values):**
* **Top-3:**
* Latency: ~14.0 seconds (Bar height). Error bar spans approximately 13.5 to 14.5 seconds.
* Accuracy: ~87% (Red marker). Error bar spans approximately 86% to 88%.
* **Top-5:**
* Latency: ~15.8 seconds. Error bar spans approximately 15.3 to 16.3 seconds.
* Accuracy: ~88.5%. Error bar spans approximately 87.5% to 89.5%.
* **Top-7:**
* Latency: ~17.0 seconds. Error bar spans approximately 16.5 to 17.5 seconds.
* Accuracy: ~89%. Error bar spans approximately 88% to 90%.
* **Top-10:**
* Latency: ~18.0 seconds. Error bar spans approximately 17.5 to 18.5 seconds.
* Accuracy: ~89.5%. Error bar spans approximately 88.5% to 90.5%.
### Key Observations
1. **Positive Correlation:** Both latency and accuracy increase monotonically with the "Top-N" parameter. Considering more candidates (higher N) leads to better accuracy but at the cost of higher processing time.
2. **Diminishing Returns on Accuracy:** The improvement in accuracy appears to slow down as N increases. The jump from Top-3 to Top-5 (~1.5%) is larger than the jump from Top-7 to Top-10 (~0.5%).
3. **Linear Latency Increase:** The increase in latency appears relatively linear across the categories, suggesting a roughly constant additional processing cost per added candidate.
4. **Error Bar Consistency:** The size of the error bars (variability) for both metrics appears relatively consistent across all categories, indicating stable measurement variance.
### Interpretation
This chart illustrates a classic performance trade-off in a selection or retrieval system (e.g., a recommendation engine, search algorithm, or classification model). The "Top-N" parameter controls how many top-ranked items are considered for a final decision or output.
* **The Trade-off:** The data demonstrates that expanding the candidate pool (increasing N) yields a more accurate result, as the system has a better chance of including the correct item. However, this accuracy gain comes with a direct and significant cost in computational latency. The system takes longer to process and evaluate more candidates.
* **Operational Implication:** The choice of an optimal "N" is not purely technical but depends on the application's requirements. A real-time system (e.g., live video analysis) might prioritize low latency and choose Top-3 or Top-5, accepting slightly lower accuracy. A batch processing system where accuracy is paramount (e.g., medical image analysis) might opt for Top-10 despite the higher latency.
* **Diminishing Returns:** The flattening accuracy curve suggests that beyond a certain point (perhaps Top-7 or Top-10), the marginal gain in accuracy may not justify the additional latency cost. This insight is crucial for system optimization, indicating a potential point of optimal efficiency.
* **Underlying System Behavior:** The linear latency increase suggests the processing time per candidate is roughly constant. The consistent error bars imply the system's performance variability is stable and not exacerbated by considering more candidates.
</details>
(a)
<details>
<summary>images/throughput_trade_off.png Details</summary>

### Visual Description
## Dual-Axis Performance Chart: Throughput vs. Accuracy by Top-k
### Overview
This image is a dual-axis combination chart displaying two performance metrics—throughput and accuracy—across four different "Top-k" configurations (Top-3, Top-5, Top-7, Top-10). The chart uses vertical bars to represent throughput and a connected line with markers to represent accuracy, each plotted against its own y-axis.
### Components/Axes
* **Chart Type:** Dual-axis combination chart (bar chart + line chart).
* **X-Axis (Categorical):** Labeled with four discrete categories: `Top-3`, `Top-5`, `Top-7`, `Top-10`. These are evenly spaced along the horizontal axis.
* **Primary Y-Axis (Left):** Labeled `Throughput (tokens/sec)`. The scale runs from 0 to 1400, with major tick marks at intervals of 200 (0, 200, 400, 600, 800, 1000, 1200, 1400).
* **Secondary Y-Axis (Right):** Labeled `Accuracy (%)`. The scale runs from 50 to 100, with major tick marks at intervals of 10 (50, 60, 70, 80, 90, 100).
* **Data Series 1 (Bars):** Light blue vertical bars represent **Throughput**. Each bar has a black error bar (I-beam style) at its top.
* **Data Series 2 (Line):** A solid red line with circular markers represents **Accuracy**. Each data point has a small, vertical black error bar.
* **Legend:** There is no explicit legend box. The axis labels and visual encoding (blue bars for throughput, red line for accuracy) serve as the legend.
### Detailed Analysis
**Throughput (Blue Bars - Left Axis):**
* **Trend:** The throughput shows a clear, monotonic **upward trend** as the Top-k value increases. The bars grow progressively taller from left to right.
* **Data Points (Approximate):**
* **Top-3:** ~550 tokens/sec. Error bar spans approximately 540 to 560.
* **Top-5:** ~800 tokens/sec. Error bar spans approximately 790 to 810.
* **Top-7:** ~1010 tokens/sec. Error bar spans approximately 980 to 1040.
* **Top-10:** ~1340 tokens/sec. Error bar spans approximately 1260 to 1420. This bar has the largest error range.
**Accuracy (Red Line - Right Axis):**
* **Trend:** The accuracy shows a very **slight, gradual upward trend** that appears to plateau. The line is nearly flat, with a minimal positive slope.
* **Data Points (Approximate):**
* **Top-3:** ~88%. Error bar spans approximately 87.5% to 88.5%.
* **Top-5:** ~89%. Error bar spans approximately 88.5% to 89.5%.
* **Top-7:** ~89.2%. Error bar spans approximately 88.7% to 89.7%.
* **Top-10:** ~89.3%. Error bar spans approximately 88.8% to 89.8%.
### Key Observations
1. **Strong Positive Correlation (Throughput vs. Top-k):** There is a direct and significant relationship between increasing the Top-k parameter and system throughput. Moving from Top-3 to Top-10 results in an approximate **144% increase** in throughput (from ~550 to ~1340 tokens/sec).
2. **Negligible Impact on Accuracy:** In contrast, accuracy remains remarkably stable, increasing by only about **1.3 percentage points** (from ~88% to ~89.3%) across the same range. The error bars for accuracy are small and consistent.
3. **Increasing Variance in Throughput:** The error bars on the throughput bars widen as Top-k increases, indicating greater variability or uncertainty in the throughput measurement for higher Top-k values, especially at Top-10.
4. **Spatial Layout:** The accuracy line is positioned in the upper portion of the chart area, visually corresponding to the 85-90% range on the right axis. The throughput bars occupy the lower to middle portion, scaling with the left axis.
### Interpretation
This chart demonstrates a clear **performance trade-off (or lack thereof)** in a computational system, likely related to a machine learning model's inference or decoding process (e.g., beam search, sampling).
* **What the data suggests:** Increasing the `Top-k` parameter (which typically means considering more candidate outputs at each step) leads to a substantial gain in processing speed (throughput) without a meaningful penalty to output quality (accuracy). This is a highly desirable outcome.
* **How elements relate:** The dual-axis format effectively contrasts the two metrics. The dramatic rise of the blue bars against the near-flat red line visually emphasizes that throughput is highly sensitive to the Top-k setting, while accuracy is largely insensitive to it within this tested range.
* **Notable implications:** The findings suggest that for this specific system or task, one can safely increase the Top-k value to gain significant speed improvements. The widening error bars at Top-10, however, hint that at very high k values, the throughput might become less predictable. The stability of accuracy implies that the model's top predictions are of consistently high quality, and expanding the candidate pool doesn't dilute the final selected output's correctness. This could be due to the model having high confidence in its top choices, making the exploration of a larger set (higher k) beneficial for parallelization or batching efficiency without introducing noise.
</details>
(b)
Figure 10: Trade-off between performance and efficiency for our method under different numbers of top-k tokens (k = 3, 5, 7, 10), measured by latency (left) and throughput (right).
## Appendix G Computational Cost Analysis
We also conduct a trade-off analysis between performance and efficiency by varying the number of alternative tokens considered at the critical position. Specifically, we test top-3, top-5, top-7, and top-10 alternatives using GSM8K, sampling 200 random questions and repeating the experiment five times to compute confidence intervals. The results are presented in Figure 10. We observe a clear trade-off. Reducing the number of alternative tokens lowers both latency and throughput, while causing a slight reduction in accuracy. This provides a practical mechanism for adjusting performance-efficiency trade-offs in real deployments. Depending on resource availability and target accuracy, practitioners can choose the appropriate number of alternative tokens.
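For illustration, the alternative tokens considered at the critical position could be drawn as follows. This is a sketch assuming a plain temperature-scaled softmax over logits, with token ids represented as list indices; the exact scaling used in the paper may differ.

```python
import math

def top_k_alternatives(logits, k, temperature=1.0):
    """Return the top-k token ids and their renormalized probabilities
    under temperature-scaled softmax (a sketch of top-K sampling)."""
    scaled = [l / temperature for l in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    m = max(scaled[i] for i in top)  # subtract max for numerical stability
    weights = [math.exp(scaled[i] - m) for i in top]
    z = sum(weights)
    return [(i, w / z) for i, w in zip(top, weights)]
```

Varying `k` over 3, 5, 7, and 10 reproduces the configurations compared in Figure 10.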
### G.1 Prompts
Prompts for reasoning (for math datasets) System: You are a specialized AI assistant designed to solve mathematical problems. User: Please solve the following problem step by step, and put your final answer within \boxed{}. Each pack of DVDs costs 76 dollars. If there is a discount of 25 dollars on each pack, how much do you have to pay to buy each pack? Assistant: To find the price after the discount, we need to subtract the discount amount from the original price. Step 1: Identify the original price and the discount amount. Original price = $76. Discount amount = $25. Step 2: Subtract the discount amount from the original price. Price after discount = Original price - Discount amount. Price after discount = $76 - $25. Price after discount = $51. So, you have to pay $51 to buy each pack. \boxed{$51}
Prompts for reasoning (for ARC dataset) System: You are a specialized AI assistant designed to solve complex reasoning problems. User: Your task is to carefully read the following question, apply logical reasoning, scientific knowledge, and commonsense inference, and arrive at the correct answer. Always explain your reasoning step by step in clear language. At the end of your response, output the final choice in the following format: The answer is: (option), where option is the correct option letter (A, B, C, D, etc.), enclosed in parentheses. A group of engineers wanted to know how different building designs would respond during an earthquake. They made several models of buildings and tested each for its ability to withstand earthquake conditions. Which will most likely result from testing different building designs? (A) buildings will be built faster (B) buildings will be made safer (C) building designs will look nicer (D) building materials will be cheaper? Assistant: We are given a scenario in which engineers are testing different building designs to see how well they withstand earthquake conditions. The goal is to determine which of the listed outcomes is most likely to result from such testing. Let's evaluate each option in turn: (A) Buildings will be built faster – This is not directly related to how well a building withstands earthquakes. Earthquake resistance is about structural integrity, not construction speed. While a good design might simplify construction, the primary purpose of the test is not speed. So this is not the most likely result. (B) Buildings will be made safer – This is directly aligned with the purpose of the experiment. Testing building designs under earthquake conditions helps identify which designs can better absorb or resist seismic forces. The goal is to improve safety during earthquakes. Therefore, this outcome is a direct and logical result of such testing. (C) Building designs will look nicer – Aesthetics are not the focus of earthquake testing. While a design might be both safe and attractive, the test itself does not evaluate visual appeal. So this is not a likely result of the testing. (D) Building materials will be cheaper – Cost of materials is not necessarily a result of testing for earthquake resistance. … The answer is: (B)
Prompts to paraphrase the question (for math datasets) System: You are a paraphrasing assistant specialized in math problems. Your task is to paraphrase a given math word problem. The core objective is to create a new, diverse expression of the problem while strictly preserving all numerical values, mathematical relationships, and the core logic required to solve it. User: Please create 5 paraphrased versions of the question below. Do NOT change any numbers or quantities. Maintain the exact mathematical operations and relationships. And vary the context and language. Please output paraphrased questions starting with a bracketed serial number like [1] [2] and ending with a new line. A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take? Assistant: [1] Two bolts of blue fabric and one and a half times that amount of white fabric are required to make a robe. How many bolts in all are needed? [2] A robe is made from 2 bolts of blue yarn and an additional amount equal to half of that for white yarn. In total, how many bolts does it call for? [3] To create a robe, you’ll need 2 bolts of blue thread and another half-quantity of white thread. How many bolts of thread in total should you purchase? [4] For a single robe, the production process demands 2 bolts of blue fabric and 1.5 times as much white fabric. What is the total number of bolts required? [5] The manufacture of a robe necessitates the usage of 2 bolts of blue material and slightly less than double that quantity for white material. How many total bolts does this equate to?
Prompts to paraphrase the question (for ARC dataset) System: You are given a multiple-choice question from a reasoning benchmark. Your task is to generate multiple different paraphrased versions of the question. Each version must preserve the original meaning, answer choices, and scientific/commonsense accuracy. Guidelines: - Keep the question logically equivalent to the original. - Do not change the correct answer. - Do not add or remove details that would alter the difficulty. - Use different phrasing, sentence structures, or synonyms for each version. - Keep the multiple-choice options (A, B, C, D, etc.) unchanged. Example Input: A large gene has dozens of exons, of which the central ones code for folded triple helical repeats that connect the cytoskeleton with sarcolemma and extracellular space. Each exon usually codes for one folded triple alpha helix. The most common mutations of the gene are central exon deletions that create out-of-frame peptides and progressive degenerative organ waste. A solution is to deliver a Morpholino that recognizes the 5’ end of the out-of-frame exon in pre-mRNA. The molecule prevents binding of the spliceosome and creates exon skipping and in-frame joining. Several missing exons are well tolerated by an organism. Which structure below is not involved in the proposed therapy? (A) antisense (B) lariat (C) R-loops (D) polyA tail User: Please create 3 paraphrased versions of the question below. Only paraphrase the question and leave the multiple-choice options untouched. Please output paraphrased questions starting with a bracketed serial number like [1] [2] and ending with a new line. An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation? (A) Planetary density will decrease. (B) Planetary years will become longer. (C) Planetary days will become shorter. (D) Planetary gravity will become stronger.
Assistant: [1] An astronomer notices that a planet’s rotational speed increases following a meteorite collision. What is the most probable consequence of this accelerated rotation? (A) Planetary density will decrease. (B) Planetary years will become longer. (C) Planetary days will become shorter. (D) Planetary gravity will become stronger. [2] After a meteorite strikes a planet, the astronomer observes an increase in its rotational velocity. What is the most likely outcome of this faster rotation? (A) Planetary density will decrease. (B) Planetary years will become longer. (C) Planetary days will become shorter. (D) Planetary gravity will become stronger. [3] A planet begins to rotate more quickly after experiencing a meteorite impact. Which of the following is most likely to result from this increased spin? (A) Planetary density will decrease. (B) Planetary years will become longer. (C) Planetary days will become shorter. (D) Planetary gravity will become stronger.
## Appendix H Use of Large Language Models
In our work, we focus on improving large language models, and LLMs are used in our experimental evaluation. In addition, we employ LLMs to check grammatical errors and refine sentence structures.