# Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification in Language Models
**Authors**: Chuxue Cao, Jinluan Yang, Haoran Li, Kunhao Pan, Zijian Zhao, Zhengyu Chen, Yuchen Tian, Lijun Wu, Conghui He, Sirui Han, Yike Guo
Abstract
Large Language Models (LLMs) show remarkable capabilities, yet their stochastic next-token prediction creates logical inconsistencies and reward hacking that formal symbolic systems avoid. To bridge this gap, we introduce a formal logic verification-guided framework that dynamically interleaves formal symbolic verification with the natural language generation process, providing real-time feedback to detect and rectify errors as they occur. Distinguished from previous neuro-symbolic methods limited by passive post-hoc validation, our approach actively penalizes intermediate fallacies during the reasoning chain. We operationalize this framework via a novel two-stage training pipeline that synergizes formal logic verification-guided supervised fine-tuning and policy optimization. Extensive evaluation on six benchmarks spanning mathematical, logical, and general reasoning demonstrates that our 7B and 14B models outperform state-of-the-art baselines by average margins of 10.4% and 14.2%, respectively. These results validate that formal verification can serve as a scalable mechanism to significantly push the performance boundaries of advanced LLM reasoning. We will release our data and models for further exploration soon.
Machine Learning, ICML
1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Chart/Diagram Type: Comparative Analysis of Reasoning Methods
### Overview
The image presents a comparative analysis of Natural Language (NL) reasoning versus Formal Logic reasoning in solving a problem involving relative scoring. It includes a problem statement, NL reasoning steps, formal logic reasoning steps, compiler output, and a bar chart comparing the consistency of logic in NL reasoning chains.
### Components/Axes
* **Problem Statement:** "Alice > Bob, Charlie < Alice, Diana > Charlie. Who scores higher: Bob or Diana?"
* **NL Reasoning (Left):**
* Steps: "Charlie < Diana < Alice > Bob → Therefore: Diana > Bob"
* Answer: "Diana scores higher than Bob" (marked with a red "X")
* **NL Reasoning (Right):**
* Steps: "Charlie < Diana < Alice > Bob → Therefore: Diana > Bob"
* **Formal Logic Reasoning:**
* Code:
* "solver.add(bob > diana)"
* "result = solver.check()"
* "solver.add(diana > bob)"
* "result = solver.check()"
* **Compiler Output:** "Unknown"
* **Final Answer (Formal Logic):** "Relationship is undetermined" (marked with a green checkmark)
* **Bar Chart:**
* X-axis: "Logic Consistency in NL Reasoning Chains" with categories "Correct CoT" and "Wrong CoT"
* Y-axis: "Percentage (%)"
* Legend (top-center):
* Blue: "Consistent Logic"
* Red: "Inconsistent Logic"
### Detailed Analysis
**Bar Chart Data:**
* **Correct CoT (Correct Chain of Thought):**
* Consistent Logic (Blue): 60.7%
* Inconsistent Logic (Red): 39.3%
* **Wrong CoT (Wrong Chain of Thought):**
* Consistent Logic (Blue): 47.6%
* Inconsistent Logic (Red): 52.4%
**Trend Verification:**
* For "Correct CoT", the "Consistent Logic" bar is significantly higher than the "Inconsistent Logic" bar.
* For "Wrong CoT", the "Inconsistent Logic" bar is slightly higher than the "Consistent Logic" bar.
### Key Observations
* NL reasoning, while providing a seemingly logical answer, is marked as incorrect.
* Formal logic reasoning, through code, determines that the relationship between Bob and Diana cannot be determined.
* The bar chart shows that even with a "Correct CoT", there's still a significant percentage (39.3%) of inconsistent logic in NL reasoning.
* When the "CoT" is wrong, the "Inconsistent Logic" is slightly higher than the "Consistent Logic".
### Interpretation
The image highlights the potential pitfalls of relying solely on NL reasoning for problem-solving, especially when dealing with logical relationships. While NL reasoning can provide an intuitive answer, it may not always be logically consistent or accurate. Formal logic, on the other hand, provides a more rigorous approach, capable of identifying when a relationship cannot be definitively determined. The bar chart emphasizes that inconsistencies can arise even when the chain of thought appears correct, suggesting that NL reasoning is prone to errors. The comparison underscores the importance of using formal methods to verify the correctness of NL-based solutions, especially in critical applications.
</details>
Figure 1: Comparison between natural language reasoning and formal logic verification-guided reasoning. Formal verification detects logical errors in natural language reasoning and provides corrected reasoning paths. "NL" means Natural Language.
Large Language Models (LLMs) demonstrate impressive proficiency in mathematical and logical reasoning (Ahn et al., 2024; Ji et al., 2025; Liu et al., 2025b; Chen et al., 2025a), yet their probabilistic decoding process lacks inherent mechanisms to ensure consistency (Sheng et al., 2025b; Baker et al., 2025). This tension creates significant risks, including hallucinations (Li and Ng, 2025; Sheng et al., 2025b), safety vulnerabilities (Zhou et al., 2025; Cao et al., 2025b), and reward hacking (Chen et al., 2025b; Baker et al., 2025). Although recent efforts have employed model-based verifiers to offer denser feedback than sparse ground-truth labels (Ma et al., 2025; Liu et al., 2025c; Guo et al., 2025b), they often overlook intermediate reasoning steps. To enforce more rigorous supervision, subsequent research has incorporated formal tools like theorem provers and code interpreters (Ospanov et al., 2025; Kamoi et al., 2025; Liu et al., 2025a) to address this drawback. However, existing formal approaches face critical limitations: they are often restricted to specific domains (e.g., Mathematics) (Ospanov et al., 2025; Liu et al., 2025a), rely on uncertain autoformalization (Ospanov et al., 2025; Feng et al., 2025b), or utilize post-hoc verification that cannot actively prevent error propagation (Kamoi et al., 2025; Feng et al., 2025b). This yields the primary question to be explored:
(Q) Can we utilize formal verification to further enhance LLM reasoning across diverse domains?
To explore this question, we first quantified the logical consistency gap in current LLMs by conducting a formal verification analysis of generated reasoning chains. A critical finding emerges as shown in Figure 1: even chains that arrive at correct final answers suffer from substantial logical inconsistency, with 39.3% of steps formally disproved, a trend consistent with previous research (Sheng et al., 2025b; Leang et al., 2025). For chains leading to incorrect answers, this failure rate rises to 52.4%. The comparative example in Figure 1 illustrates this gap: natural language reasoning incorrectly concludes "Diana $>$ Bob" from the given constraints, while formal verification identifies the incorrect conclusion and leads to the correct answer. This phenomenon reveals pervasive "reward hacking," where models exploit superficial patterns to reach correct labels without establishing sound logical foundations (Skalse et al., 2022). Ultimately, these results expose a fundamental limitation of natural language reasoning: without explicit verification mechanisms, models cannot guarantee reasoning validity or global logical coherence across multi-step inference.
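The undecidability in the Figure 1 example can be made concrete with a brute-force entailment check, a minimal pure-Python stand-in for the SMT call shown in the figure: enumerate every score ordering consistent with the premises and test whether the conclusion holds in all, none, or only some of them. The helper name `entailment_status` is illustrative, not from the paper.

```python
from itertools import permutations

def entailment_status(premises, query, people):
    """Decide whether `query` is entailed, refuted, or undetermined by
    `premises`, by enumerating every total ordering of the people.
    Constraints are predicates over a dict mapping person -> rank."""
    consistent = holds = 0
    for order in permutations(range(len(people))):
        ranks = dict(zip(people, order))  # higher rank = higher score
        if all(p(ranks) for p in premises):
            consistent += 1
            holds += query(ranks)
    if consistent == 0:
        return "unsatisfiable premises"
    if holds == consistent:
        return "entailed"
    return "refuted" if holds == 0 else "undetermined"

# Premises from Figure 1: Alice > Bob, Charlie < Alice, Diana > Charlie.
people = ["alice", "bob", "charlie", "diana"]
premises = [
    lambda r: r["alice"] > r["bob"],
    lambda r: r["charlie"] < r["alice"],
    lambda r: r["diana"] > r["charlie"],
]
conclusion = lambda r: r["diana"] > r["bob"]  # the flawed NL conclusion

print(entailment_status(premises, conclusion, people))  # undetermined
```

Both orderings with Diana above Bob and orderings with Bob above Diana satisfy the premises, so the relationship is undetermined, which is exactly the error the natural-language chain misses.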
To bridge this gap, we propose a novel framework that synergizes probabilistic natural language reasoning with formal symbolic verification. Distinguished from prior approaches relying on static filtering or narrow domains, our method dynamically interleaves formal verification into the generation process. By incorporating feedback from satisfiability results, counterexamples, and execution outputs, we extend the standard chain-of-thought paradigm to enable real-time error detection and rectification. To effectively operationalize this interleaved reasoning approach, we introduce a formal logic verification-guided training framework comprising two synergistic stages: (i) Supervised Fine-tuning (SFT), which employs a hierarchical data synthesis pipeline. Crucially, we mitigate the noise of raw autoformalization by enforcing execution-based validation, thereby ensuring high alignment between natural language and formal proofs; and (ii) Reinforcement Learning (RL), which utilizes Group Relative Policy Optimization (GRPO) with a composite reward function to enforce structural integrity and penalize logical fallacies. Empirical evaluations across logical, mathematical, and general reasoning domains demonstrate that this framework substantially enhances reasoning accuracy, highlighting the potential of formal verification to push the performance boundaries of LLM reasoning. Our contributions can be summarized as follows:
- We propose the first framework that dynamically interleaves formal verification into LLM reasoning across diverse domains, utilizing real-time feedback from symbolic interpreters to transcend the limitations of passive post-hoc filtering and domain-specific theorem proving.
- We introduce a two-stage training framework that combines formal verification-guided supervised fine-tuning with policy optimization, featuring a novel data synthesis pipeline with execution-based validation to enforce logical soundness and structural integrity.
- Extensive evaluations on six benchmarks demonstrate that our models break performance ceilings, surpassing SOTA by 10.4% (7B) and 14.2% (14B). This validates the scalability and effectiveness of our proposed method.
2 Related Works
2.1 Large Language Models for Natural Reasoning
Supervised fine-tuning (SFT) on chain-of-thought examples (Wei et al., 2022) and step-by-step solutions (Cobbe et al., 2021) has been foundational for developing reasoning capabilities in LLMs, with recent efforts curating high-quality datasets across mathematics (LI et al., 2024), code (Xu et al., 2025), and science (Wang et al., 2022). However, SFT alone cannot effectively optimize for complex objectives beyond imitation and struggles with multi-step error correction (Lightman et al., 2023; Uesato et al., 2022; Zhou et al., 2026). Recent RL advances using outcome-based optimization methods have achieved remarkable success in mathematical reasoning (Cobbe et al., 2021; Zeng et al., 2025), code generation (Le et al., 2022; Feng et al., 2025a), and general-domain reasoning (Ma et al., 2025; Chen et al., 2025c). However, optimizing solely for final answer correctness creates perverse incentives where models learn correct conclusions through logically invalid pathways (Uesato et al., 2022), leading to reward hacking (Skalse et al., 2022) and brittleness under distribution shift (Hendrycks et al., 2021). To address these limitations, process-based rewards incorporate feedback from intermediate steps, providing dense supervision through human-annotated judgments (Uesato et al., 2022; Lightman et al., 2023; She et al., 2025; Khalifa et al., 2025). However, the probabilistic nature of language model-based verifiers introduces errors and biases (Zheng et al., 2023), limiting their ability to detect subtle logical inconsistencies that emerge during multi-step reasoning.
2.2 Formal Reasoning and Verification
Recent work has integrated formal verification tools, including theorem provers (Yang et al., 2023; Cao et al., 2025a; Tian et al., 2025), code interpreters (Feng et al., 2025a), and symbolic solvers (Li et al., 2025a), to provide machine-checkable validation beyond the biases of LLM-as-a-judge approaches (Li and Ng, 2025; Uesato et al., 2022; Lightman et al., 2023). This direction is increasingly recognized as critical for grounding generative models in verifiable systems (Ren et al., 2025; Wang et al., 2025; Hu et al., 2025). Existing approaches differ in how verification is applied. HERMES (Ospanov et al., 2025) interleaves informal reasoning with Lean-verified steps, ensuring real-time soundness but requiring mature formal libraries. Safe (Liu et al., 2025a) applies post-hoc verification to audit completed reasoning chains, though this passive mode cannot prevent error accumulation during generation. VeriCoT (Feng et al., 2025b) checks logical consistency on extracted first-order logic, while others train verifier models using formal tool signals (Kamoi et al., 2025). Tool-integrated methods (Xue et al., 2025; Zeng et al., 2025; Li et al., 2025b; Feng et al., 2025a) embed interpreter calls into generation for calculation or simulation, but often lack strict logical guarantees. These approaches face key limitations: specialization to mathematical theorem proving, treating verification as separate filtering without guiding generation, or relying on uncertain logic extraction and neural verifiers. In contrast, we propose verification-guided reasoning that extends formal verification to general logical domains and employs real-time feedback as dynamic, in-process guidance to steer reasoning trajectories and enable self-correction beyond specialized formal tasks.
3 Preliminaries
<details>
<summary>x2.png Details</summary>

### Visual Description
## Bar Chart: Number of Correct Answers by Domain
### Overview
The image is a bar chart comparing the number of correct answers achieved by two models, "Natural-SFT" and "FLV-SFT," across three domains: Logical, Mathematical, and General. The chart displays the absolute number of correct answers for each model and domain, along with the percentage increase of FLV-SFT over Natural-SFT for each domain.
### Components/Axes
* **X-axis:** "Domain" with categories: Logical, Mathematical, General.
* **Y-axis:** "Number of Correct Answers," ranging from 0 to 350, with gridlines at intervals of 50.
* **Legend:** Located at the top-right of the chart.
* Blue bar: "Natural-SFT"
* Red bar: "FLV-SFT"
* **Data Labels:** Numerical values are displayed above each bar, indicating the exact number of correct answers. Percentage increase values are displayed above each pair of bars.
### Detailed Analysis
* **Logical Domain:**
* Natural-SFT: 219 correct answers (blue bar)
* FLV-SFT: 291 correct answers (red bar)
* Percentage increase: +32.8%
* **Mathematical Domain:**
* Natural-SFT: 163 correct answers (blue bar)
* FLV-SFT: 243 correct answers (red bar)
* Percentage increase: +49.3%
* **General Domain:**
* Natural-SFT: 166 correct answers (blue bar)
* FLV-SFT: 213 correct answers (red bar)
* Percentage increase: +28.5%
### Key Observations
* FLV-SFT consistently outperforms Natural-SFT across all three domains.
* The largest performance increase of FLV-SFT over Natural-SFT is observed in the Mathematical domain (+49.3%).
* The smallest performance increase of FLV-SFT over Natural-SFT is observed in the General domain (+28.5%).
* Both models achieve the highest number of correct answers in the Logical domain.
### Interpretation
The bar chart demonstrates that the FLV-SFT model is more effective than the Natural-SFT model in answering questions across the Logical, Mathematical, and General domains. The significant percentage increase in the Mathematical domain suggests that FLV-SFT is particularly well-suited for mathematical problem-solving. The consistent outperformance of FLV-SFT indicates a potential advantage in its architecture, training data, or learning algorithm compared to Natural-SFT. The data suggests that FLV-SFT is a more robust and accurate model for question answering across various domains.
</details>
Figure 2: Number of correct answers using natural language reasoning versus formal logic verification. We randomly sampled 500 instances from each domain for this comparison.
We first present empirical evidence illustrating the gap between probabilistic natural language reasoning and formal verification in LLM reasoning. We then formally define our reasoning paradigm and introduce the preliminaries on the symbolic verification methods utilized in our framework.
3.1 Natural vs. Formal Reasoning in LLMs
Our previous experiments demonstrate that LLMs lack mechanisms to ensure global logical consistency (Figure 1), motivating us to explore formal logic verification. Formal logic provides a rigorous framework where reasoning steps can be reliably validated using formal solvers. As shown in Figure 2, integrating formal logic verification with natural language reasoning yields significant performance improvements across diverse domains. We compare two approaches: Natural-SFT, which relies solely on natural language reasoning, and FLV-SFT, which incorporates formal logic verification. Across 500 randomly sampled instances from each domain, FLV-SFT consistently outperforms Natural-SFT, achieving 291 vs. 219 correct answers in the Logical domain (+32.8%), 243 vs. 163 in the Mathematical domain (+49.3%), and 213 vs. 166 in the General domain (+28.5%). These substantial improvements across all three categories demonstrate that formal verification effectively addresses the consistency gaps inherent in purely neural approaches. These results underscore the significant potential of formal verification to bridge the reasoning gap and strongly motivate our approach of interleaving natural language reasoning with formal verification throughout the reasoning process.
3.2 Problem Formulation
Formally, let $\mathcal{D}=\{(x,y)\}$ be a dataset of reasoning tasks, where $x$ denotes the input context (e.g., problem description) and $y$ denotes the ground-truth answer.
Standard CoT Paradigm. In conventional Chain-of-Thought reasoning, an LLM $P_{\theta}$ generates a sequence of reasoning steps $z=(s_{1},s_{2},...,s_{n})$ in natural language, aiming to maximize:
$$
P_{\theta}(y,z\mid x)=P_{\theta}(y\mid z,x)\cdot\prod_{i=1}^{n}P_{\theta}(s_{i}\mid s_{<i},x) \tag{1}
$$
However, this objective does not guarantee that $z$ is logically valid or formally verifiable.
Our Paradigm: Formal Logic Verification-Guided Reasoning. We propose augmenting the reasoning chain with formal verification at each step. Specifically, we define an extended reasoning chain $z^{\prime}=(s_{1},f_{1},v_{1},s_{2},f_{2},v_{2},...,s_{n},f_{n},v_{n})$ , where:
- $s_{i}$ : Natural language reasoning step (as in standard CoT)
- $f_{i}$ : Formal specification that encodes the logical correctness of $s_{i}$ (e.g., symbolic constraints, SAT clauses, SMT formulas, or executable code)
- $v_{i}$ : Formal Logic Verification result returned by a formal verifier $\mathcal{V}$ when applied to $f_{i}$
During training, our objective is to maximize the likelihood of correct final answers:
$$
\max_{\theta}\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\log P_{\theta}(y,z^{\prime}\mid x)\right] \tag{2}
$$
During inference, the verification function $\mathcal{V}$ takes the formal reasoning as input and returns detailed feedback at each reasoning step. This feedback may include satisfiability results, counterexamples, proof traces, execution outputs, or error messages, which guide the model to generate logically sound and verifiable subsequent reasoning steps.
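For illustration, the verification function $\mathcal{V}$ can be sketched as a sandboxed executor: run the formal step, capture its printed output, and surface errors as feedback. This is a minimal sketch in plain Python, not the paper's actual verifier; a production version would isolate execution, e.g., in a resource-limited subprocess.

```python
import contextlib
import io
import traceback

def verify(formal_step: str, state: dict) -> str:
    """Minimal sketch of the verifier V: execute a formal reasoning step
    (here, plain Python) in a shared namespace and return its printed
    output, or an error message if execution fails. The `state` dict
    persists definitions across steps of one reasoning chain."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(formal_step, state)
        out = buf.getvalue().strip()
        return out if out else "ok (no output)"
    except Exception:
        last = traceback.format_exc(limit=1).strip().splitlines()[-1]
        return "error: " + last

state = {}
print(verify("x = 3; print(x > 2)", state))    # True
print(verify("print(undefined_name)", state))  # error: NameError: ...
```

The returned string plays the role of $v_i$ in the extended chain $z^{\prime}$: the model conditions on it when generating the next step $s_{i+1}$.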
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Formal Logic Verification with Reinforcement Learning
### Overview
The image is a diagram illustrating a two-stage process involving Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for formal logic verification. The diagram outlines the flow of data and processes, including natural language reasoning, formal logic reasoning, and verification steps.
### Components/Axes
* **Stages:** The diagram is divided into two stages: Stage 1 (SFT) and Stage 2 (RL).
* **SFT Stage (Top):**
* **Input q:** A green rounded rectangle labeled "Input q" represents the initial input.
* **Distill:** An icon of a robot labeled "Distill" processes the input.
* **Filter by Fail rate:** A series of red "X" marks and a green checkmark indicate filtering based on a "Fail rate".
* **Difficult Questions:** A stack of documents labeled "Difficult Questions" represents the output of the filtering process.
* **Formal Logic Incert:** An icon of a robot labeled "Formal Logic Incert" processes the difficult questions.
* **Formal Logic-Verification Trace:** A blue box containing "Natural Language Reasoning", "Formal Logic Reasoning", and "Verifier Observation".
* **Verifier:** A code icon labeled "Verifier" checks the output.
* **Verified:** A green database icon labeled "Verified" represents successful verification.
* **Rejected:** A trash can icon labeled "Rejected" represents failed verification.
* **FLV-SFT Model:** An icon of a robot labeled "FLV-SFT Model" represents the final model.
* **RL Stage (Bottom):**
* **Policy Model:** An icon of a robot labeled "Policy Model" initiates the RL stage.
* **Step tn:** A blue box containing "(Natural Language Reasoning & Formal Action)" and code snippets. The code includes "Implies(And(both_optimal, duality_holds), both_feasible)", "solver = Solver()", and "result = solver.check()".
* **Step tn+1:** A blue box containing "Verifier Observation" and the question "Are the triple iff and strong duality equivalent? False".
* **Step tm:** A blue box containing "Next Natural Language Reasoning" and the text "The result indicates that the triple iff structure (Both the primal LP and the dual LP have feasible solutions <-> then...".
* **Interleaved Reasoning Loop:** A curved arrow indicates a loop between the steps.
* **Correct Reward Instruct Reward:** A pink box containing a dollar sign icon and the text "Correct Reward Instruct Reward". Also contains a graph labeled "GRPO Advantage Ai".
* **Backprop:** An arrow labeled "Backprop" indicates the feedback loop.
### Detailed Analysis
* **SFT Stage Flow:** The SFT stage begins with an input "q", which is distilled and then filtered based on a fail rate. The difficult questions are then processed by a formal logic component, verified, and either accepted or rejected.
* **RL Stage Flow:** The RL stage involves a policy model that interacts with the environment through a series of steps (tn, tn+1, tm). These steps involve natural language reasoning, formal action, and verifier observations. An interleaved reasoning loop connects these steps. The process is rewarded based on correctness, and the feedback is backpropagated to the policy model.
* **Code Snippets:** The code snippets in Step tn suggest the use of a solver to check logical implications and feasibility.
* **Questions:** The question in Step tn+1 indicates a check for equivalence between triple iff and strong duality.
### Key Observations
* The diagram illustrates a hybrid approach combining SFT and RL for formal logic verification.
* The process involves both natural language reasoning and formal logic reasoning.
* The RL stage uses an interleaved reasoning loop to refine the verification process.
### Interpretation
The diagram presents a system designed to improve formal logic verification through a combination of supervised learning and reinforcement learning. The SFT stage likely serves to pre-train a model on a dataset of logical problems, while the RL stage fine-tunes the model's reasoning and verification abilities through interaction with an environment. The interleaved reasoning loop in the RL stage suggests an iterative process of hypothesis generation and verification. The reward system encourages the model to produce correct and informative reasoning steps. The overall architecture aims to create a robust and reliable system for formal logic verification.
</details>
Figure 3: Overview of the formal logic verification-guided training framework. The framework operates in two stages: (1) SFT: A teacher model synthesizes formal logic verification traces, which are validated by a verifier. A subset of verified samples is used to fine-tune the model, while challenging samples are reserved for RL training. (2) RL: The policy model generates natural language reasoning followed by formal reasoning. A formal interpreter verifies the formal reasoning and provides feedback, enabling iterative refinement until the model produces a final answer or reaches the maximum number of interpreter calls. Rewards computed from verification outcomes are used to calculate advantages and update the policy model via reinforcement learning.
4 Methodology
To address logical inconsistencies and hallucinations in pure natural language reasoning (Section 3), we propose integrating formal logic verification into the reasoning process. Our approach consists of two stages: (i) Supervised Fine-tuning to enable the model to generate interleaved natural language and formal proofs, and (ii) Reinforcement Learning to optimize the model using composite rewards that enforce logical soundness and correctness.
4.1 Formal Logic Verification-Guided SFT
The primary goal of the SFT stage is to align the model's output distribution with a structured reasoning format that supports self-verification. Since large-scale datasets containing interleaved reasoning and formal proofs are scarce, we employ a hierarchical formal proof data synthesis pipeline.
4.1.1 Data Synthesis Pipeline
CoT Generation. Given a raw reasoning problem $q$ , we first employ a strong teacher model to generate $K=4$ candidate reasoning chains. A judge model evaluates the correctness of the final answers to calculate pass rates. We select a subset of verified chains that yield correct solutions for subsequent processing. Let $z$ denote a selected correct reasoning chain. To incorporate formal logic, we utilize an LLM to decompose $z$ into discrete logical modules $\{s_{k}\}_{k=1}^{N}$ . For each module $s_{k}$ , the LLM synthesizes a corresponding formal proof $f_{k}$ and predicts an expected execution output $v_{k}^{\text{exp}}$ . The prompt template is provided in Figure 9.
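The filtering step above can be sketched as follows. The helper name `split_by_pass_rate` and the `hard_threshold` value are illustrative assumptions (the paper's pass-rate cutoff is not stated here); the routing of low-pass-rate questions to the RL stage follows Figure 3.

```python
def split_by_pass_rate(candidates, gold, hard_threshold=0.25):
    """Sketch of the CoT filtering step: keep chains whose final answer
    matches the gold label, and flag low-pass-rate questions as
    'difficult' (reserved for RL training, per Figure 3).
    `hard_threshold` is an illustrative value, not from the paper."""
    correct = [c for c in candidates if c["answer"] == gold]
    pass_rate = len(correct) / len(candidates)
    return correct, pass_rate, pass_rate <= hard_threshold

# Hypothetical example with K = 4 candidate chains for one question.
candidates = [{"answer": "A"}, {"answer": "B"},
              {"answer": "A"}, {"answer": "A"}]
kept, rate, is_hard = split_by_pass_rate(candidates, gold="A")
print(rate, is_hard)  # 0.75 False
```

Questions flagged as difficult skip SFT data synthesis and feed the $\mathcal{D}_{\text{difficult}}$ pool used in Section 4.2.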
Execution-based Validation and Correction. To ensure the fidelity of synthesized formal proofs, we implement a rigorous validation mechanism. Each generated formal proof $f_{k}$ is executed in a sandbox to obtain the actual output $v_{k}^{\text{act}}$ . We then perform a three-stage validation:
Stage 1: Exact Match. If the actual output exactly matches the expected output ( $v_{k}^{\text{act}}=v_{k}^{\text{exp}}$ ), the proof is accepted immediately and integrated into the training data.
Stage 2: Semantic Equivalence Check. In cases where $v_{k}^{\text{act}}\neq v_{k}^{\text{exp}}$, we employ a verification model to assess whether the discrepancy is semantically negligible (e.g., differences in capitalization, output ordering, or numerical precision). If the outputs are deemed equivalent under mathematical or logical semantics, we proceed to Stage 3.
Stage 3: Proof Rewriting. When minor inconsistencies are detected, we require the LLM to regenerate the natural language reasoning $s_{k}^{\prime}$ conditioned on the actual execution result $v_{k}^{\text{act}}$ . This ensures that the natural language reasoning module $s_{k}^{\prime}$ , the formal proof $f_{k}^{\prime}$ , and the execution output $v_{k}^{\text{act}}$ maintain strict logical coherence. Proofs that fail both exact match and semantic equivalence checks are discarded. The resulting training instance is structured as follows:
$$
\begin{split}z_{\text{aug}}=\bigoplus_{k=1}^{N}\Big(s_{k}\oplus\texttt{<code>}f_{k}^{\prime}\texttt{</code>}\\
\oplus\texttt{<interpreter>}v_{k}^{\text{act}}\texttt{</interpreter>}\Big)\end{split} \tag{3}
$$
where $f_{k}^{\prime}$ denotes the validated formal proof and $v_{k}^{\text{act}}$ is the verified execution output. This pipeline ensures that every training example $(s,f,v)$ exhibits perfect alignment between natural language hypotheses, formal logic reasoning, and execution feedback, thereby providing high-quality supervision for the model to learn reliable reasoning patterns. See Appendix B for dataset construction details.
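A minimal sketch of the validation cascade and the assembly in Eq. (3). The Stage-2 semantic-equivalence check is stubbed as string normalization (the paper uses a verification model for that stage), and all helper names are hypothetical.

```python
def normalize(v: str) -> str:
    """Stub for the Stage-2 check: tolerate capitalization and
    surrounding-whitespace differences. The paper uses a verification
    model here; this function is only an illustration."""
    return v.strip().lower()

def validate_module(s, f, v_exp, v_act, rewrite_fn):
    """Apply the three-stage validation to one (s, f, v) module.
    Returns the accepted triple, or None if the module is discarded."""
    if v_act == v_exp:                        # Stage 1: exact match
        return (s, f, v_act)
    if normalize(v_act) == normalize(v_exp):  # Stage 2: equivalent
        return (rewrite_fn(s, v_act), f, v_act)  # Stage 3: rewrite NL
    return None                               # real mismatch: discard

def assemble(modules):
    """Build the augmented chain z_aug of Eq. (3)."""
    return "".join(
        f"{s}<code>{f}</code><interpreter>{v}</interpreter>"
        for (s, f, v) in modules
    )

# Hypothetical data: the second module needs Stage-2/3 correction.
raw = [
    ("x is positive. ", "print(1 > 0)", "True", "True"),
    ("The check passes. ", "print('SAT')", "sat", "SAT"),
]
rewrite = lambda s, v: s.rstrip() + f" (per executed output: {v}) "
validated = [m for m in (validate_module(s, f, ve, va, rewrite)
                         for s, f, ve, va in raw) if m is not None]
print(assemble(validated))
```

Each surviving module contributes one $(s_k, f_k^{\prime}, v_k^{\text{act}})$ block, so the assembled string matches the format the SFT objective in Eq. (5) is trained on.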
4.1.2 Optimization Objective
Given the augmented training dataset $\mathcal{D}_{\text{SFT}}=\{(q_{i},z_{\text{aug},i})\}_{i=1}^{M}$ , we optimize the model parameters $\theta$ by maximizing the log-likelihood of generating structured reasoning sequences:
$$
\mathcal{L}_{\text{SFT}}(\theta)=-\mathbb{E}_{(q,z_{\text{aug}})\sim\mathcal{D}_{\text{SFT}}}\left[\log P_{\theta}(z_{\text{aug}}\mid q)\right] \tag{4}
$$
This can be decomposed into the sequential generation of reasoning modules, formal proofs, and execution outputs:
$$
\begin{split}\mathcal{L}_{\text{SFT}}(\theta)=-\mathbb{E}_{(q,z_{\text{aug}})\sim\mathcal{D}_{\text{SFT}}}\Bigg[\sum_{k=1}^{N}\Big(\log P_{\theta}(s_{k}\mid q,z_{<k})\\
+\log P_{\theta}(f_{k}^{\prime}\mid q,z_{<k},s_{k})+\log P_{\theta}(v_{k}^{\text{act}}\mid q,z_{<k},s_{k},f_{k}^{\prime})\Big)\Bigg]\end{split} \tag{5}
$$
where $z_{<k}$ denotes all previous modules. We train using AdamW with cosine learning rate scheduling and gradient clipping. This stage enables the model to generate verifiable, formally grounded reasoning chains.
4.2 Formal Verification-Guided Policy Optimization
To further enhance the formal logic verification-guided reasoning capabilities of LLMs, we employ reinforcement learning. The core of this stage is a multi-dimensional reward function that provides fine-grained feedback on structure, semantics, and computational efficiency.
4.2.1 Hierarchical Reward Design
To ensure both generation stability and reasoning rigor, we design a hierarchical reward function $R(y)$ that evaluates responses in a strictly prioritized order: Format Integrity $\succ$ Structural Compliance $\succ$ Logical Correctness. The unified reward is formulated as:
$$
R(y)=\begin{cases}R_{\text{fatal}}&y\in\mathbb{C}_{\text{fatal}}\qquad\text{(L1: Fatal)}\\
R_{\text{invalid}}&y\in\mathbb{C}_{\text{invalid}}\qquad\text{(L2: Format)}\\
R_{\text{total}}(y)&\text{otherwise}\qquad\text{(L3: Valid)}\end{cases} \tag{6}
$$
The total reward for valid responses is:
$$
R_{\text{total}}(y)=R_{\text{struct}}(y)+R_{\text{logic}}(y) \tag{7}
$$
Level 1 & 2: Penalties for Malformed Generation. We first filter out pathological behaviors to prevent reward hacking and infinite loops during training.
- Fatal Errors ( $\mathbb{C}_{\text{fatal}}$ ): Responses with severe and unrecoverable execution failures (e.g., timeouts, repetition loops, excessive tool calls). We assign a harsh penalty $R_{\text{fatal}}=-\gamma_{\text{struct}}-W$ to strictly inhibit these states, where $W>0$ is a correctness weight hyperparameter.
- Format Violations ( $\mathbb{C}_{\text{invalid}}$ ): Responses that are technically executable but structurally flawed (e.g., missing termination tags, no extractable final answer, excessive verbosity). These incur a moderate penalty $R_{\text{invalid}}=-\beta_{\text{struct}}-W$ , where $\gamma_{\text{struct}}>\beta_{\text{struct}}>0$ .
Level 3: Incentives for Valid Reasoning. For responses that pass the format checks ($y\notin\mathbb{C}_{\text{fatal}}\cup\mathbb{C}_{\text{invalid}}$), the reward is a composite of structural quality and logical correctness.
(i) Structural Reward $R_{\text{struct}}(y)$ : Encourages concise and compliant tool usage.
$$
R_{\text{struct}}(y)=\alpha-\lambda_{\text{tag}}\cdot N_{\text{undef}}-\lambda_{\text{call}}\cdot\max(N_{\text{call}}-N_{\text{max}},0) \tag{8}
$$
Here, $\alpha$ is a base bonus, $N_{\text{undef}}$ tracks undefined tags, and the last term penalizes excessive tool invocations beyond a threshold $N_{\text{max}}$ .
(ii) Logical Correctness Reward $R_{\text{logic}}(y)$ : Evaluates the final answer $\hat{a}$ against the ground truth $a^{*}$ .
$$
R_{\text{logic}}(y)=\begin{cases}W-\lambda_{\text{len}}\cdot\Delta_{\text{len}}(\hat{a},a^{*})&\text{if }\hat{a}=a^{*}\\
-W&\text{if }\hat{a}\neq a^{*}\end{cases} \tag{9}
$$
where $\Delta_{\text{len}}$ penalizes length discrepancies to discourage verbose reasoning and promote conciseness. Detailed hyperparameter settings are provided in Appendix C.
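Eqs. (6)-(9) can be sketched as a single dispatch function. The hyperparameter values below are illustrative placeholders, not the paper's settings (those are in its Appendix C), and the response fields are a hypothetical encoding.

```python
# Illustrative hyperparameters; the paper's actual values live in its
# Appendix C and are NOT reproduced here.
W, ALPHA = 1.0, 0.2                    # correctness weight, base bonus
GAMMA_STRUCT, BETA_STRUCT = 0.6, 0.3   # gamma_struct > beta_struct > 0
LAM_TAG, LAM_CALL, LAM_LEN = 0.05, 0.1, 0.001
N_MAX = 4                              # allowed interpreter calls

def reward(resp: dict) -> float:
    """Hierarchical reward of Eqs. (6)-(9): fatal and format checks
    first (L1, L2), then R_struct + R_logic for valid responses (L3)."""
    if resp["fatal"]:                                  # L1: fatal
        return -GAMMA_STRUCT - W
    if resp["invalid_format"]:                         # L2: format
        return -BETA_STRUCT - W
    r_struct = (ALPHA                                  # Eq. (8)
                - LAM_TAG * resp["n_undef_tags"]
                - LAM_CALL * max(resp["n_calls"] - N_MAX, 0))
    if resp["answer"] == resp["gold"]:                 # Eq. (9)
        r_logic = W - LAM_LEN * resp["len_delta"]
    else:
        r_logic = -W
    return r_struct + r_logic                          # Eq. (7)

ok = dict(fatal=False, invalid_format=False, n_undef_tags=0,
          n_calls=3, answer="42", gold="42", len_delta=100)
print(reward(ok))  # ~1.1 (0.2 structural + 0.9 logical)
```

The strict priority ordering means a fatal response can never out-score a merely malformed one, and a malformed one can never out-score any valid response with a correct answer.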
Table 1: Comparative evaluation between our proposed formal logic verification (FLV) methods (rows marked "Ours") and other approaches. Bold values denote the best results within each model size. KOR-Bench and BBH contain multiple subfields and report macro-averaged scores.
| Model | KOR-Bench (Logical) | BBH (Logical) | MATH500 (Math.) | AIME24 (Math.) | GPQA-D (General) | TheoremQA (General) | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B | | | | | | | |
| Base | 13.2 | 41.9 | 60.2 | 6.5 | 29.3 | 29.1 | 30.0 |
| Qwen2.5-7B-Instruct | 40.2 | 67.0 | 75.0 | 9.4 | 33.8 | 36.6 | 43.7 |
| SimpleRL-Zoo | 34.2 | 59.8 | 74.0 | 14.8 | 24.2 | 43.1 | 41.7 |
| General-Reasoner | 43.9 | 61.9 | 73.4 | 12.7 | 38.9 | 45.3 | 46.0 |
| RLPR | 42.2 | 66.2 | 77.2 | 14.2 | 37.9 | 44.3 | 47.0 |
| Synlogic | 48.1 | 66.5 | 74.6 | 15.4 | 27.8 | 39.2 | 45.3 |
| ZeroTIR | 30.0 | 40.0 | 62.4 | 28.5 | 28.8 | 36.4 | 37.7 |
| SimpleTIR | 37.0 | 62.0 | 82.6 | 41.0 | 22.7 | 51.1 | 49.4 |
| FLV-SFT (Ours) | 48.0 | 68.5 | 77.2 | 20.0 | 32.3 | 53.0 | 49.8 |
| FLV-RL (Ours) | 51.0 | 70.0 | 78.6 | 20.8 | 35.4 | 55.7 | 51.9 |
| Qwen2.5-14B | | | | | | | |
| Base | 37.4 | 52.0 | 65.4 | 3.6 | 32.8 | 33.0 | 37.4 |
| Qwen2.5-14B-Instruct | 51.5 | 72.9 | 77.4 | 12.2 | 41.4 | 41.9 | 49.6 |
| SimpleRL-Zoo | 37.2 | 72.7 | 77.2 | 12.9 | 39.4 | 48.9 | 48.1 |
| General-Reasoner | 41.3 | 71.5 | 78.6 | 17.5 | 43.4 | 55.3 | 51.3 |
| FLV-SFT (Ours) | 54.0 | 77.5 | 79.8 | 21.9 | 40.4 | 60.6 | 55.7 |
| FLV-RL (Ours) | 57.0 | 78.0 | 81.4 | 30.2 | 41.4 | 63.5 | 58.6 |
4.2.2 Optimization Objective
We optimize $\pi_{\theta}$ using GRPO. For each input $x\sim\mathcal{D}_{\text{difficult}}$ , we sample $G$ outputs $\{y_{1},\ldots,y_{G}\}$ and compute:
$$
\begin{split}\mathcal{L}_{\text{GRPO}}(\theta)&=\mathbb{E}_{x\sim\mathcal{D}_{\text{difficult}}}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\sum_{t=1}^{|y_{i}|}\bigg(\\
&\qquad\min\big(r_{i,t}\hat{A}_{i},\mathrm{clip}(r_{i,t},1-\epsilon,1+\epsilon)\hat{A}_{i}\big)\\
&\qquad-\beta\log\frac{\pi_{\theta}(y_{i,t}|x,y_{i,<t})}{\pi_{\text{ref}}(y_{i,t}|x,y_{i,<t})}\bigg)\Bigg]\end{split} \tag{10}
$$
where $r_{i,t}=\pi_{\theta}(y_{i,t}|x,y_{i,<t})/\pi_{\text{old}}(y_{i,t}|x,y_{i,<t})$ . The advantage $\hat{A}_{i}$ is group-normalized on $R(y_{i})$ , stabilizing training by emphasizing relative quality over absolute reward.
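The group normalization of $\hat{A}_{i}$ can be sketched as plain standardization of the $G$ rewards against their group mean and standard deviation, a common GRPO choice; this is an assumption of the sketch, not the paper's exact formula.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Standardize a group of G rollout rewards R(y_i) so that only
    relative quality within the group drives the advantage."""
    mu = mean(rewards)
    sigma = pstdev(rewards)           # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With identical rewards in a group, the advantages collapse to (near) zero, which is what stabilizes training when a prompt is uniformly easy or uniformly hard.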
5 Experiment
5.1 Experimental Setup
Models. We utilize the Qwen2.5-7B and Qwen2.5-14B (Qwen-Team, 2025) as our backbone architectures. These models serve as the initialization point for both the SFT and subsequent Policy Optimization stages.
Training Data. Our training corpus is constructed using three datasets: WebInstruct-Verified (Ma et al., 2025), K&K (Xie et al., 2024), and NuminaMath-TIR (LI et al., 2024). These sources provide a collection of diverse, verifiable reasoning tasks across multiple domains. We employ DeepSeek-R1 (Guo et al., 2025a) for data distillation and difficulty assessment via pass-rate. We utilize GPT-4o (Hurst et al., 2024) as a judge for answer correctness and Claude-Sonnet-4.5 (Anthropic, 2024) to synthesize the interleaved formal logic steps as detailed in Section 4. The RL data is selected based on the pass rate of answers generated by the teacher model DeepSeek-R1, retaining only questions with a pass rate below 50%. The categorical distribution of our curated training data is illustrated in Figure 7.
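The pass-rate filter for RL data selection can be sketched as below; `grade` and the data layout are hypothetical stand-ins for the GPT-4o answer-judging step, and the 50% threshold matches the text.

```python
def select_difficult(questions, samples_per_q, grade, threshold=0.5):
    """Retain only questions the teacher model solves in fewer than
    `threshold` of its sampled attempts."""
    selected = []
    for q in questions:
        answers = samples_per_q[q["id"]]          # teacher (DeepSeek-R1) rollouts
        pass_rate = sum(grade(a, q["gold"]) for a in answers) / len(answers)
        if pass_rate < threshold:
            selected.append(q)
    return selected
```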
Evaluation. We conduct a comprehensive evaluation across three distinct reasoning domains to assess models:
- Logical Reasoning: We employ KOR-Bench (Ma et al., 2024) to evaluate knowledge-grounded logical reasoning across diverse domains and BBH (Suzgun et al., 2023) for challenging tasks requiring multi-step deduction.
- Mathematical Reasoning: We evaluate performance on MATH-500 (Hendrycks et al., 2024) for competition-level mathematics problems and AIME 2024 for Olympiad-level mathematical reasoning challenges.
- General Reasoning: We utilize GPQA-Diamond (Rein et al., 2024) for graduate-level reasoning in subdomains including physics, chemistry, and biology. Additionally, we use TheoremQA (Chen et al., 2023) to assess graduate-level theorem application across mathematics, physics, Electrical Engineering & Computer Science (EE&CS), and Finance, testing the model's ability to correctly apply and reason with formal theorems.
All evaluations use OpenCompass (Contributors, 2023) with greedy decoding, except AIME24 which reports avg@16 from sampling runs following RLPR (Yu et al., 2025).
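For reference, the avg@16 metric reduces to averaging per-run accuracy over the 16 sampling runs; the `correct[i][j]` matrix layout below is an assumption of this sketch.

```python
def avg_at_k(correct):
    """correct[i][j] = 1 if sampling run j solved problem i, else 0.
    Returns the mean accuracy across the k runs."""
    n_problems = len(correct)
    k = len(correct[0])
    per_run_acc = [sum(row[j] for row in correct) / n_problems
                   for j in range(k)]
    return sum(per_run_acc) / k
```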
Baselines. To validate the effectiveness of our framework, we compare our approach against seven categories of models: (i) Base Models: Qwen2.5-7B/14B (Qwen-Team, 2025), establishing the baseline performance to measure the net gain of our training methodology. (ii) Simple-RL-Zoo (Zeng et al., 2025), a comprehensive collection of mathematics-focused RL models. (iii) General-Reasoner (Ma et al., 2025), a suite of general-domain RL models trained using a model-based verifier. (iv) RLPR-7B (Yu et al., 2025), a general-domain RL model trained via a simplified verifier-free framework. (v) SynLogic-7B (Liu et al., 2025b), a specialized model trained to enhance the logical reasoning capabilities of LLMs. (vi) ZeroTIR (Mai et al., 2025), a tool-integrated reasoning model specifically designed to execute Python code for solving mathematical problems. (vii) SimpleTIR (Xue et al., 2025), a multi-turn tool-integrated reasoning model for mathematical reasoning problems.
Implementation Details. We implement our RL training using the verl framework (Sheng et al., 2025a), following ToRL (Li et al., 2025b). SFT Stage: We use a learning rate of $1\times 10^{-5}$ with cosine scheduling and a global batch size of 32. The model is trained for 3 epochs. RL Stage: We employ a learning rate of $5\times 10^{-7}$ . We generate 8 rollouts per prompt with a temperature of 1.0. To prevent policy divergence, we set the KL coefficient to 0.05 and the clip ratio to 0.3. The training utilizes a batch size of 1024 and a context length of 16,384 tokens. Training proceeds for 120 steps on a cluster of 16 NVIDIA H800 GPUs. To manage computational overhead, we limit the formal verification process to a maximum of 4 iterative rounds.
5.2 Main Results
Table 2: We compare the performance of our proposed method (FLV) against natural language baselines across two training stages: SFT and GRPO. Natural-SFT/GRPO denotes models trained on the same data but without formal logic verification components. FLV-SFT/GRPO denotes our method incorporating formal logic modules and execution feedback.
| Model | KOR-Bench (Logical) | BBH (Logical) | MATH500 (Math.) | AIME24 (Math.) | GPQA-D (General) | TheoremQA (General) | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 13.2 | 41.9 | 60.2 | 6.5 | 29.3 | 29.1 | 30.0 |
| Natural-SFT | 30.4 | 55.9 | 56.6 | 8.5 | 27.3 | 39.1 | 36.3 |
| FLV-SFT (Ours) | 48.0 | 68.5 | 77.2 | 20.0 | 32.3 | 53.0 | 49.8 |
| Natural-RL | 35.7 | 55.4 | 54.4 | 4.8 | 30.3 | 41.2 | 37.0 |
| FLV-RL (Ours) | 51.0 | 70.0 | 78.6 | 20.8 | 35.4 | 55.7 | 51.9 |
Table 1 presents the comprehensive evaluation of our Formal Logic Verification (FLV) approach against standard baselines across Qwen2.5-7B and 14B scales.
Formal logic verification-guided methods outperform traditional natural language-based methods. As shown in the results, our proposed FLV framework demonstrates superior performance compared to standard natural language reasoning approaches. Notably, even our supervised fine-tuning stage (FLV-SFT) surpasses all comparative RL baselines on the 7B scale. On Qwen2.5-7B, FLV-SFT achieves an average score of 49.8, outperforming the strongest natural language baseline (RLPR, 47.0) by 2.8 points. This suggests that integrating formal logic verification during the SFT phase provides a more robust reasoning foundation than standard RL training on natural language chains alone. Furthermore, the application of Group Relative Policy Optimization (FLV-RL) yields consistent improvements over the SFT stage. For the 14B model, FLV-RL improves upon FLV-SFT by increasing the average score from 55.7 to 58.6, with significant gains in hard mathematical tasks like AIME 2024 (+8.3 points) and general theorem application in TheoremQA (+2.9 points). This confirms that our verifier-guided RL effectively refines the policy beyond the supervised baseline.
Formal logic verification unlocks model reasoning potential, achieving SOTA with limited data. Despite utilizing a concise training set (approx. 17k samples total), our approach establishes new state-of-the-art performance among models of similar size, significantly outperforming baselines that typically rely on larger-scale data consumption. (i) On the challenging AIME 2024 benchmark, our FLV-RL-14B model achieves 30.2%, nearly doubling the performance of the General-Reasoner baseline (17.5%) and far exceeding the Base model (3.6%). Similarly, on MATH-500, we achieve 81.4%, surpassing all baselines. (ii) We observe dominant performance on TheoremQA (63.5% on 14B), outperforming the nearest competitor by over 8 points. In logical reasoning (KOR-Bench), our method achieves a 15.7-point improvement over the General-Reasoner on the 14B scale (57.0 vs. 41.3). While FLV shows a slight weakness on GPQA-Diamond (likely due to benchmark reliability issues discussed in Appendix G), our method consistently excels in tasks requiring rigorous multi-step deduction and symbolic manipulation, validating the hypothesis that formal verification serves as a catalyst for deep reasoning capabilities.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Horizontal Bar Chart: SimpleTIR vs. FLV-RL Usage by Category
### Overview
The image is a horizontal bar chart comparing the percentage of usage between "SimpleTIR" and "FLV-RL" across three categories: "Other", "Numerical/Scientific", and "Symbolic/Logic". The chart uses a split bar format, with "SimpleTIR" represented by blue bars extending to the left from a central vertical line, and "FLV-RL" represented by red bars extending to the right.
### Components/Axes
* **Title:** There is no explicit title on the chart.
* **X-axis:** "Percentage (%)" ranging from 80% (left) to 80% (right), with 0% at the center. Increments of 20% are marked on both sides of the center.
* **Y-axis:** Categorical axis with three categories: "Other", "Numerical/Scientific", and "Symbolic/Logic". These categories are listed vertically from top to bottom.
* **Legend:** Located at the top of the chart. "SimpleTIR" is represented by a blue bar, and "FLV-RL" is represented by a red bar.
* **Directional Indicators:** At the bottom, "← SimpleTIR uses more" is written in blue, and "FLV-RL uses more →" is written in red, indicating the direction of higher usage for each system.
### Detailed Analysis
Here's a breakdown of the percentage values for each category and system:
* **Other:**
* SimpleTIR (blue): 35.7%
* FLV-RL (red): 17.1%
* **Numerical/Scientific:**
* SimpleTIR (blue): 21.8%
* FLV-RL (red): 20.4%
* **Symbolic/Logic:**
* SimpleTIR (blue): 42.5%
* FLV-RL (red): 62.5%
### Key Observations
* For "Other" and "Numerical/Scientific" categories, SimpleTIR has a higher percentage of usage compared to FLV-RL.
* For the "Symbolic/Logic" category, FLV-RL has a significantly higher percentage of usage compared to SimpleTIR.
### Interpretation
The chart illustrates a comparative analysis of the usage of two systems, SimpleTIR and FLV-RL, across different categories. SimpleTIR appears to be more prevalent in "Other" and "Numerical/Scientific" contexts, while FLV-RL dominates in "Symbolic/Logic" applications. The difference in "Symbolic/Logic" is particularly pronounced, suggesting that FLV-RL is significantly better suited or more frequently used for tasks within that category. The data suggests that the choice between SimpleTIR and FLV-RL may depend on the specific type of task or application.
</details>
Figure 4: Python packages type distribution invoked by SimpleTIR (blue) vs. FLV-RL (red) across three domains.
Formal verification instills a shift from calculation to symbolic reasoning, enabling superior generalization. While tool-integrated baselines like SimpleTIR primarily utilize tools as "solvers" for direct computation (achieving 41.0 on AIME24), this paradigm struggles with tasks requiring rigorous logical deduction. In contrast, our FLV framework employs formal methods as a "verifier" to enforce logical consistency. This approach yields dominant performance on logic-heavy benchmarks such as KOR-Bench (51.0 vs. 37.0 for SimpleTIR) and GPQA-Diamond (35.4 vs. 28.8 for ZeroTIR). To understand the mechanism behind this reliability, we analyze the distribution of invoked Python packages in Figure 4. The data reveals a distinct behavioral shift: whereas SimpleTIR relies significantly on generic utility packages (Other), FLV-RL demonstrates a massive surge in the usage of Symbolic/Logic libraries. These formal tools constitute 62.5% of FLV-RL's calls, a 20-point increase over SimpleTIR. Meanwhile, the usage of Numerical/Scientific libraries remains stable ( $\sim$ 21%), indicating that our method's gains are driven specifically by the adoption of symbolic logic engines to verify reasoning processes, rather than merely computing numerical answers. See Appendix 9 for the package categorization principles.
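A package breakdown of this kind could be computed from model transcripts as sketched below; the bucket memberships here are illustrative assumptions, and the paper's actual categorization principles live in the appendix.

```python
import re
from collections import Counter

# Assumed bucket memberships for the sketch; not the paper's exact lists.
SYMBOLIC = {"z3", "sympy", "itertools"}
NUMERICAL = {"numpy", "scipy", "math", "fractions"}

def categorize_imports(code):
    """Count imported top-level packages in a transcript's Python code,
    bucketed into the three categories of Figure 4."""
    counts = Counter()
    for m in re.finditer(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", code, re.M):
        pkg = m.group(1)
        if pkg in SYMBOLIC:
            counts["Symbolic/Logic"] += 1
        elif pkg in NUMERICAL:
            counts["Numerical/Scientific"] += 1
        else:
            counts["Other"] += 1
    return counts
```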
<details>
<summary>x5.png Details</summary>

### Visual Description
## Box Plot: Token Length Comparison
### Overview
The image is a box plot comparing the token lengths of three different models: General-Reasoner, SimpleTIR, and FLV-RL. The y-axis represents "Token length", and the x-axis represents the model names. The plot displays the distribution of token lengths for each model, showing the median, quartiles, and outliers.
### Components/Axes
* **X-axis:** Model names: General-Reasoner, SimpleTIR, FLV-RL
* **Y-axis:** Token length, with a scale from 0 to 20000. Gridlines are present at intervals of approximately 5000.
* **Box Plots:** Each box plot represents the distribution of token lengths for a specific model.
* The box represents the interquartile range (IQR), containing the middle 50% of the data.
* The line inside the box represents the median.
* The whiskers extend to the furthest data point within 1.5 times the IQR from the box.
* Outliers are represented as individual points beyond the whiskers.
* **Data Labels:** Numerical values are displayed near each box plot, indicating specific statistical values.
### Detailed Analysis
* **General-Reasoner (Green Box):**
* Median: Approximately 933
* Q1 (25th percentile): Approximately 562
* Q3 (75th percentile): Approximately 1344
* The box is relatively small, indicating a narrow range of token lengths.
* **SimpleTIR (Blue Box):**
* Median: Approximately 4352
* Q1 (25th percentile): Approximately 2828
* Q3 (75th percentile): Approximately 6985
* The box is larger than that of General-Reasoner, indicating a wider range of token lengths.
* **FLV-RL (Red Box):**
* Median: Approximately 6180
* Q1 (25th percentile): Approximately 3478
* Q3 (75th percentile): Approximately 9862
* The box is the largest, indicating the widest range of token lengths.
### Key Observations
* The median token length increases from General-Reasoner to SimpleTIR to FLV-RL.
* The interquartile range (IQR) also increases from General-Reasoner to SimpleTIR to FLV-RL, indicating greater variability in token lengths for the latter models.
* FLV-RL has a significantly larger range of token lengths compared to the other two models.
### Interpretation
The box plot illustrates the distribution of token lengths generated by three different models. General-Reasoner produces the shortest and most consistent token lengths, while FLV-RL generates the longest and most variable token lengths. SimpleTIR falls in between these two. This suggests that FLV-RL might be generating more complex or verbose outputs compared to the other models. The wider range of token lengths for FLV-RL could also indicate that it is more sensitive to variations in input or task complexity.
</details>
Figure 5: Token length distribution comparison across General-Reasoner, SimpleTIR, and FLV-RL. The box plots illustrate the median token usage (center line) and interquartile ranges.
Efficiency Analysis. We analyze the computational cost of our approach by comparing the token length distributions of the natural language baseline (General-Reasoner), tool-integrated reasoning (SimpleTIR), and our FLV-RL method (Figure 5). While FLV-RL incurs a moderate computational overhead, we argue that this cost is justified by the substantial performance gains observed across diverse domains. The increased token consumption represents a necessary trade-off for achieving breakthrough generalization and ensuring logical soundness in high-stakes reasoning tasks.
5.3 Ablation Studies
To evaluate the individual contributions of our proposed components, we conducted an ablation study examining two critical dimensions: the impact of FLV versus pure natural language reasoning, and the effectiveness of different training paradigms. Results are presented in Table 2.
Impact of formal logic verification. Comparing FLV-based models against natural language baselines trained on identical data reveals substantial improvements. FLV-SFT achieves 49.8% average accuracy versus 36.3% for Natural-SFT, with particularly strong gains on logic-intensive tasks (KOR-Bench: +17.6 points, TheoremQA: +13.9 points). This demonstrates that formal proofs and execution validation fundamentally improve reasoning by grounding outputs in verifiable logic rather than probabilistic patterns.
Impact of multi-stage training. We observe that supervised fine-tuning establishes a strong foundation, improving from 30.0% (Base) to 49.8% (FLV-SFT). Policy optimization yields further substantial gains to 51.9% (FLV-RL). Notably, the natural language baseline barely improves with RL (37.0% vs. 36.3%), while FLV-RL substantially outperforms FLV-SFT, indicating that formal verification provides more stable and reliable reward signals for policy optimization.
5.4 Verification Paradigm: Balancing Formalism and Computational Fluency
Our initial data construction enforced explicit verification outputs (e.g., proved / disproved) after each logical module. However, this rigid format introduced two critical issues: (i) formal language redundancy, and (ii) suppression of direct calculation. When computational verification was needed, models would bypass direct arithmetic in favor of indirect validation via z3-solver (e.g., asserting A + B == C is proved rather than computing the sum), significantly degrading mathematical performance. To address this, we adopted a flexible verification strategy that decouples calculation from validation: (i) Calculation as inference: Models invoke numerical tools directly during reasoning without mandatory verification keywords. (ii) Logic as validation: Formal verification serves as post-hoc validation rather than a per-step constraint. Figure 6 compares performance across logic, general, and math subsets under both paradigms. The flexible approach substantially improves math scores while preserving logical reasoning capability. Representative cases are detailed in Appendix E.
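The contrast can be illustrated with a toy two-step calculation. Here `solver_check` is a hypothetical stand-in for a z3-solver validation call; the point is only how often each paradigm pays for verification, not the arithmetic itself.

```python
# Counter so we can observe how many solver invocations each paradigm makes.
CALLS = {"solver": 0}

def solver_check(claim):
    CALLS["solver"] += 1
    return claim                     # a real pipeline would dispatch to z3 here

def enforced(a, b, c):
    # Enforced paradigm: every intermediate step must emit a proved/disproved
    # verdict, so even pure arithmetic goes through the solver.
    s1 = a + b
    assert solver_check(s1 == a + b)       # verify step 1
    s2 = s1 * c
    assert solver_check(s2 == s1 * c)      # verify step 2
    return s2

def flexible(a, b, c):
    # Flexible paradigm: calculation is treated as inference and executed
    # directly; formal verification is a single post-hoc validation.
    s2 = (a + b) * c
    assert solver_check(s2 == (a + b) * c)
    return s2
```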
<details>
<summary>x6.png Details</summary>

### Visual Description
## Bar Chart: Verification Paradigms and Performance Gains
### Overview
The image presents a two-part figure. Part (a) illustrates two verification paradigms: "Enforced" and "Flexible." Part (b) is a bar chart comparing the accuracy (%) of these two paradigms across three datasets: MATH500, BBH, and GPQA-D.
### Components/Axes
**Part (a): Verification Paradigms**
* **Title:** (a) Verification Paradigms
* **Paradigm 1:** Enforced (Steps: Step1, Verify (with lock icon), Step2, Verify (with lock icon))
* **Paradigm 2:** Flexible (Steps: Step1, calculation, Step2, Verify)
**Part (b): Performance Gains**
* **Title:** (b) Performance Gains
* **Y-axis:** Accuracy (%)
* **X-axis:** Datasets (MATH500, BBH, GPQA-D)
* **Legend:**
* Blue: Enforced
* Red: Flexible (Ours)
### Detailed Analysis
**Part (b): Performance Gains**
* **MATH500:**
* Enforced (Blue): 60.0%
* Flexible (Red): 71.0%
* **BBH:**
* Enforced (Blue): 51.3%
* Flexible (Red): 61.0%
* **GPQA-D:**
* Enforced (Blue): 29.8%
* Flexible (Red): 31.3%
**Trend Verification:**
* For each dataset, the "Flexible" paradigm (red) consistently shows higher accuracy than the "Enforced" paradigm (blue).
### Key Observations
* The "Flexible" paradigm consistently outperforms the "Enforced" paradigm across all three datasets.
* The performance difference between the two paradigms is most significant for the MATH500 dataset.
* The accuracy scores are generally lower for the GPQA-D dataset compared to MATH500 and BBH.
### Interpretation
The data suggests that the "Flexible" verification paradigm, as implemented by the authors ("Ours"), leads to performance gains in accuracy compared to the "Enforced" paradigm. This is consistent across all three datasets tested. The difference in performance may be attributed to the different verification steps outlined in part (a) of the figure. The "Enforced" paradigm includes a "Verify" step with a lock icon after both "Step1" and "Step2", while the "Flexible" paradigm includes a "calculation" step after "Step1" and a "Verify" step after "Step2". The "Flexible" paradigm's "calculation" step may allow for more adaptable or nuanced verification, leading to higher accuracy. The lower accuracy scores on the GPQA-D dataset may indicate that this dataset is inherently more challenging for both paradigms.
</details>
Figure 6: Enforced vs. Flexible Verification Paradigms. (a) Enforced verification imposes rigid checkpoints throughout the reasoning process, while flexible verification enables adaptive utilization of logic verification. (b) Performance gains after switching to flexible reasoning across three representative benchmarks.
6 Conclusion
In this work, we addressed the fundamental tension between probabilistic language generation and logical consistency in LLM reasoning by introducing a framework that dynamically integrates formal logic verification into the reasoning process. Through our two-stage training methodology combining FLV-SFT's rigorous data synthesis pipeline with formal logic verification-guided policy optimization, we demonstrated that real-time symbolic feedback can effectively mitigate logical fallacies that plague standard Chain-of-Thought approaches. Empirical evaluation across six diverse benchmarks validates our approach, with our 7B and 14B models achieving average improvements of 10.4% and 14.2% respectively over SOTA baselines, while providing interpretable step-level correctness guarantees. Beyond performance gains, our framework establishes a principled foundation for trustworthy reasoning systems by bridging neural fluency with symbolic rigor, thereby enabling more robust logical inference. This opens pathways toward more reliable AI that scales effectively to complex real-world problems across domains requiring strict logical soundness.
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
- J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin (2024) Large language models for mathematical reasoning: progresses and challenges. arXiv preprint arXiv:2402.00157. Cited by: §1.
- A. Anthropic (2024) Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card. External Links: Link Cited by: §5.1.
- B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y. Guan, A. Madry, W. Zaremba, J. Pachocki, and D. Farhi (2025) Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926. Cited by: §1.
- C. Cao, M. Li, J. Dai, J. Yang, Z. Zhao, S. Zhang, W. Shi, C. Liu, S. Han, and Y. Guo (2025a) Towards advanced mathematical reasoning for llms via first-order logic theorem proving. arXiv preprint arXiv:2506.17104. Cited by: §2.2.
- C. Cao, H. Zhu, J. Ji, Q. Sun, Z. Zhu, Y. Wu, J. Dai, Y. Yang, S. Han, and Y. Guo (2025b) SafeLawBench: towards safe alignment of large language models. arXiv preprint arXiv:2506.06636. Cited by: §1.
- J. Chen, Q. He, S. Yuan, A. Chen, Z. Cai, W. Dai, H. Yu, Q. Yu, X. Li, J. Chen, et al. (2025a) Enigmata: scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914. Cited by: §1.
- W. Chen, M. Yin, M. Ku, P. Lu, Y. Wan, X. Ma, J. Xu, X. Wang, and T. Xia (2023) Theoremqa: a theorem-driven question answering dataset. arXiv preprint arXiv:2305.12524. Cited by: 3rd item.
- Y. Chen, J. Benton, A. Radhakrishnan, J. Uesato, C. Denison, J. Schulman, A. Somani, P. Hase, M. Wagner, F. Roger, et al. (2025b) Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410. Cited by: §1.
- Z. Chen, J. Yang, T. Xiao, R. Zhou, L. Zhang, X. Xi, X. Shi, W. Wang, and J. Wang (2025c) Reinforcement learning for tool-integrated interleaved thinking towards cross-domain generalization. arXiv preprint arXiv:2510.11184. Cited by: §2.1.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §2.1.
- O. Contributors (2023) OpenCompass: a universal evaluation platform for foundation models. Note: https://github.com/open-compass/opencompass Cited by: §5.1.
- J. Feng, S. Huang, X. Qu, G. Zhang, Y. Qin, B. Zhong, C. Jiang, J. Chi, and W. Zhong (2025a) Retool: reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536. Cited by: §2.1, §2.2.
- Y. Feng, N. Weir, K. Bostrom, S. Bayless, D. Cassel, S. Chaudhary, B. Kiesl-Reiter, and H. Rangwala (2025b) VeriCoT: neuro-symbolic chain-of-thought validation via logical consistency checks. arXiv preprint arXiv:2511.04662. Cited by: §1, §2.2.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025a) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §5.1.
- J. Guo, Z. Chi, L. Dong, Q. Dong, X. Wu, S. Huang, and F. Wei (2025b) Reward reasoning model. arXiv preprint arXiv:2505.14674. Cited by: §1.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), External Links: Link Cited by: §2.1.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2024) Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874. Cited by: 2nd item.
- J. Hu, J. Zhang, Y. Zhao, and T. Ringer (2025) HybridProver: augmenting theorem proving with llm-driven proof synthesis and refinement. arXiv preprint arXiv:2505.15740. Cited by: §2.2.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) Gpt-4o system card. arXiv preprint arXiv:2410.21276. Cited by: §5.1.
- Y. Ji, X. Tian, S. Zhao, H. Wang, S. Chen, Y. Peng, H. Zhao, and X. Li (2025) AM-thinking-v1: advancing the frontier of reasoning at 32b scale. arXiv preprint arXiv:2505.08311. Cited by: §1.
- R. Kamoi, Y. Zhang, N. Zhang, S. S. S. Das, and R. Zhang (2025) Training step-level reasoning verifiers with formal verification tools. arXiv preprint arXiv:2505.15960. Cited by: §1, §2.2.
- M. Khalifa, R. Agarwal, L. Logeswaran, J. Kim, H. Peng, M. Lee, H. Lee, and L. Wang (2025) Process reward models that think. arXiv preprint arXiv:2504.16828. Cited by: §2.1.
- H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi (2022) Coderl: mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35, pp. 21314–21328. Cited by: §2.1.
- J. O. J. Leang, G. Hong, W. Li, and S. B. Cohen (2025) Theorem prover as a judge for synthetic data generation. arXiv preprint arXiv:2502.13137. Cited by: §1.
- C. Li, Z. Tang, Z. Li, M. Xue, K. Bao, T. Ding, R. Sun, B. Wang, X. Wang, J. Lin, et al. (2025a) CoRT: code-integrated reasoning within thinking. arXiv preprint arXiv:2506.09820. Cited by: §2.2.
- J. LI, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y. Fleureau, G. Lample, and S. Polu (2024) NuminaMath. Numina. Note: [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf) Cited by: §2.1, §5.1.
- J. Li and H. T. Ng (2025) The hallucination dilemma: factuality-aware reinforcement learning for large reasoning models. arXiv preprint arXiv:2505.24630. Cited by: §1, §2.2.
- X. Li, H. Zou, and P. Liu (2025b) ToRL: scaling tool-integrated rl. External Links: 2503.23383, Link Cited by: §2.2, §5.1.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let's verify step by step. In The Twelfth International Conference on Learning Representations, Cited by: §2.1, §2.2.
- C. Liu, Y. Yuan, Y. Yin, Y. Xu, X. Xu, Z. Chen, Y. Wang, L. Shang, Q. Liu, and M. Zhang (2025a) Safe: enhancing mathematical reasoning in large language models via retrospective step-aware formal verification. arXiv preprint arXiv:2506.04592. Cited by: §1, §2.2.
- J. Liu, Y. Fan, Z. Jiang, H. Ding, Y. Hu, C. Zhang, Y. Shi, S. Weng, A. Chen, S. Chen, et al. (2025b) SynLogic: synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv preprint arXiv:2505.19641. Cited by: §1, §5.1.
- S. Liu, H. Liu, J. Liu, L. Xiao, S. Gao, C. Lyu, Y. Gu, W. Zhang, D. F. Wong, S. Zhang, et al. (2025c) Compassverifier: a unified and robust verifier for llms evaluation and outcome reward. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 33454–33482. Cited by: §1.
- K. Ma, X. Du, Y. Wang, H. Zhang, Z. Wen, X. Qu, J. Yang, J. Liu, M. Liu, X. Yue, W. Huang, and G. Zhang (2024) KOR-bench: benchmarking language models on knowledge-orthogonal reasoning tasks. External Links: 2410.06526, Link Cited by: 1st item.
- X. Ma, Q. Liu, D. Jiang, G. Zhang, Z. Ma, and W. Chen (2025) General-reasoner: advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652. Cited by: Appendix C, §1, §2.1, §5.1, §5.1.
- X. Mai, H. Xu, W. Wang, Y. Zhang, W. Zhang, et al. (2025) Agentic rl scaling law: spontaneous code execution for mathematical problem solving. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Cited by: §5.1.
- A. Ospanov, Z. Feng, J. Sun, H. Bai, X. Shen, and F. Farnia (2025) HERMES: towards efficient and verifiable mathematical reasoning in llms. arXiv preprint arXiv:2511.18760. Cited by: §1, §2.2.
- Qwen-Team (2025) Qwen2.5 technical report. External Links: 2412.15115, Link Cited by: §5.1, §5.1.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, Cited by: 3rd item.
- Z. Ren, Z. Shao, J. Song, H. Xin, H. Wang, W. Zhao, L. Zhang, Z. Fu, Q. Zhu, D. Yang, et al. (2025) Deepseek-prover-v2: advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801. Cited by: §2.2.
- S. She, J. Liu, Y. Liu, J. Chen, X. Huang, and S. Huang (2025) R-prm: reasoning-driven process reward modeling. arXiv preprint arXiv:2503.21295. Cited by: §2.1.
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2025a) Hybridflow: a flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp. 1279â1297. Cited by: §5.1.
- J. Sheng, L. Lyu, J. Jin, T. Xia, A. Gu, J. Zou, and P. Lu (2025b) Solving inequality proofs with large language models. arXiv preprint arXiv:2506.07927. Cited by: §1, §1.
- J. Skalse, N. Howe, D. Krasheninnikov, and D. Krueger (2022) Defining and characterizing reward gaming. Advances in Neural Information Processing Systems 35, pp. 9460–9471. Cited by: §1, §2.1.
- M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, et al. (2023) Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051. Cited by: 1st item.
- Y. Tian, R. Huang, X. Wang, J. Ma, Z. Huang, Z. Luo, H. Lin, D. Zheng, and L. Du (2025) EvolProver: advancing automated theorem proving by evolving formalized problems via symmetry and difficulty. arXiv preprint arXiv:2510.00732. Cited by: §2.2.
- J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins (2022) Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275. Cited by: §2.1, §2.2.
- H. Wang, M. Unsal, X. Lin, M. Baksys, J. Liu, M. D. Santos, F. Sung, M. Vinyes, Z. Ying, Z. Zhu, et al. (2025) Kimina-prover preview: towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354. Cited by: §2.2.
- Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. (2022) Super-naturalinstructions: generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705. Cited by: §2.1.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837. Cited by: §2.1.
- C. Xie, Y. Huang, C. Zhang, D. Yu, X. Chen, B. Y. Lin, B. Li, B. Ghazi, and R. Kumar (2024) On memorization of large language models in logical reasoning. External Links: 2410.23123, Link Cited by: §5.1.
- Z. Xu, Y. Liu, Y. Yin, M. Zhou, and R. Poovendran (2025) Kodcode: a diverse, challenging, and verifiable synthetic dataset for coding. arXiv preprint arXiv:2503.02951. Cited by: §2.1.
- Z. Xue, L. Zheng, Q. Liu, Y. Li, X. Zheng, Z. Ma, and B. An (2025) Simpletir: end-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479. Cited by: §2.2, §5.1.
- K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and A. Anandkumar (2023) Leandojo: theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems 36, pp. 21573–21612. Cited by: §2.2.
- T. Yu, B. Ji, S. Wang, S. Yao, Z. Wang, G. Cui, L. Yuan, N. Ding, Y. Yao, Z. Liu, et al. (2025) RLPR: extrapolating rlvr to general domains without verifiers. arXiv preprint arXiv:2506.18254. Cited by: §5.1, §5.1.
- S. Zeng (2026) External Links: Link Cited by: Appendix G.
- W. Zeng, Y. Huang, Q. Liu, W. Liu, K. He, Z. Ma, and J. He (2025) Simplerl-zoo: investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892. Cited by: §2.1, §2.2, §5.1.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36, pp. 46595–46623. Cited by: §2.1.
- K. Zhou, C. Liu, X. Zhao, S. Jangam, J. Srinivasa, G. Liu, D. Song, and X. E. Wang (2025) The hidden risks of large reasoning models: a safety assessment of r1. arXiv preprint arXiv:2502.12659. Cited by: §1.
- Y. Zhou, C. Cao, J. Yang, L. Wu, C. He, S. Han, and Y. Guo (2026) LRAS: advanced legal reasoning with agentic search. arXiv preprint arXiv:2601.07296. Cited by: §2.1.
Limitations
Despite significant improvements in logical reasoning capabilities, our framework faces two primary limitations. First, integrating real-time formal verification introduces computational overhead, increasing RL training time by approximately 2× compared to standard baselines. However, this cost is acceptable given the substantial performance gains (10.4–14.2% improvement) and superior data efficiency: we achieve comparable results using only a fraction of the training data required by existing methods, so reduced data collection costs offset the increased training time. Second, our data synthesis pipeline faces formalization challenges when translating natural language into verifiable formal representations. While conversion success rates are high in structured domains such as mathematics and logic, ambiguous or commonsense-heavy descriptions may produce mapping errors that generate incorrect verification feedback. This limits generalizability to open-ended reasoning tasks and highlights the need for more robust auto-formalization techniques.
Appendix A Reward Calculation Pseudocode
Table 3: Hierarchical reward function for formal logic verification-guided policy optimization
| Input: Output $y$ , Ground truth answer $a^{*}$ , Predicted answer $\hat{a}$ |
| --- |
| Output: Total reward $R(y)$ |
| Hyperparameters: |
| $W=3$ (correctness weight) |
| $\gamma_{\text{struct}}=3.0$ , $\beta_{\text{struct}}=1.0$ (severity penalties) |
| $\alpha=1.0$ (base structural score) |
| $\lambda_{\text{tag}}=0.005$ , $\tau_{\text{tag}}=200$ (tag penalty coefficients) |
| $\lambda_{\text{call}}=0.5$ , $N_{\text{max}}=3$ (tool call limits) |
| $\lambda_{\text{len}}=0.04$ , $\delta_{\text{max}}=10$ (length penalty) |
| Step 1: Check Fatal Errors ( $\mathbb{C}_{\text{fatal}}$ ) |
| if token-level repetition detected or |
| execution timeout or |
| tool calls $>2\times N_{\text{max}}$ or |
| multiple termination tags then |
| return $R(y)=-\gamma_{\text{struct}}-W=-6.0$ |
| Step 2: Check Invalid Format ( $\mathbb{C}_{\text{invalid}}\setminus\mathbb{C}_{\text{fatal}}$ ) |
| if solution extraction fails or |
| solution length $>512$ tokens or |
| missing closing tag or |
| $N_{\text{max}}<$ tool calls $\le 2\times N_{\text{max}}$ then |
| return $R(y)=-\beta_{\text{struct}}-W=-4.0$ |
| Step 3: Compute Structural Reward $R_{\text{struct}}(y)$ |
| $N_{\text{undef}}=$ count of undefined tags |
| $N_{\text{call}}=$ count of tool invocations |
| $R_{\text{struct}}(y)=\alpha-\lambda_{\text{tag}}\cdot\min(N_{\text{undef}},\tau_{\text{tag}})$ |
| $-\lambda_{\text{call}}\cdot\max(N_{\text{call}}-N_{\text{max}},0)$ |
| Step 4: Compute Correctness Reward $R_{\text{correct}}(y)$ |
| $f_{\text{len}}(\hat{a},a^{*})=\min(|\text{len}(\hat{a})-\text{len}(a^{*})|,\delta_{\text{max}})$ |
| if $\hat{a}$ matches $a^{*}$ then |
| $R_{\text{correct}}(y)=W-\lambda_{\text{len}}\cdot f_{\text{len}}(\hat{a},a^{*})$ |
| else |
| $R_{\text{correct}}(y)=-W$ |
| Step 5: Compute Total Reward |
| $R(y)=R_{\text{struct}}(y)+R_{\text{correct}}(y)$ |
Table 3 provides the complete algorithmic implementation of our multi-component reward function used in FLV-RL training. The pseudocode details the step-by-step computation of format rewards, correctness rewards, and formal verification rewards, including all constraint checks and penalty mechanisms described in Section 4.
The time complexity for calculating the reward $R(y)$ is dominated by the verification of structural constraints and semantic correctness. Let $L$ denote the length of the generated response $y$ in tokens. The initial screening for pathological states ( $\mathbb{C}_{\text{fatal}}$ ) and invalid formats ( $\mathbb{C}_{\text{invalid}}$ ) requires a linear scan of the output tokens to detect repetition loops, count tool invocations ( $N_{\text{call}}$ ), and validate tags, resulting in $O(L)$ complexity. If the response is valid, computing $R_{\text{struct}}(y)$ involves constant-time arithmetic operations after the initial scan. The semantic verification $R_{\text{correct}}(y)$ depends on the evaluation metric; assuming string matching or metric comparison between the extracted answer $\hat{a}$ and ground truth $a^{*}$ , this step operates in $O(|\hat{a}|+|a^{*}|)$ . Therefore, the total time complexity per generation is $O(L)$ , ensuring the reward calculation remains efficient and does not introduce significant computational overhead during training.
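To make the computation in Table 3 concrete, the following is a minimal Python sketch of the hierarchical reward. It is an illustration under simplifying assumptions, not our release code: boolean flags stand in for the paper's fatal-error and invalid-format detectors, and answers are compared by exact string match rather than by the model-based verifier.

```python
# Hierarchical reward from Table 3 (illustrative sketch, not release code).
W = 3                  # correctness weight
GAMMA_STRUCT = 3.0     # fatal-error severity penalty
BETA_STRUCT = 1.0      # invalid-format severity penalty
ALPHA = 1.0            # base structural score
LAM_TAG, TAU_TAG = 0.005, 200   # tag penalty coefficient and cap
LAM_CALL, N_MAX = 0.5, 3        # tool-call penalty and limit
LAM_LEN, DELTA_MAX = 0.04, 10   # length penalty and cap

def reward(pred, gold, *, fatal=False, invalid=False, n_undef=0, n_call=0):
    # Step 1: fatal errors (repetition loops, timeouts, > 2*N_MAX calls, ...)
    if fatal:
        return -GAMMA_STRUCT - W
    # Step 2: invalid but recoverable format violations
    if invalid:
        return -BETA_STRUCT - W
    # Step 3: structural reward with capped tag / excess-call penalties
    r_struct = (ALPHA
                - LAM_TAG * min(n_undef, TAU_TAG)
                - LAM_CALL * max(n_call - N_MAX, 0))
    # Step 4: correctness reward with a capped length-difference penalty
    f_len = min(abs(len(pred) - len(gold)), DELTA_MAX)
    r_correct = W - LAM_LEN * f_len if pred == gold else -W
    # Step 5: total reward
    return r_struct + r_correct

reward("27", "27", n_call=2)    # correct, within limits: 1.0 + 3.0 = 4.0
reward("81", "27", n_undef=50)  # wrong, 50 stray tags: 0.75 - 3 = -2.25
```

The single linear scan needed to populate `n_undef` and `n_call` is what keeps the overall cost at $O(L)$ per generation.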
Appendix B Dataset Construction Details
Our data construction pipeline systematically processes a reasoning question through multiple stages of generation, logic extraction, formal translation, and verification to create high-quality training data for supervised fine-tuning. The resulting dataset combines natural language reasoning with formally verified logical modules, enabling our models to learn both human-readable reasoning patterns and mathematically sound logical validation.
Table 4 presents a comprehensive analysis of the execution-based validation outcomes across our merged dataset of 9,162 reasoning chains (a subset of the full training data). The match rate distribution reveals that a substantial majority (59.55%) of the generated formal proofs achieve perfect alignment with expected outputs ($v_{k}^{\text{act}}=v_{k}^{\text{exp}}$), successfully passing Stage 1 validation without requiring further verification. An additional 26.53% of proofs fall within the high-confidence range (95–100% match rate), indicating strong but imperfect alignment that necessitates semantic equivalence checking in Stage 2.
Notably, 40.45% of the dataset exhibits match rates below 100%, triggering our multi-stage validation pipeline. Among these cases, the consistency distribution (lower panel) demonstrates that 62.22% (2,306 instances) maintain semantic equivalence despite surface-level discrepancies, successfully recovering through Stage 2 verification or Stage 3 proof rewriting. The remaining 37.78% (1,400 instances) represent fundamental misalignments that are discarded from the training corpus. This stratified validation approach ensures that our final dataset preserves both syntactic precision and semantic coherence, with an approximately 84.7% overall retention rate ($=59.55\%+40.45\%\times 62.22\%$) after rigorous quality control.
The distribution pattern further reveals that only 0.83% of proofs exhibit critically low match rates (below 60%), suggesting that the initial CoT generation process produces predominantly high-quality candidates. This validates our teacher modelâs effectiveness while highlighting the necessity of execution-based verification to capture subtle logical inconsistencies that may elude purely language-based evaluation.
Table 4: Match rate and consistency analysis.
| Match Rate Range (%) | Count | Percentage (%) |
| --- | --- | --- |
| 0–20 | 1 | 0.01 |
| 20–40 | 11 | 0.12 |
| 40–60 | 64 | 0.70 |
| 60–70 | 63 | 0.69 |
| 70–80 | 168 | 1.83 |
| 80–85 | 149 | 1.63 |
| 85–90 | 269 | 2.94 |
| 90–95 | 551 | 6.01 |
| 95–99 | 1,402 | 15.30 |
| 99–100 (excl.) | 1,028 | 11.22 |
| 100 | 5,456 | 59.55 |
| Total | 9,162 | 100.00 |
| Consistency Distribution (when match rate $\neq$ 100%) | | |
| Consistent | Count | Percentage (%) |
| No | 1,400 | 37.78 |
| Yes | 2,306 | 62.22 |
| Total | 3,706 | 100.00 |
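As a sanity check, the overall retention rate follows directly from the Table 4 counts; a quick illustrative computation:

```python
# Retention after the three-stage validation, using the Table 4 counts.
total = 9162
perfect_match = 5456   # match rate == 100%: pass Stage 1 directly
recovered = 2306       # match rate < 100% but semantically consistent
discarded = 1400       # inconsistent: dropped from the corpus

assert perfect_match + recovered + discarded == total
retained = perfect_match + recovered
print(retained, round(100 * retained / total, 1))  # 7762 84.7
```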
<details>
<summary>x7.png Details</summary>

### Visual Description
## Chart Type: Pie Charts Comparing Datasets
### Overview
The image presents two pie charts side-by-side, comparing the distribution of samples across different categories in two datasets: "SFT Dataset" and "RL Dataset." The SFT Dataset contains a total of 14,117 samples, while the RL Dataset contains 3,525 samples. Each slice of the pie charts represents a category (e.g., Tool Use, Mathematics, Physics), and the size of the slice corresponds to the percentage of samples belonging to that category. The legend below each pie chart lists the categories and the number of samples in each category.
### Components/Axes
* **Titles:**
* Left Pie Chart: "SFT Dataset (Total: 14,117 samples)"
* Right Pie Chart: "RL Dataset (Total: 3,525 samples)"
* **Categories (SFT Dataset):**
* Tool Use
* Mathematics
* Physics
* Chemistry
* Economics
* Puzzle
* Other
* Business
* Finance
* History
* Biology
* Computer Science
* Psychology
* Other STEM
* Philosophy
* Engineering
* Law
* Health
* **Categories (RL Dataset):**
* Mathematics
* Economics
* Physics
* Business
* Finance
* Chemistry
* Engineering
* Biology
* Computer Science
* Health
* Other
* Other STEM
* Philosophy
* History
* Law
* Psychology
* **Percentages:** Displayed on each slice of the pie charts.
* **Legends:** Located below each pie chart, listing the categories and the number of samples in each.
### Detailed Analysis
**SFT Dataset:**
* **Tool Use:** Light Green, 35.2%, 4964 samples
* **Mathematics:** Light Blue, 19.3%, 2728 samples
* **Physics:** Blue, 11.6%, 1640 samples
* **Chemistry:** Green, 5.1%, 722 samples
* **Economics:** Purple, 2.4%, 343 samples
* **Puzzle:** Pink, 14.0%, 1977 samples
* **Other:** Yellow, 1.3%, 180 samples
* **Business:** Orange, 4.8%, 682 samples
* **Finance:** Light Pink, 2.8%, 395 samples
* **History:** Dark Yellow, 1.1%, 159 samples
* **Biology:** Dark Teal, 1.1%, 159 samples
* **Computer Science:** Red, 0.3%, 36 samples
* **Psychology:** Dark Orange, 0.3%, 41 samples
* **Other STEM:** Light Purple, 0.2%, 25 samples
* **Philosophy:** Salmon, 0.1%, 15 samples
* **Engineering:** Brown, 0.1%, 8 samples
* **Law:** Teal, 0.1%, 13 samples
* **Health:** Gray, 0.2%, 30 samples
**RL Dataset:**
* **Mathematics:** Light Blue, 23.0%, 809 samples
* **Economics:** Purple, 5.6%, 199 samples
* **Physics:** Blue, 32.5%, 1147 samples
* **Business:** Orange, 11.0%, 388 samples
* **Finance:** Light Pink, 10.0%, 351 samples
* **Chemistry:** Green, 14.5%, 512 samples
* **Engineering:** Brown, 0.6%, 21 samples
* **Biology:** Dark Teal, 0.9%, 31 samples
* **Computer Science:** Red, 0.3%, 12 samples
* **Health:** Gray, 0.1%, 5 samples
* **Other:** Yellow, 0.6%, 20 samples
* **Other STEM:** Light Purple, 0.3%, 9 samples
* **Philosophy:** Salmon, 0.2%, 6 samples
* **History:** Dark Yellow, 0.2%, 7 samples
* **Law:** Teal, 0.2%, 7 samples
* **Psychology:** Dark Orange, 0.03%, 1 samples
### Key Observations
* In the SFT Dataset, "Tool Use" constitutes the largest portion (35.2%), followed by "Mathematics" (19.3%) and "Physics" (11.6%).
* In the RL Dataset, "Physics" constitutes the largest portion (32.5%), followed by "Mathematics" (23.0%) and "Chemistry" (14.5%).
* The distribution of categories is significantly different between the two datasets. For example, "Tool Use" is a major category in the SFT Dataset but is not present in the RL Dataset.
* The "Other," "Other STEM," "Philosophy," "History," "Law," and "Psychology" categories have very small percentages in both datasets.
### Interpretation
The pie charts illustrate the composition of the SFT and RL datasets in terms of different categories. The significant differences in the distribution of categories between the two datasets suggest that they are designed for different purposes or represent different types of data. The SFT dataset seems to be heavily focused on "Tool Use," while the RL dataset emphasizes "Physics" and "Mathematics." The presence of categories like "Puzzle" in the SFT dataset and the varying proportions of other categories indicate different data collection or generation strategies. The small percentages of some categories in both datasets might indicate that these categories are less relevant or less frequently encountered in the respective contexts.
</details>
Figure 7: Distribution of data categories in the training sets. Left: categorical breakdown of the SFT dataset (14,117 samples in total). Right: categorical breakdown of the RL dataset (3,525 samples in total). Legends list the exact number of samples for each category.
Appendix C Hyperparameter Specification
We carefully calibrate the reward function hyperparameters to balance the learning of hybrid reasoning patterns with answer correctness, while preventing pathological behaviors during training.
Correctness vs. Format Balance. We set the correctness weight $W=3$ and base structural score $\alpha=1.0$ to ensure the model balances learning the hybrid reasoning paradigm with exploring correct answers under this new framework. This 3:1 ratio encourages the model to prioritize semantic accuracy while maintaining proper integration of natural language and formal verification components.
Fatal Error Prevention. Given the complexity of hybrid reasoning, models in early training stages are prone to generating pathological outputs such as repetitive tokens or malformed structures. We therefore define such cases as fatal errors with the maximum penalty $\gamma_{\text{struct}}=3.0$ , while setting $\beta_{\text{struct}}=1.0$ for invalid but recoverable format violations. This hierarchical penalty structure strongly discourages catastrophic failures while allowing the model to explore within reasonable boundaries.
Tag Usage Regulation. In hybrid reasoning, the model must learn our defined tag vocabulary (e.g., <code>, <interpreter>). During training, the model may explore alternative tags, which we discourage through $\lambda_{\text{tag}}=0.005$ with a cap at $\tau_{\text{tag}}=200$ . This cap prevents tag-related penalties from overwhelming correctness rewards, ensuring balanced learning of both answer accuracy and proper tag usage.
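A quick check of this balance (a sketch using the stated coefficients, not release code): the cap bounds the total tag penalty strictly below the correctness weight.

```python
# Worst-case tag penalty is lam_tag * tau_tag, by the cap in R_struct.
lam_tag, tau_tag, W = 0.005, 200, 3
max_tag_penalty = lam_tag * tau_tag
print(max_tag_penalty)      # 1.0
print(max_tag_penalty < W)  # True: the correctness reward always dominates
```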
Formal Verification Efficiency. Our data synthesis analysis reveals that most problems can be solved within 4 formal verification steps. To optimize inference efficiency and reduce unnecessary formalization, we set $N_{\text{max}}=3$ as the baseline limit with penalty coefficient $\lambda_{\text{call}}=0.5$ . Responses exceeding this threshold incur incremental penalties, while those exceeding $2\times N_{\text{max}}=6$ calls are classified as fatal errors, strongly discouraging excessive tool invocations.
Length Control. Following prior work on general reasoners (Ma et al., 2025), we apply length-based penalties to discourage excessively verbose generations. We set $\lambda_{\text{len}}=0.04$ with a maximum cap $\delta_{\text{max}}=10$ , and enforce a hard limit of 512 tokens for extracted solutions. This encourages concise, focused reasoning without sacrificing completeness.
Model-Based Verifier. Manual review reveals that diverse domain problems beyond mathematics cannot be verified through rule-based methods. We therefore employ a model-based verifier. Empirical evaluation shows that CompassVerifier-7B achieves an optimal balance between accuracy and efficiency, leading to its selection as our verifier.
Appendix D Training Dynamics and Behavior Analysis
We analyze key training dynamics across 120 optimization steps in Figure 8.
Reward Evolution. Figure 8(a) shows steady improvement from -0.45 to -0.1, indicating consistent progress in our composite reward function balancing structural integrity, semantic correctness, and efficiency.
Response Length. Figure 8(b) exhibits a U-shaped pattern: initial decrease from 3100 to 2850 tokens as the model eliminates redundancy, followed by stabilization around 3100-3200 tokens, reflecting an optimal balance between completeness and conciseness.
Formal Logic Verification Efficiency. Figure 8(c) demonstrates rapid improvement from 2.3 to 1.9 in the first 40 steps, then gradual stabilization. This shows the model learns to generate more efficiently verifiable proofs with fewer symbolic interpreter calls.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Performance over Steps
### Overview
The image is a line chart showing the performance over steps. The x-axis represents the "Step" number, and the y-axis represents a performance metric ranging from approximately -0.45 to -0.1. The blue line shows an upward trend, indicating improving performance as the step number increases.
### Components/Axes
* **X-axis:** "Step" with markers at approximately 20, 40, 60, 80, 100, and 120.
* **Y-axis:** Numerical values ranging from -0.4 to -0.1, with markers at -0.4, -0.3, -0.2, and -0.1.
* **Data Series:** A single blue line representing the performance metric.
### Detailed Analysis
* **Blue Line (Performance):** The blue line starts at approximately -0.44 at step 0. It generally slopes upward, indicating an improvement in performance as the step number increases. The line fluctuates, showing variability in performance at each step.
* At step 20, the value is approximately -0.3.
* At step 40, the value is approximately -0.22.
* At step 60, the value is approximately -0.3.
* At step 80, the value is approximately -0.18.
* At step 100, the value is approximately -0.15.
* At step 120, the value is approximately -0.1.
### Key Observations
* The performance generally improves over the steps.
* There are fluctuations in performance at each step.
* The rate of improvement appears to decrease as the step number increases.
### Interpretation
The chart suggests that the system or model being evaluated is learning or improving its performance as it progresses through the steps. The fluctuations indicate that the performance is not consistently improving, but the overall trend is positive. The decreasing rate of improvement might suggest that the system is approaching a performance limit or that further optimization is needed.
</details>
(a) Reward
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Time Series Data
### Overview
The image is a line chart displaying a time series. The x-axis represents "Step" and the y-axis represents an unspecified value ranging from approximately 2800 to 3300. The blue line shows a fluctuating trend, initially decreasing, then sharply increasing, and finally stabilizing with continued fluctuations.
### Components/Axes
* **X-axis:** Labeled "Step". The axis ranges from approximately 0 to 120, with tick marks at intervals of 20.
* **Y-axis:** The y-axis ranges from 2800 to 3300, with tick marks at intervals of 100.
* **Data Series:** A single blue line represents the data series. There is no legend.
### Detailed Analysis
The blue line representing the data series exhibits the following behavior:
* **Steps 0-55:** The line shows a decreasing trend with fluctuations. It starts around 3120 and decreases to a minimum value of approximately 2750 at step 55.
* **Steps 55-75:** The line shows a sharp increasing trend. It increases from approximately 2750 to a maximum value of approximately 3300 at step 75.
* **Steps 75-125:** The line shows a relatively stable trend with fluctuations. It fluctuates between approximately 3050 and 3250.
Specific data points (approximate):
* Step 0: 3120
* Step 20: 2950
* Step 40: 3000
* Step 55: 2750
* Step 75: 3300
* Step 90: 3150
* Step 110: 3080
* Step 125: 3150
### Key Observations
* The data series experiences a significant drop before step 60.
* A sharp increase occurs between steps 60 and 75.
* The data stabilizes after step 75, fluctuating within a narrower range.
### Interpretation
The chart illustrates a time series with three distinct phases: a decline, a rapid increase, and a stabilization period. The initial decline might represent a period of instability or learning, the rapid increase could indicate a significant event or intervention, and the stabilization suggests a state of equilibrium or convergence. Without further context, the specific meaning of the "Step" and the y-axis value remains unclear, but the chart clearly demonstrates a dynamic process with notable transitions.
</details>
(b) Response Length
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: Loss over Steps
### Overview
The image is a line chart showing the trend of a loss function over a number of steps. The x-axis represents the "Step" number, and the y-axis represents the loss value. The loss generally decreases over time, with some fluctuations.
### Components/Axes
* **X-axis:** "Step". The x-axis ranges from approximately 0 to 120, with tick marks at intervals of 20.
* **Y-axis:** The y-axis ranges from 1.9 to 2.3, with tick marks at intervals of 0.1.
* **Data Series:** A single blue line represents the loss value at each step.
### Detailed Analysis
* **Initial Trend:** The blue line starts at approximately 2.28 at step 0. It decreases with some fluctuations until around step 60.
* **Fluctuations:** There are notable spikes at approximately step 35 and step 40.
* **Stabilization:** After step 60, the line stabilizes and fluctuates around a value of approximately 1.95.
**Specific Data Points (Approximate):**
* Step 0: Loss ≈ 2.28
* Step 10: Loss ≈ 2.18
* Step 20: Loss ≈ 2.08
* Step 35: Loss ≈ 2.13
* Step 40: Loss ≈ 2.00
* Step 60: Loss ≈ 1.94
* Step 80: Loss ≈ 1.95
* Step 100: Loss ≈ 1.94
* Step 120: Loss ≈ 1.93
### Key Observations
* The loss function decreases significantly in the first 60 steps.
* The loss function stabilizes after 60 steps, indicating convergence.
* There are some spikes in the loss function before it stabilizes.
### Interpretation
The chart illustrates the training process of a model, where the loss function is being minimized over time. The initial decrease in loss indicates that the model is learning effectively. The stabilization of the loss function suggests that the model has converged and is no longer improving significantly. The spikes in the loss function could be due to various factors, such as changes in the training data or learning rate. Overall, the chart shows a successful training process with a decreasing loss function that eventually converges.
</details>
(c) Number of Logic Verification
Figure 8: Training dynamics during FLV-RL optimization over 120 steps, showing (a) composite reward improvement, (b) response length evolution, and (c) number of logic verification optimization.
Appendix E Analysis of Verification Overhead in Mathematical Reasoning
Table 5: Failure case showing enforced verification overhead in mathematical reasoning (MATH_500). Ground truth answer is 27, but the model predicted 81.
Table 5 illustrates a critical failure mode induced by enforced formal verification. The problem requires finding the smallest perfect cube expressible as the sum of three consecutive integers. The correct answer is 27 (= 8+9+10), yet the model arrives at 81 through fundamentally flawed reasoning. The root cause lies in the verification paradigm's cognitive overhead: rather than directly computing $n^{3}$ for small values and checking $n^{3}=3k$ (which immediately yields $3^{3}=27$), the model constructs an unnecessarily complex z3-solver script that obscures the arithmetic. Notice how the code attempts to verify $3\times 27^{3}=(3k)^{3}$, a nonsensical constraint that conflates the problem statement (finding a cube equal to $3k$) with an arbitrary symbolic manipulation. The verification framework, instead of catching this error, produces a "DISPROVED" output that the model then rationalizes away ("upon closer inspection, it confirmed that 81 works"), demonstrating how mandatory verification can paradoxically reduce error-detection capability. This case exemplifies why formal verification tools become liabilities in computational contexts: they introduce syntactic complexity (z3 constraint formulation) that distracts from semantic correctness (direct enumeration: $1^{3}=1\neq 3k$, $2^{3}=8\neq 3k$, $3^{3}=27=3\times 9\,\checkmark$), ultimately degrading performance on problems solvable through elementary arithmetic. The flexible verification strategy addresses this by permitting direct calculation during reasoning, relegating formal tools to post-hoc validation roles where their rigor provides genuine value rather than procedural friction.
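The direct enumeration the model skipped takes only a few lines; this illustrative sketch (not part of our pipeline) recovers the correct answer immediately:

```python
# Smallest positive perfect cube that is a sum of three consecutive
# integers: (k-1) + k + (k+1) = 3k, so we need the smallest cube
# divisible by 3.
def smallest_cube_as_consecutive_sum():
    n = 1
    while True:
        cube = n ** 3
        if cube % 3 == 0:
            k = cube // 3
            return cube, (k - 1, k, k + 1)
        n += 1

smallest_cube_as_consecutive_sum()  # (27, (8, 9, 10))
```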
Appendix F Analysis of Package Usage Distribution
Table 6: Categorization of Python packages by problem-solving paradigm.
| Category | Description | Representative Packages | Paradigm |
| --- | --- | --- | --- |
| Symbolic & Logic | Handles abstract symbols, constraint satisfaction, formal logic, and graph structures. | sympy, z3-solver, networkx, constraint | Reasoning: Abstract deduction & proofs. |
| Numerical & Scientific | Performs high-precision arithmetic, matrix operations, and statistical analysis. | numpy, math, scipy, pandas, fractions | Calculation: Quantitative modeling. |
| Algorithmic & Search | Focuses on combinatorial generation, iteration, and discrete optimization strategies. | itertools, collections, random, heapq | Search: Brute-force & simulation. |
| Domain & Utilities | Tools for specific non-mathematical domains (text, time, web) and system operations. | datetime, re, requests, nltk, bs4 | Knowledge: Information retrieval. |
<details>
<summary>x11.png Details</summary>

### Visual Description
## Bar Chart: Package Usage by Category
### Overview
The image is a bar chart comparing the percentage of package usage across different categories for two systems: SimpleTIR and FLV-GRPO. The x-axis represents the package category, and the y-axis represents the percentage of package usage.
### Components/Axes
* **X-axis:** Package Category
* Categories: Symbolic & Logic, Algorithmic & Search, Numerical & Scientific, Text & NLP, Domain & Utils
* **Y-axis:** Percentage of Package Usage (%)
* Scale: 0% to 60%, with implicit increments of 10%.
* **Legend:** Located at the top-right corner.
* SimpleTIR (light blue)
* FLV-GRPO (light red)
### Detailed Analysis
The chart presents a comparison of package usage between SimpleTIR and FLV-GRPO across five categories.
* **Symbolic & Logic:**
* SimpleTIR: 42.5%
* FLV-GRPO: 62.5%
* **Algorithmic & Search:**
* SimpleTIR: 20.2%
* FLV-GRPO: 6.5%
* **Numerical & Scientific:**
* SimpleTIR: 21.8%
* FLV-GRPO: 20.4%
* **Text & NLP:**
* SimpleTIR: 4.4%
* FLV-GRPO: Approximately 0% (very small bar)
* **Domain & Utils:**
* SimpleTIR: 9.9%
* FLV-GRPO: 10.0%
### Key Observations
* FLV-GRPO has significantly higher usage in the "Symbolic & Logic" category compared to SimpleTIR.
* SimpleTIR has higher usage in the "Algorithmic & Search" category compared to FLV-GRPO.
* The usage in "Numerical & Scientific" and "Domain & Utils" categories is similar for both systems.
* Both systems have very low usage in the "Text & NLP" category, with FLV-GRPO being near zero.
### Interpretation
The data suggests that FLV-GRPO relies more heavily on "Symbolic & Logic" packages, while SimpleTIR utilizes "Algorithmic & Search" packages more. The similar usage in "Numerical & Scientific" and "Domain & Utils" indicates these categories are equally important for both systems. The low usage of "Text & NLP" packages suggests that neither system heavily incorporates text processing or natural language processing functionalities. The difference in package usage likely reflects the different design and application focuses of the two systems.
</details>
Figure 9: Comparison of package usage distribution: SimpleTIR vs. FLV-RL
The comparison between SimpleTIR and FLV-RL reveals a fundamental shift in problem-solving paradigms, moving from iterative search to abstract reasoning. As defined in Table 6, the Symbolic & Logic category encompasses tools for constraint satisfaction and formal proofs (e.g., z3-solver, sympy). The chart demonstrates a substantial increase in this category, rising from 42.5% in SimpleTIR to 62.5% in FLV-RL (Figure 9). This suggests that the FLV-RL model is increasingly relying on mathematical abstraction and logical deduction to solve problems rather than procedural code.
Conversely, usage of the Algorithmic & Search categoryâdefined as handling permutations and iterative loops (e.g., itertools)âdrops precipitously from 20.2% to 6.5%. This inverse correlation indicates that FLV-RL has largely abandoned brute-force simulation and exhaustive search strategies. Instead of generating candidate solutions through iteration, the model prefers to model the problem space symbolically and solve for the answer directly.
Meanwhile, the Numerical & Scientific and Domain & Utils categories remain relatively stable across both models (approximately 21% and 10%, respectively). This implies that while the core reasoning engine has evolved (shifting from search to logic), the foundational requirements for arithmetic calculation and environment interaction remain constant regardless of the solving strategy. Overall, these metrics quantify a qualitative leap: the model has transitioned from a âcompute-heavyâ search approach to a âreasoning-heavyâ symbolic approach.
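The paradigm shift can be illustrated on a toy constraint (find $x\le y$ with $x+y=10$ and $xy=21$), solved once by exhaustive search in the itertools style SimpleTIR favors, and once by symbolic reduction to a closed form, a pure-Python stand-in for the sympy/z3-solver style FLV-RL favors:

```python
from itertools import product

# Search paradigm: enumerate candidate pairs until the constraints hold.
def search_solution():
    for x, y in product(range(1, 11), repeat=2):
        if x <= y and x + y == 10 and x * y == 21:
            return x, y

# Symbolic paradigm: x and y are the roots of t^2 - s*t + p = 0,
# so solve the quadratic directly instead of iterating.
def symbolic_solution(s=10, p=21):
    d = (s * s - 4 * p) ** 0.5
    return (s - d) / 2, (s + d) / 2

search_solution()    # (3, 7)
symbolic_solution()  # (3.0, 7.0)
```

Both paradigms reach the same answer here, but the symbolic route does constant work while the search route scales with the candidate space, mirroring the usage shift in Figure 9.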
Appendix G GPQA Problems
Zeng (2026) identifies GPQA (Diamond) as having an inherent error rate lower bound of 26.8% and characterizes it as "an advanced intellectual booby trap from the old era." The systematic flaws include: (i) factual errors in questions/answers; (ii) missing parameters needed for correct solutions; (iii) transcription mistakes in data. The report concludes that GPQA has transformed "from a ruler for measuring intelligence into a noise generator" that measures how well models fit to logically flawed questions rather than actual reasoning ability.
Appendix H Case Study: Formal Verification in Economic Reasoning
Table 7: Error Analysis: Baseline model (Qwen 2.5-7B) reasoning trajectory with systematic failures
| Question |
| --- |
| As we move to the northwest (up and to the left) along a linear demand curve, elasticity |
| a. goes from inelastic to elastic and the line's slope is constant |
| b. goes from elastic to inelastic and the line's slope is constant |
| c. is constant and so is the line's slope |
| d. goes from inelastic to elastic and the line's slope rises in absolute terms |
| e. is constant and the line's slope falls in absolute terms |
| Answer Generated by Qwen-2.5-7B |
| The correct answer is: |
| b. goes from elastic to inelastic and the line's slope is constant. |
| Here's the reasoning: |
| Elasticity of Demand: Elasticity of demand measures the responsiveness of the quantity demanded to a change in price. It is calculated as the percentage change in quantity demanded divided by the percentage change in price. Along a linear demand curve, as you move from right to left (decreasing price and increasing quantity), the elasticity of demand changes. |
| Slope of the Demand Curve: The slope of a linear demand curve is constant, meaning the change in price per unit change in quantity is the same throughout the curve. |
| Elasticity and Position on the Demand Curve: |
| At higher prices (and lower quantities), the demand is more elastic because a small change in price leads to a relatively larger change in quantity demanded. |
| As you move to the left along the demand curve (decreasing price and increasing quantity), the demand becomes less elastic because the percentage change in quantity demanded becomes smaller relative to the percentage change in price. |
| Therefore, as you move northwest (up and to the left) along a linear demand curve, the elasticity of demand goes from elastic to inelastic, and the slope of the line remains constant. |
To illustrate how formal logic verification improves economic reasoning, we use a standard microeconomics question that exposes typical failure modes of free-form chain-of-thought and shows how verification-guided training corrects them.
H.1 Problem and Baseline Error Analysis
Problem.
Consider the multiple-choice question: "For a linear demand curve, as we move northwest along the curve, what happens to price elasticity of demand?" In the conventional $(Q,P)$ diagram (quantity on the horizontal axis, price on the vertical axis), "northwest" means higher price and lower quantity.
Baseline behavior.
The baseline model (Qwen 2.5-7B) selects option (b) "goes from elastic to inelastic" and exhibits three systematic failures (Table 7).
Failure Mode 1: Semantic grounding error (directional mis-mapping).
The core mistake is a mis-grounding of the spatial term "northwest." The correct executable semantics in the $(Q,P)$ plane is:
$$
\text{northwest}(1\rightarrow 2)\;\Rightarrow\;(P_{2}>P_{1})\ \wedge\ (Q_{2}<Q_{1}). \tag{11}
$$
The baseline instead behaves as if "northwest" implied the opposite comparative direction (effectively a southeast move), e.g.,
$$
\text{northwest}(1\rightarrow 2)\;\mapsto\;(P_{2}<P_{1})\ \wedge\ (Q_{2}>Q_{1}), \tag{12}
$$
which flips the economic interpretation and deterministically pushes the subsequent reasoning toward the wrong answer.
Failure Mode 2: Undetected logical inconsistency.
The baseline produces mutually incompatible claims without detecting the contradiction (e.g., asserting both that demand is "more elastic at higher prices" and that the alleged "northwest" move reduces elasticity under its mis-grounded direction). Formally, if a reasoning chain asserts propositions $\{\phi_{i}\}_{i=1}^{n}$, the chain should satisfy global consistency:
$$
\bigwedge_{i=1}^{n}\phi_{i}\not\equiv\bot. \tag{13}
$$
Pure next-token generation optimizes local likelihood and does not enforce Eq. (13), allowing contradictory statements to coexist.
Failure Mode 3: Conceptual conflation (slope vs. elasticity).
The baseline conflates constant slope with constant elasticity. For a linear demand curve
$$
P=a-bQ,\qquad a>0,\;b>0, \tag{14}
$$
the slope $\frac{dP}{dQ}=-b$ is constant, but point price elasticity (in magnitude) is
$$
\lvert\varepsilon(Q,P)\rvert=\left\lvert\frac{dQ}{dP}\right\rvert\frac{P}{Q}=\frac{1}{b}\cdot\frac{P}{Q}, \tag{15}
$$
which varies with the ratio $\frac{P}{Q}$ . Along the same line, moving northwest increases $P$ and decreases $Q$ , hence increases $\frac{P}{Q}$ and therefore increases $\lvert\varepsilon\rvert$ .
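The contrast between the constant slope in Eq. (14) and the varying elasticity in Eq. (15) is easy to confirm numerically. The sketch below is a stdlib-only illustration on an arbitrary demand curve $P = 10 - Q$ (our choice, not a value from the paper), not the framework's actual verifier:

```python
from fractions import Fraction

a, b = 10, 1  # arbitrary linear demand P = a - b*Q

def price(q):
    return a - b * q

def slope(q):
    # dP/dQ = -b is the same at every point on the line
    return -b

def elasticity_mag(q):
    # |dQ/dP| * P/Q = (1/b) * (P/Q), cf. Eq. (15); exact rational arithmetic
    return Fraction(price(q), b * q)

# Moving northwest: Q falls from 8 to 2 while P rises from 2 to 8
qs = [8, 5, 2]
print([slope(q) for q in qs])           # constant slope at every point
print([elasticity_mag(q) for q in qs])  # strictly increasing magnitudes
assert len({slope(q) for q in qs}) == 1
assert elasticity_mag(8) < elasticity_mag(5) < elasticity_mag(2)
```

The slope never changes while the elasticity magnitude grows from 1/4 to 4, which is exactly the distinction the baseline conflates.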
H.2 Verification-Guided Correction
Our framework corrects these errors by interleaving generation with SMT-based verification (Table 8). We summarize four mechanisms.
Mechanism 1: Executable semantic grounding.
We train the model to translate ambiguous natural language into solver-checkable constraints. For this task, the semantic parser is encouraged to map "northwest" to Eq. (11) (rather than Eq. (12)), together with domain assumptions typical in economics:
$$
P_{1}>0,\ Q_{1}>0,\ P_{2}>0,\ Q_{2}>0,\ \text{and}\ P_{i}=a-bQ_{i}\ \text{for}\ i\in\{1,2\}. \tag{16}
$$
Incorrect mappings are rejected because they become inconsistent with the intended move or with the economic domain; the solver provides immediate feedback via $\textsc{sat}/\textsc{unsat}$ .
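To illustrate what "solver-checkable" grounding looks like, the following sketch encodes Eq. (11) and the domain assumptions of Eq. (16) as plain Python predicates. The function names are ours, and the real pipeline emits SMT constraints over symbolic variables rather than checks on concrete points:

```python
def northwest(p1, q1, p2, q2):
    """Executable semantics of a 'northwest' move in the (Q, P) plane, Eq. (11)."""
    return p2 > p1 and q2 < q1

def on_demand_curve(p, q, a, b):
    """Domain assumptions of Eq. (16): positive price/quantity on P = a - b*Q."""
    return p > 0 and q > 0 and abs(p - (a - b * q)) < 1e-9

# Two points on the illustrative curve P = 10 - Q
a, b = 10, 1
assert on_demand_curve(6, 4, a, b) and on_demand_curve(8, 2, a, b)
print(northwest(6, 4, 8, 2))  # True: the correct grounding of Eq. (11)
print(northwest(8, 2, 6, 4))  # False: the mis-grounded reading of Eq. (12)
```

The mis-grounded mapping fails the predicate on the very move it is supposed to describe, which is the feedback signal the solver provides in the full system.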
One simple way to incorporate verification is to weight learning by verification success:
$$
\mathcal{L}_{\text{ground}}=-\mathbb{E}_{x\sim\mathcal{D}}\Bigl[\mathbb{I}\bigl(\text{verify}(f_{\theta}(x))=\textsc{sat}\bigr)\cdot\log P_{\theta}\bigl(f_{\theta}(x)\mid x\bigr)\Bigr], \tag{17}
$$
where $f_{\theta}$ is the model-produced formalization and $\text{verify}(\cdot)$ calls the SMT solver.
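A toy rendering of Eq. (17): verification acts as a hard indicator mask, so only solver-accepted formalizations contribute negative log-likelihood. The probabilities and verdicts below are invented for illustration only:

```python
import math

def verified_nll(batch):
    """Toy stand-in for Eq. (17): only formalizations the solver accepts
    (verdict == "sat") add loss; rejected ones contribute zero."""
    total = 0.0
    for prob, verdict in batch:  # prob stands in for P_theta(f_theta(x) | x)
        if verdict == "sat":
            total += -math.log(prob)
    return total / len(batch)

# Hypothetical mini-batch of (model probability, solver verdict) pairs
batch = [(0.8, "sat"), (0.5, "unsat"), (0.9, "sat")]
print(round(verified_nll(batch), 4))  # the unsat example contributes nothing
```

In practice the indicator multiplies per-example log-probabilities inside the fine-tuning objective; this sketch only shows the masking behavior.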
Mechanism 2: Global consistency enforcement over a chain.
Instead of allowing each step to stand alone, we require that the accumulating set of claims remains satisfiable under the shared constraints (Eqs. (14)–(16)). Concretely, we can test competing hypotheses such as:
$$
H_{\downarrow}:\lvert\varepsilon_{2}\rvert<\lvert\varepsilon_{1}\rvert \tag{18}
$$
under the "northwest" constraints. The solver returns unsat for $H_{\downarrow}$ (given standard domain assumptions), steering the model away from inconsistent chains and toward the correct alternative:
$$
H_{\uparrow}:\lvert\varepsilon_{2}\rvert>\lvert\varepsilon_{1}\rvert. \tag{19}
$$
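A lightweight stand-in for the solver's verdict on $H_{\downarrow}$: randomly sampling point pairs that satisfy the "northwest" and demand-curve constraints and searching for a counterexample finds none, mirroring the unsat result. This numeric search only illustrates the outcome and does not replace the SMT proof:

```python
import random

random.seed(0)
b = 1.0  # slope parameter; it cancels in the elasticity comparison
found_counterexample = False
for _ in range(10_000):
    a = random.uniform(1.0, 100.0)            # intercept of P = a - b*Q
    q1 = random.uniform(0.01, a / b - 0.01)   # Q1 with P1 > 0
    q2 = random.uniform(0.005, q1 * 0.999)    # enforce Q2 < Q1
    p1 = a - b * q1
    p2 = a - b * q2                           # then P2 > P1 automatically
    e1 = (1 / b) * p1 / q1                    # Eq. (15) magnitudes
    e2 = (1 / b) * p2 / q2
    if e2 < e1:                               # a witness for H_down
        found_counterexample = True
        break
print(found_counterexample)  # False: no witness for H_down is found
assert not found_counterexample
```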
Mechanism 3: Counterexample-driven learning (numerical witnesses).
When a hypothesis fails, the solver can provide concrete satisfying assignments for the correct hypothesis, serving as a numerical witness that bridges symbolic proof and intuition. For example, pick $b>0$ and two points on the same demand curve with $P_{2}>P_{1}$ and $Q_{2}<Q_{1}$ , such as $(P_{1},Q_{1})=(6,4)$ and $(P_{2},Q_{2})=(8,2)$ . Then by Eq. (15),
$$
\lvert\varepsilon_{1}\rvert=\frac{1}{b}\cdot\frac{6}{4}=\frac{1.5}{b},\qquad\lvert\varepsilon_{2}\rvert=\frac{1}{b}\cdot\frac{8}{2}=\frac{4.0}{b}, \tag{20}
$$
hence $\lvert\varepsilon_{2}\rvert>\lvert\varepsilon_{1}\rvert$ . Training on triples $(\text{hypothesis},\text{verification result},\text{witness})$ provides richer supervision than final-answer labels alone.
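The witness in Eq. (20) can be reproduced with exact rational arithmetic; since the factor $1/b$ cancels in the comparison, we set $b=1$ in this illustrative check:

```python
from fractions import Fraction

b = Fraction(1)             # any b > 0 works; it cancels in the comparison
P1, Q1 = 6, 4               # the witness points from Eq. (20)
P2, Q2 = 8, 2
e1 = Fraction(P1, Q1) / b   # = 1.5/b
e2 = Fraction(P2, Q2) / b   # = 4.0/b
print(e1, e2)               # 3/2 4
assert e2 > e1              # matches H_up, Eq. (19)
```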
Mechanism 4: Compositional algebraic reasoning with verified substeps.
We structure the explanation into verifiable primitives: (i) linear constraint $P=a-bQ$ ; (ii) derivative $\frac{dQ}{dP}=-\frac{1}{b}$ ; (iii) elasticity definition $\lvert\varepsilon\rvert=\left\lvert\frac{dQ}{dP}\right\rvert\frac{P}{Q}$ ; (iv) monotonicity: under $P_{2}>P_{1}$ and $Q_{2}<Q_{1}$ , we have $\frac{P_{2}}{Q_{2}}>\frac{P_{1}}{Q_{1}}$ , therefore $\lvert\varepsilon_{2}\rvert>\lvert\varepsilon_{1}\rvert$ . This decomposition yields reusable, solver-checkable reasoning components rather than brittle pattern matching.
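Each of the four primitives can be checked mechanically. The fractions-based sketch below uses illustrative numbers of our choosing, with a secant slope standing in for the derivative (exact here because the curve is linear):

```python
from fractions import Fraction

a, b = Fraction(10), Fraction(2)  # illustrative linear demand P = a - b*Q

# (i) linear constraint, inverted: Q = (a - P)/b
def Q_of_P(P):
    return (a - P) / b

# (ii) dQ/dP = -1/b: for a linear function every secant slope equals it
P1, P2 = Fraction(2), Fraction(6)
secant = (Q_of_P(P2) - Q_of_P(P1)) / (P2 - P1)
assert secant == -1 / b

# (iii) elasticity definition |dQ/dP| * P/Q
def elast(P):
    return abs(secant) * P / Q_of_P(P)

# (iv) monotonicity: higher P (and lower Q) gives larger |elasticity|
assert elast(P2) > elast(P1)
print(secant, elast(P1), elast(P2))  # -1/2 1/4 3/2
```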
Table 8: Formal logic verification-guided reasoning trajectory.
Appendix I Prompts
This section presents the key prompts used throughout our training pipeline. Table 9 shows the reinforcement learning rollout prompt that guides models to generate formal verification-augmented reasoning during policy optimization. Listing 1 details the prompt used for formal logic verification-guided reasoning dataset construction.
Table 9: Prompt for Model Training
| Solve the following problem step by step. You can selectively use Python with z3 (for logic) and sympy (for calculations) to verify reasoning. Python code will run in an external sandbox, returning output as <interpreter>output</interpreter>. The python code should be complete scripts, including necessary imports. |
| --- |
| Revise reasoning if the sandbox returns "disproved" or fix code if execution errors occur. |
| Code Format: |
| Each code snippet is wrapped with |
| <code> |
| ```python |
| code snippet |
| ``` |
| </code> |
| Response must end exactly as: |
| <answer> |
| [Summary of all reasoning steps] |
| \\boxed{[Final answer]} |
| </answer> |
| [Question] |
You are a helpful AI assistant. Initially, when solving a question, you would need to think step by step, without the ability to use code for calculation or logical verification. Now, you have enhanced capabilities to write code for both computational tasks and logical reasoning verification. The code will be executed by a sandbox, and the result can be returned to enhance your reasoning process.
**Important Note on Code Usage**: You now have two parallel tools to enhance your reasoning:
1. **Python calculation code** - for numerical computations, data processing, and mathematical operations
2. **Z3 logical verification** - for verifying logical reasoning, constraints, and formal proofs using the Z3 theorem prover
These are complementary tools serving different purposes. Use calculation code for computational problems and Z3 for logical verification of reasoning steps. **Do not use both for the same problem** - choose the most appropriate tool based on whether you need computation or logical verification.
The thinking process can have multiple code snippets. Each code snippet is wrapped with:
<code>
```python
code snippet
```
</code>, and should be executable. The returned result is wrapped with <interpreter>execution results</interpreter>.
Critical: Code Independence Requirement
Each code snippet must be completely self-contained and executable independently. This means:
- Each code block should include all necessary imports
- Each code block should define all variables it uses (do not rely on variables from previous code blocks)
- Each code block should be able to run successfully if executed in isolation
- If you need values from previous calculations, redefine or recalculate them in the new code block
- Think of each <code> block as being executed in a fresh Python environment
**Guidelines for Z3 Usage:**
Z3 verification should ONLY be used when it provides genuine formal verification value:
Do NOT use Z3 for:
- Verifying simple arithmetic calculations (e.g., 2 + 2 = 4, or 1.0 * (-2.0 - 4.0) = -6.0)
- Checking calculations with concrete numbers that Python already computed
- Adding concrete values as constraints and then verifying them (this is circular reasoning)
- Repeating what numerical computation already verified
DO use Z3 for:
- Proving general mathematical properties or identities that hold for ALL values (using symbolic variables)
- Verifying complex logical relationships with multiple interrelated constraints
- Checking satisfiability of constraint systems or finding whether solutions exist
- Proving inequalities or optimization conditions symbolically
- Verifying that no counterexamples exist for a general claim
- Formal verification of logical reasoning steps that involve symbolic relationships
Key principle: Use symbolic variables (not concrete numbers) to prove general statements. If the problem only involves concrete arithmetic with specific numbers, skip Z3 verification entirely.
Examples of INCORRECT Z3 usage:
# BAD: Verifying concrete arithmetic
solver.add(m == 1.0)
solver.add(v_i == 4.0)
solver.add(delta_p == -6.0)
solver.check()  # This just checks if -6.0 == -6.0, no value added
Examples of CORRECT Z3 usage:
# GOOD: Proving general algebraic equivalence
from z3 import *  # Remember: each code block needs its imports
m, v1, v2 = Reals('m v1 v2')
solver = Solver()
solver.add(m > 0)  # General constraint, not specific value
# Prove: Delta_p = m * (v2 - v1) is equivalent to Delta_p = m * v2 - m * v1 for ALL values
solver.add(m * (v2 - v1) != m * v2 - m * v1)
result = solver.check()
print("Verification result:", result)
assert result == unsat  # Proves no counterexample exists
**Guidelines for Python Calculation Code Usage:**
When NOT to use code:
- Simple arithmetic that can be done mentally
- Basic algebra or formula substitution
- Straightforward unit conversions
- Verifying obvious mathematical identities
- Problems where all steps are elementary calculations
Code Usage Limit:
- For problems solvable with basic math: Use code AT MOST 1-2 times (or not at all)
- For complex computational problems: Use code AT MOST 3-4 times
- Each code block should serve a distinct, necessary purpose
- **Never use multiple code blocks to verify the same calculation in different ways**
Goal:
Modify the original thinking process to make it more accurate by:
- Replacing manual calculation steps with Python code snippets and their execution results
- Adding Z3 logical verification when it provides genuine formal verification beyond simple arithmetic
- Keeping the core reasoning logic intact, including any unsuccessful attempts
- Adding code only where it provides genuine value
- Ensuring each code snippet serves a unique, necessary purpose
- Use Python code or Z3 verification for a combined total of no more than 4 times
- Wrap the revised thinking process within <revised_thinking_process> and </revised_thinking_process>.
User Question:
{question}
Original Thinking Process (without code interpreter's support):
<original_thinking_process> {original_response} </original_thinking_process>
Details:
1. Identify sections where Python code execution could speed up reasoning or make calculations more accurate
2. Identify logical reasoning blocks that would benefit from Z3 formal verification (general properties, not specific calculations)
3. Replace manual calculation steps with code snippets and the corresponding interpreter's execution results
4. Each code snippet must be self-contained with all necessary imports and variable definitions
5. Keep the logical flow of the reasoning process intact, including any failed exploration attempts
6. Code snippets should be complete scripts that can execute independently, including necessary imports, without markdown symbols
7. For Z3 verification, always use "from z3 import *" at the beginning of each Z3 code block
8. Outputs in code snippets must explicitly call the print function
9. Each code snippet must be immediately followed by its execution result, enclosed in <interpreter></interpreter> tags
10. Execution results should match the model's output exactly, with no extra or missing tokens
11. Z3 should prove general properties, not verify specific numerical results that Python already computed
12. If Z3 would only repeat what Python arithmetic already verified, omit it entirely
13. Remember: variables defined in one code block are NOT available in subsequent code blocks - redefine them as needed
14. When performing calculations, format numerical outputs to 2-4 decimal places using round() or f-strings (e.g., print(f"{result:.2f}")) to avoid displaying unnecessary floating-point digits. Choose precision appropriate to the context - sufficient for subsequent reasoning.
Revised Thinking Process (With independent selective Python computation blocks and Z3 formal verification):
Listing 1: Prompt for formal logic verification-guided reasoning chain generation