# Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification in Language Models
**Authors**: Chuxue Cao, Jinluan Yang, Haoran Li, Kunhao Pan, Zijian Zhao, Zhengyu Chen, Yuchen Tian, Lijun Wu, Conghui He, Sirui Han, Yike Guo
## Abstract
Large Language Models (LLMs) show remarkable capabilities, yet their stochastic next-token prediction creates logical inconsistencies and reward hacking that formal symbolic systems avoid. To bridge this gap, we introduce a formal logic verification-guided framework that dynamically interleaves formal symbolic verification with the natural language generation process, providing real-time feedback to detect and rectify errors as they occur. Distinguished from previous neuro-symbolic methods limited by passive post-hoc validation, our approach actively penalizes intermediate fallacies during the reasoning chain. We operationalize this framework via a novel two-stage training pipeline that synergizes formal logic verification-guided supervised fine-tuning and policy optimization. Extensive evaluation on six benchmarks spanning mathematical, logical, and general reasoning demonstrates that our 7B and 14B models outperform state-of-the-art baselines by average margins of 10.4% and 14.2%, respectively. These results validate that formal verification can serve as a scalable mechanism to significantly push the performance boundaries of advanced LLM reasoning. We will release our data and models for further exploration soon.
Machine Learning, ICML
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Logic Consistency in NL Reasoning Chains
### Overview
The image presents a comparison of Natural Language (NL) reasoning and Formal Logic reasoning in solving a simple comparative problem ("Alice > Bob, Charlie < Alice, Diana > Charlie. Who scores higher: Bob or Diana?"). It highlights inconsistencies in NL reasoning chains and contrasts them with the output of formal logic. A bar chart illustrates the logic consistency in correct and wrong CoT (Chain of Thought) reasoning.
### Components/Axes
The diagram is divided into four main sections, arranged in a 2x2 grid.
* **Top-Left:** "NL Reasoning" with a chain of reasoning and an incorrect answer marked with a red "X".
* **Top-Right:** "NL Reasoning" with a chain of reasoning and "Formal Logic Reasoning" code snippet.
* **Bottom-Left:** A bar chart titled "Logic Consistency in NL Reasoning Chains".
* **X-axis:** "Logic Consistency in NL Reasoning Chains" with categories "Correct CoT" and "Wrong CoT".
* **Y-axis:** "Percentage (%)".
* **Bottom-Right:** "Compiler Output" and "Answer" with a correct answer marked with a green checkmark.
The bar chart legend is positioned in the top-right corner of the chart itself.
* **Legend:**
* Blue: "Consistent Logic"
* Red: "Inconsistent Logic"
### Detailed Analysis or Content Details
**Problem Statement:** "Alice > Bob, Charlie < Alice, Diana > Charlie. Who scores higher: Bob or Diana?"
**NL Reasoning (Incorrect):**
"Charlie < Diana < Alice > Bob â Therefore: Diana > Bob"
Answer: "Diana scores higher than Bob" (marked with a red "X")
**NL Reasoning (Correct) & Formal Logic Reasoning:**
"Charlie < Diana < Alice > Bob â Therefore: Diana > Bob"
Formal Logic Reasoning:
```
solver.add(bob > diana)
result = solver.check()
solver.add(diana > bob)
result = solver.check()
```
**Bar Chart Data:**
* **Correct CoT:**
* Consistent Logic: Approximately 60.7%
* Inconsistent Logic: Approximately 39.3%
* **Wrong CoT:**
* Consistent Logic: Approximately 47.6%
* Inconsistent Logic: Approximately 52.4%
**Compiler Output:** "Unknown"
Answer: "Relationship is undetermined" (marked with a green checkmark)
### Key Observations
* NL reasoning, even when following a logical chain, can lead to incorrect conclusions (as demonstrated by the first NL Reasoning example).
* Formal logic provides a deterministic approach to solving the problem.
* The bar chart shows that even in "Correct CoT" reasoning, there's a significant percentage (around 39.3%) of inconsistent logic.
* "Wrong CoT" reasoning has a slightly higher percentage of inconsistent logic (around 52.4%) than consistent logic (around 47.6%).
* The compiler output is "Unknown", indicating the formal logic approach cannot definitively determine the relationship.
### Interpretation
The diagram illustrates the challenges of relying solely on natural language reasoning for logical deduction. While humans can often intuitively arrive at correct answers, the process is prone to inconsistencies and errors. The formal logic approach, while more rigorous, can sometimes yield inconclusive results ("Unknown"). The bar chart highlights that even when a chain of thought *appears* correct, underlying logical inconsistencies can still exist. This suggests that a combination of NL reasoning and formal verification might be necessary for reliable decision-making in complex scenarios. The difference in percentages between "Correct CoT" and "Wrong CoT" suggests that the CoT approach itself is not a guarantee of logical consistency. The "Unknown" compiler output suggests the problem may be underconstrained or require additional information to resolve.
</details>
Figure 1: Comparison between natural language reasoning and formal logic verification-guided reasoning. Formal verification detects logical errors in natural language reasoning and provides corrected reasoning paths. "NL" means Natural Language.
Large Language Models (LLMs) demonstrate impressive proficiency in mathematical and logical reasoning (Ahn et al., 2024; Ji et al., 2025; Liu et al., 2025b; Chen et al., 2025a), yet their probabilistic decoding process lacks inherent mechanisms to ensure consistency (Sheng et al., 2025b; Baker et al., 2025). This tension creates significant risks, including hallucinations (Li and Ng, 2025; Sheng et al., 2025b), safety vulnerabilities (Zhou et al., 2025; Cao et al., 2025b), and reward hacking (Chen et al., 2025b; Baker et al., 2025). Although recent efforts have employed model-based verifiers to offer denser feedback than sparse ground-truth labels (Ma et al., 2025; Liu et al., 2025c; Guo et al., 2025b), they often overlook intermediate reasoning steps. To enforce more rigorous supervision, subsequent research has incorporated formal tools like theorem provers and code interpreters (Ospanov et al., 2025; Kamoi et al., 2025; Liu et al., 2025a) to address this drawback. However, existing formal approaches face critical limitations: they are often restricted to specific domains (e.g., Mathematics) (Ospanov et al., 2025; Liu et al., 2025a), rely on uncertain autoformalization (Ospanov et al., 2025; Feng et al., 2025b), or utilize post-hoc verification that cannot actively prevent error propagation (Kamoi et al., 2025; Feng et al., 2025b). This yields the primary question to be explored:
(Q) Can we utilize formal verification to further enhance LLM reasoning across diverse domains?
To explore this question, we first quantified the logical consistency gap in current LLMs by conducting a formal verification analysis of generated reasoning chains. A critical finding emerges as shown in Figure 1: even chains that arrive at correct final answers suffer from substantial logical inconsistency, with 39.3% of steps formally disproved, a trend consistent with previous research (Sheng et al., 2025b; Leang et al., 2025). For chains leading to incorrect answers, this failure rate rises to 52.4%. The comparative example in Figure 1 illustrates this gap: natural language reasoning incorrectly concludes "Diana $>$ Bob" from the given constraints, while formal verification identifies the flawed inference and arrives at the correct answer. This phenomenon reveals pervasive "reward hacking," where models exploit superficial patterns to reach correct labels without establishing sound logical foundations (Skalse et al., 2022). Ultimately, these results expose a fundamental limitation of natural language reasoning: without explicit verification mechanisms, models cannot guarantee reasoning validity or global logical coherence across multi-step inference.
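The Figure 1 example can be reproduced mechanically. The sketch below is a pure-Python stand-in for the Z3 satisfiability calls shown in the figure: it enumerates every score ordering consistent with the stated constraints and checks whether either direction of the Bob-vs-Diana comparison is entailed (the brute-force enumeration is illustrative; an SMT solver would perform the same entailment checks symbolically).

```python
from itertools import permutations

# Constraints from the Figure 1 problem: Alice > Bob, Charlie < Alice,
# Diana > Charlie. Who scores higher: Bob or Diana?
people = ["alice", "bob", "charlie", "diana"]
constraints = [
    lambda m: m["alice"] > m["bob"],
    lambda m: m["charlie"] < m["alice"],
    lambda m: m["diana"] > m["charlie"],
]

# Enumerate all total orderings of the four scores that satisfy every
# constraint; each surviving assignment is a "model" of the problem.
models = []
for ranks in permutations(range(4)):
    m = dict(zip(people, ranks))
    if all(c(m) for c in constraints):
        models.append(m)

# A conclusion is entailed only if it holds in every model.
print("Diana > Bob entailed:", all(m["diana"] > m["bob"] for m in models))  # False
print("Bob > Diana entailed:", all(m["bob"] > m["diana"] for m in models))  # False
```

Neither direction holds in all models, so the relationship is undetermined, matching the figure's "Unknown" compiler output and exposing the shortcut taken by the natural language chain.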
To bridge this gap, we propose a novel framework that synergizes probabilistic natural language reasoning with formal symbolic verification. Distinguished from prior approaches relying on static filtering or narrow domains, our method dynamically interleaves formal verification into the generation process. By incorporating feedback from satisfiability results, counterexamples, and execution outputs, we extend the standard chain-of-thought paradigm to enable real-time error detection and rectification. To effectively operationalize this interleaved reasoning approach, we introduce a formal logic verification-guided training framework comprising two synergistic stages: (i) Supervised Fine-tuning (SFT), which employs a hierarchical data synthesis pipeline. Crucially, we mitigate the noise of raw autoformalization by enforcing execution-based validation, thereby ensuring high alignment between natural language and formal proofs; and (ii) Reinforcement Learning (RL), which utilizes Group Relative Policy Optimization (GRPO) with a composite reward function to enforce structural integrity and penalize logical fallacies. Empirical evaluations across logical, mathematical, and general reasoning domains demonstrate that this framework substantially enhances reasoning accuracy, highlighting the potential of formal verification to push the performance boundaries of LLM reasoning. Our contributions are summarized as follows:
- We propose the first framework that dynamically interleaves formal verification into LLM reasoning across diverse domains, utilizing real-time feedback from symbolic interpreters to transcend the limitations of passive post-hoc filtering and domain-specific theorem proving.
- We introduce a two-stage training framework that combines formal verification-guided supervised fine-tuning with policy optimization, featuring a novel data synthesis pipeline with execution-based validation to enforce logical soundness and structural integrity.
- Extensive evaluations on six benchmarks demonstrate that our models break performance ceilings, surpassing SOTA baselines by average margins of 10.4% (7B) and 14.2% (14B). This validates the scalability and effectiveness of our proposed method.
## 2 Related Works
### 2.1 Large Language Models for Natural Reasoning
Supervised fine-tuning (SFT) on chain-of-thought examples (Wei et al., 2022) and step-by-step solutions (Cobbe et al., 2021) has been foundational for developing reasoning capabilities in LLMs, with recent efforts curating high-quality datasets across mathematics (LI et al., 2024), code (Xu et al., 2025), and science (Wang et al., 2022). However, SFT alone cannot effectively optimize for complex objectives beyond imitation and struggles with multi-step error correction (Lightman et al., 2023; Uesato et al., 2022; Zhou et al., 2026). Recent RL advances using outcome-based optimization methods have achieved remarkable success in mathematical reasoning (Cobbe et al., 2021; Zeng et al., 2025), code generation (Le et al., 2022; Feng et al., 2025a), and general-domain reasoning (Ma et al., 2025; Chen et al., 2025c). However, optimizing solely for final answer correctness creates perverse incentives where models learn correct conclusions through logically invalid pathways (Uesato et al., 2022), leading to reward hacking (Skalse et al., 2022) and brittleness under distribution shift (Hendrycks et al., 2021). To address these limitations, process-based rewards incorporate feedback from intermediate steps, providing dense supervision through human-annotated judgments (Uesato et al., 2022; Lightman et al., 2023; She et al., 2025; Khalifa et al., 2025). However, the probabilistic nature of language model-based verifiers introduces errors and biases (Zheng et al., 2023), limiting their ability to detect subtle logical inconsistencies that emerge during multi-step reasoning.
### 2.2 Formal Reasoning and Verification
Recent work has integrated formal verification tools, including theorem provers (Yang et al., 2023; Cao et al., 2025a; Tian et al., 2025), code interpreters (Feng et al., 2025a), and symbolic solvers (Li et al., 2025a), to provide machine-checkable validation beyond the biases of LLM-as-a-judge approaches (Li and Ng, 2025; Uesato et al., 2022; Lightman et al., 2023). This direction is increasingly recognized as critical for grounding generative models in verifiable systems (Ren et al., 2025; Wang et al., 2025; Hu et al., 2025). Existing approaches differ in how verification is applied. HERMES (Ospanov et al., 2025) interleaves informal reasoning with Lean-verified steps, ensuring real-time soundness but requiring mature formal libraries. Safe (Liu et al., 2025a) applies post-hoc verification to audit completed reasoning chains, though this passive mode cannot prevent error accumulation during generation. VeriCoT (Feng et al., 2025b) checks logical consistency on extracted first-order logic, while others train verifier models using formal tool signals (Kamoi et al., 2025). Tool-integrated methods (Xue et al., 2025; Zeng et al., 2025; Li et al., 2025b; Feng et al., 2025a) embed interpreter calls into generation for calculation or simulation, but often lack strict logical guarantees. These approaches face key limitations: specialization to mathematical theorem proving, treating verification as separate filtering without guiding generation, or relying on uncertain logic extraction and neural verifiers. In contrast, we propose verification-guided reasoning that extends formal verification to general logical domains and employs real-time feedback as dynamic, in-process guidance to steer reasoning trajectories and enable self-correction beyond specialized formal tasks.
## 3 Preliminaries
<details>
<summary>x2.png Details</summary>

### Visual Description
## Bar Chart: Performance Comparison of Natural-SFT and FLV-SFT across Domains
### Overview
This bar chart compares the number of correct answers achieved by two models, "Natural-SFT" and "FLV-SFT", across three different domains: "Logical", "Mathematical", and "General". The chart uses paired bars for each domain, with percentage improvements indicated above the FLV-SFT bars.
### Components/Axes
* **X-axis:** "Domain" with categories: "Logical", "Mathematical", "General".
* **Y-axis:** "Number of Correct Answers" with a scale ranging from 0 to 350, incrementing by 50.
* **Legend:** Located in the top-right corner.
* "Natural-SFT" - represented by a light blue color.
* "FLV-SFT" - represented by a red color.
* **Data Labels:** Numerical values are displayed above each bar, indicating the number of correct answers.
* **Percentage Change Labels:** Percentage increases are displayed above the "FLV-SFT" bars, indicating the improvement over "Natural-SFT".
### Detailed Analysis
The chart consists of three sets of paired bars, one set for each domain.
**Logical Domain:**
* "Natural-SFT" bar: The light blue bar reaches approximately 219 correct answers.
* "FLV-SFT" bar: The red bar reaches approximately 291 correct answers.
* Percentage Change: "+32.8%" is displayed above the "FLV-SFT" bar.
**Mathematical Domain:**
* "Natural-SFT" bar: The light blue bar reaches approximately 163 correct answers.
* "FLV-SFT" bar: The red bar reaches approximately 243 correct answers.
* Percentage Change: "+49.3%" is displayed above the "FLV-SFT" bar.
**General Domain:**
* "Natural-SFT" bar: The light blue bar reaches approximately 166 correct answers.
* "FLV-SFT" bar: The red bar reaches approximately 213 correct answers.
* Percentage Change: "+28.5%" is displayed above the "FLV-SFT" bar.
### Key Observations
* "FLV-SFT" consistently outperforms "Natural-SFT" across all three domains.
* The largest performance improvement of "FLV-SFT" over "Natural-SFT" is observed in the "Mathematical" domain (+49.3%).
* The smallest performance improvement is observed in the "General" domain (+28.5%).
* The absolute difference in the number of correct answers is largest in the "Mathematical" domain (243 - 163 = 80).
### Interpretation
The data suggests that the "FLV-SFT" model demonstrates superior performance compared to the "Natural-SFT" model across all tested domains. The significant improvement in the "Mathematical" domain indicates that "FLV-SFT" may have a stronger capability in handling quantitative reasoning or mathematical problem-solving. The consistent positive percentage changes across all domains suggest that the enhancements implemented in "FLV-SFT" are broadly beneficial. The chart provides a clear visual comparison of the models' capabilities, highlighting the advantages of "FLV-SFT" in terms of accuracy and performance. The inclusion of percentage changes allows for a quick assessment of the relative improvement offered by "FLV-SFT" in each domain.
</details>
Figure 2: Number of correct answers using natural language reasoning versus formal logic verification. We randomly sampled 500 instances from each domain for this comparison.
We first present empirical evidence illustrating the gap between probabilistic natural language reasoning and formal verification in LLM reasoning. We then formally define our reasoning paradigm and introduce the preliminaries on the symbolic verification methods utilized in our framework.
### 3.1 Natural vs. Formal Reasoning in LLMs
Our previous experiments demonstrate that LLMs lack mechanisms to ensure global logical consistency (Figure 1), motivating us to explore formal logic verification. Formal logic provides a rigorous framework where reasoning steps can be reliably validated using formal solvers. As shown in Figure 2, integrating formal logic verification with natural language reasoning yields significant performance improvements across diverse domains. We compare two approaches: Natural-SFT, which relies solely on natural language reasoning, and FLV-SFT, which incorporates formal logic verification. Across 500 randomly sampled instances from each domain, FLV-SFT consistently outperforms Natural-SFT, achieving 291 vs. 219 correct answers in the Logical domain (+32.8%), 243 vs. 163 in the Mathematical domain (+49.3%), and 213 vs. 166 in the General domain (+28.5%). These substantial improvements across all three categories demonstrate that formal verification effectively addresses the consistency gaps inherent in purely neural approaches. These results underscore the significant potential of formal verification to bridge the reasoning gap and strongly motivate our approach of interleaving natural language reasoning with formal verification throughout the reasoning process.
### 3.2 Problem Formulation
Formally, let $\mathcal{D}=\{(x,y)\}$ be a dataset of reasoning tasks, where $x$ denotes the input context (e.g., problem description) and $y$ denotes the ground-truth answer.
Standard CoT Paradigm. In conventional Chain-of-Thought reasoning, an LLM $P_{\theta}$ generates a sequence of reasoning steps $z=(s_{1},s_{2},\dots,s_{n})$ in natural language, aiming to maximize:
$$
P_{\theta}(y,z\mid x)=P_{\theta}(y\mid z,x)\cdot\prod_{i=1}^{n}P_{\theta}(s_{i}\mid s_{<i},x) \tag{1}
$$
However, this objective does not guarantee that $z$ is logically valid or formally verifiable.
Our Paradigm: Formal Logic Verification-Guided Reasoning. We propose augmenting the reasoning chain with formal verification at each step. Specifically, we define an extended reasoning chain $z^{\prime}=(s_{1},f_{1},v_{1},s_{2},f_{2},v_{2},\dots,s_{n},f_{n},v_{n})$ , where:
- $s_{i}$ : Natural language reasoning step (as in standard CoT)
- $f_{i}$ : Formal specification that encodes the logical correctness of $s_{i}$ (e.g., symbolic constraints, SAT clauses, SMT formulas, or executable code)
- $v_{i}$ : Formal Logic Verification result returned by a formal verifier $\mathcal{V}$ when applied to $f_{i}$
During training, our objective is to maximize the likelihood of correct final answers:
$$
\max_{\theta}\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\log P_{\theta}(y,z^{\prime}\mid x)\right] \tag{2}
$$
During inference, the verification function $\mathcal{V}$ takes the formal reasoning as input and returns detailed feedback at each reasoning step. This feedback may include satisfiability results, counterexamples, proof traces, execution outputs, or error messages, which guide the model to generate logically sound and verifiable subsequent reasoning steps.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Formal Logic Verification Pipeline
### Overview
The image depicts a two-stage pipeline for training a model, combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) with a formal logic verification loop. The pipeline aims to improve model reliability by verifying reasoning steps using formal logic. The diagram illustrates the flow of data and control between different components, including input, distillation, filtering, formal logic insertion, verification, and policy updates.
### Components/Axes
The diagram is structured into two main stages: "Stage 1: SFT" and "Stage 2: RL". Within each stage, several components are interconnected. Key elements include:
* **Input q:** Represents the initial input query.
* **Distill:** A process that distills information from the input.
* **Filter by Fail rate:** A filtering mechanism based on failure rate.
* **Difficult Questions:** Questions that pose a challenge to the model.
* **Formal Logic Insert:** A component that inserts formal logic into the reasoning process.
* **Formal Logic Verification Trace:** A trace of the formal logic verification process, including Natural Language Reasoning and Formal Logic Reasoning.
* **Verifier:** A component that verifies the reasoning steps.
* **Verifier Observation:** The output of the verifier.
* **SFT:** Supervised Fine-Tuning stage.
* **FLV-SFT Model:** The resulting model after SFT with formal logic verification.
* **Policy Model:** The model used in the RL stage.
* **Step tn, tn+1, tm:** Represent iterative steps within the RL loop.
* **GRPO Advantage Ai:** Represents the advantage gained in the RL stage.
* **Backprop:** Backpropagation for policy updates.
* **Correct Reward/Instruct Reward:** The reward signal used in RL.
* **Interleaved Reasoning Loop:** The core loop of the RL stage.
### Detailed Analysis or Content Details
**Stage 1: SFT**
1. **Input q** is processed through **Distill**.
2. The distilled output is **Filtered by Fail rate**. Failed outputs (marked with a red 'X') are discarded.
3. Remaining outputs are passed to **Formal Logic Insert**.
4. The output of **Formal Logic Insert** is fed into the **Formal Logic Verification Trace**, which includes **Natural Language Reasoning** and **Formal Logic Reasoning**.
5. The **Verifier** evaluates the trace and produces a **Verifier Observation**.
6. If the verification is successful (green checkmark), the output is stored in **SFT**.
7. The verified data is used to train the **FLV-SFT Model**. If verification fails (red 'X'), the output is **Rejected**.
**Stage 2: RL**
1. The **Policy Model** receives input and generates an action.
2. **Step tn** shows an example of the reasoning process:
* **(Natural Language Reasoning & Formal Action):** "Let's verify the logic using z3. Implies (And(both_optimal, duality_holds), both_feasible) solver = Solver()"
* "result = solver.check()"
3. **Step tn+1** displays the **Verifier Observation**: "Are the triple iff and strong duality equivalent? False"
4. **Step tm** shows the **Next Natural Language Reasoning**: "The result indicates that the triple iff structure (Both the primal LP and the dual LP have feasible solutions then...)"
5. The **Correct Reward/Instruct Reward** is used to update the **Policy Model** via **Backprop**.
6. The **GRPO Advantage Ai** represents the improvement in the policy.
7. The entire process is encapsulated within an **Interleaved Reasoning Loop**.
### Key Observations
* The pipeline emphasizes the integration of formal logic verification into both SFT and RL stages.
* The RL stage involves an iterative loop where the policy model is refined based on verification results and rewards.
* The diagram highlights the importance of identifying and filtering out incorrect reasoning steps.
* The use of "z3" suggests a specific theorem prover is being employed for formal verification.
* The "Interleaved Reasoning Loop" suggests a tight coupling between reasoning, verification, and policy updates.
### Interpretation
This diagram illustrates a sophisticated approach to training language models that prioritizes correctness and reliability. By incorporating formal logic verification, the pipeline aims to mitigate the risk of generating incorrect or misleading outputs. The two-stage process, SFT followed by RL, allows for initial learning from supervised data and subsequent refinement through reinforcement learning guided by formal verification. The "Interleaved Reasoning Loop" is crucial, as it enables continuous feedback and improvement of the policy model. The use of a specific theorem prover (z3) indicates a commitment to rigorous verification. The diagram suggests a research effort focused on building more trustworthy and verifiable AI systems. The inclusion of "Difficult Questions" and "Fail rate" filtering suggests an attempt to focus training on challenging cases where errors are more likely to occur. The overall design reflects a desire to move beyond purely statistical learning and incorporate symbolic reasoning into the training process.
</details>
Figure 3: Overview of the formal logic verification-guided training framework. The framework operates in two stages: (1) SFT: A teacher model synthesizes formal logic verification traces, which are validated by a verifier. A subset of verified samples is used to fine-tune the model, while challenging samples are reserved for RL training. (2) RL: The policy model generates natural language reasoning followed by formal reasoning. A formal interpreter verifies the formal reasoning and provides feedback, enabling iterative refinement until the model produces a final answer or reaches the maximum number of interpreter calls. Rewards computed from verification outcomes are used to calculate advantages and update the policy model via reinforcement learning.
## 4 Methodology
To address logical inconsistencies and hallucinations in pure natural language reasoning (Section 3), we propose integrating formal logic verification into the reasoning process. Our approach consists of two stages: (i) Supervised Fine-tuning to enable the model to generate interleaved natural language and formal proofs, and (ii) Reinforcement Learning to optimize the model using composite rewards that enforce logical soundness and correctness.
### 4.1 Formal Logic Verification-Guided SFT
The primary goal of the SFT stage is to align the model's output distribution with a structured reasoning format that supports self-verification. Since large-scale datasets containing interleaved reasoning and formal proofs are scarce, we employ a hierarchical formal proof data synthesis pipeline.
#### 4.1.1 Data Synthesis Pipeline
CoT Generation. Given a raw reasoning problem $q$ , we first employ a strong teacher model to generate $K=4$ candidate reasoning chains. A judge model evaluates the correctness of the final answers to calculate pass rates. We select a subset of verified chains that yield correct solutions for subsequent processing. Let $z$ denote a selected correct reasoning chain. To incorporate formal logic, we utilize an LLM to decompose $z$ into discrete logical modules $\{s_{k}\}_{k=1}^{N}$ . For each module $s_{k}$ , the LLM synthesizes a corresponding formal proof $f_{k}$ and predicts an expected execution output $v_{k}^{\text{exp}}$ . The prompt template is provided in Figure 9.
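The candidate generation and routing step can be sketched as below. The `generate` and `judge` callbacks stand in for the teacher and judge models, and the `rl_threshold` cutoff for routing hard questions to RL is an illustrative assumption (the paper only states that challenging samples, identified by fail rate, are reserved for RL).

```python
K = 4  # candidate reasoning chains per question, as in the pipeline


def triage(question, generate, judge, rl_threshold=0.25):
    """Generate K candidate chains, compute the pass rate, and route the
    question: verified chains feed SFT augmentation, while questions the
    teacher mostly fails are held out for RL training.
    """
    chains = [generate(question) for _ in range(K)]
    correct = [z for z in chains if judge(question, z)]
    pass_rate = len(correct) / K
    if pass_rate <= rl_threshold:
        return "rl", correct   # difficult question -> RL pool
    return "sft", correct      # correct chains -> formal-proof synthesis
```

Each chain routed to SFT is then decomposed into modules $\{s_k\}$ and paired with synthesized formal proofs, as described above.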
Execution-based Validation and Correction. To ensure the fidelity of synthesized formal proofs, we implement a rigorous validation mechanism. Each generated formal proof $f_{k}$ is executed in a sandbox to obtain the actual output $v_{k}^{\text{act}}$ . We then perform a three-stage validation:
Stage 1: Exact Match. If the actual output exactly matches the expected output ( $v_{k}^{\text{act}}=v_{k}^{\text{exp}}$ ), the proof is accepted immediately and integrated into the training data.
Stage 2: Semantic Equivalence Check. In cases where $v_{k}^{\text{act}}\neq v_{k}^{\text{exp}}$ , we employ a verification model to assess whether the discrepancy is semantically negligible (e.g., differences in capitalization, output ordering, or numerical precision). If the outputs are deemed equivalent under mathematical or logical semantics, we proceed to Stage 3.
Stage 3: Proof Rewriting. When minor inconsistencies are detected, we require the LLM to regenerate the natural language reasoning $s_{k}^{\prime}$ conditioned on the actual execution result $v_{k}^{\text{act}}$ . This ensures that the natural language reasoning module $s_{k}^{\prime}$ , the formal proof $f_{k}^{\prime}$ , and the execution output $v_{k}^{\text{act}}$ maintain strict logical coherence. Proofs that fail both exact match and semantic equivalence checks are discarded. The resulting training instance is structured as follows:
$$
\begin{split}z_{\text{aug}}=\bigoplus_{k=1}^{N}\Big(s_{k}\oplus\texttt{<code>}f_{k}^{\prime}\texttt{</code>}\\
\oplus\texttt{<interpreter>}v_{k}^{\text{act}}\texttt{</interpreter>}\Big)\end{split} \tag{3}
$$
where $f_{k}^{\prime}$ denotes the validated formal proof and $v_{k}^{\text{act}}$ is the verified execution output. This pipeline ensures that every training example $(s,f,v)$ exhibits perfect alignment between natural language hypotheses, formal logic reasoning, and execution feedback, thereby providing high-quality supervision for the model to learn reliable reasoning patterns. See Appendix B for dataset construction details.
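The three validation stages and the assembly of Eq. (3) can be sketched as follows. The `execute`, `equivalent`, and `rewrite_nl` callbacks stand in for the sandbox, the verification model, and the rewriting LLM respectively; their names and signatures are assumptions of this sketch.

```python
def validate_module(s_k, f_k, v_exp, execute, equivalent, rewrite_nl):
    """Three-stage validation of one synthesized module (Section 4.1.1)."""
    v_act = execute(f_k)
    if v_act == v_exp:                      # Stage 1: exact match -> accept
        return s_k, f_k, v_act
    if equivalent(v_act, v_exp):            # Stage 2: negligible discrepancy
        return rewrite_nl(s_k, v_act), f_k, v_act  # Stage 3: re-ground the NL
    return None                             # fails both checks -> discard


def assemble(modules):
    """Interleave accepted modules into z_aug per Eq. (3)."""
    return "".join(
        f"{s}<code>{f}</code><interpreter>{v}</interpreter>"
        for s, f, v in modules
    )
```

Discarded modules simply drop out of `assemble`, so every emitted training instance keeps natural language, formal proof, and execution output in strict agreement.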
#### 4.1.2 Optimization Objective
Given the augmented training dataset $\mathcal{D}_{\text{SFT}}=\{(q_{i},z_{\text{aug},i})\}_{i=1}^{M}$ , we optimize the model parameters $\theta$ by maximizing the log-likelihood of generating structured reasoning sequences:
$$
\mathcal{L}_{\text{SFT}}(\theta)=-\mathbb{E}_{(q,z_{\text{aug}})\sim\mathcal{D}_{\text{SFT}}}\left[\log P_{\theta}(z_{\text{aug}}\mid q)\right] \tag{4}
$$
This can be decomposed into the sequential generation of reasoning modules, formal proofs, and execution outputs:
$$
\begin{split}\mathcal{L}_{\text{SFT}}(\theta)=-\mathbb{E}_{(q,z_{\text{aug}})\sim\mathcal{D}_{\text{SFT}}}\Bigg[\sum_{k=1}^{N}\Big(\log P_{\theta}(s_{k}\mid q,z_{<k})\\
+\log P_{\theta}(f_{k}^{\prime}\mid q,z_{<k},s_{k})+\log P_{\theta}(v_{k}^{\text{act}}\mid q,z_{<k},s_{k},f_{k}^{\prime})\Big)\Bigg]\end{split} \tag{5}
$$
where $z_{<k}$ denotes all previous modules. We train using AdamW with cosine learning rate scheduling and gradient clipping. This stage enables the model to generate verifiable, formally grounded reasoning chains.
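The decomposition in Eq. (5) can be written out for a single example as below; `logprob(segment, context)` is an assumed hook returning the model's log-likelihood of `segment` given `context`, not part of any real API.

```python
def sft_loss(modules, logprob):
    """Decomposed SFT loss of Eq. (5) for one example (sketch).

    modules: list of (s_k, f_k, v_k) triples in generation order.
    logprob: assumed hook, logprob(segment, context) -> float.
    """
    ctx, total = "", 0.0
    for s, f, v in modules:
        total += logprob(s, ctx)            # log P(s_k | q, z_<k)
        total += logprob(f, ctx + s)        # log P(f'_k | ..., s_k)
        total += logprob(v, ctx + s + f)    # log P(v_act_k | ..., f'_k)
        ctx += s + f + v                    # extend z_<k for the next module
    return -total                           # negative log-likelihood
```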
### 4.2 Formal Verification-Guided Policy Optimization
To further enhance the formal logic verification-guided reasoning capabilities of LLMs, we employ reinforcement learning. The core of this stage is a multi-dimensional reward function that provides fine-grained feedback on structure, semantics, and computational efficiency.
#### 4.2.1 Hierarchical Reward Design
To ensure both generation stability and reasoning rigor, we design a hierarchical reward function $R(y)$ that evaluates responses in a strictly prioritized order: Format Integrity $\succ$ Structural Compliance $\succ$ Logical Correctness. The unified reward is formulated as:
$$
R(y)=\begin{cases}R_{\text{fatal}}&y\in\mathbb{C}_{\text{fatal}}\qquad\text{(L1: Fatal)}\\
R_{\text{invalid}}&y\in\mathbb{C}_{\text{invalid}}\qquad\text{(L2: Format)}\\
R_{\text{total}}(y)&\text{otherwise}\qquad\text{(L3: Valid)}\end{cases} \tag{6}
$$
The total reward for valid responses is:
$$
R_{\text{total}}(y)=R_{\text{struct}}(y)+R_{\text{logic}}(y) \tag{7}
$$
Level 1 & 2: Penalties for Malformed Generation. We first filter out pathological behaviors to prevent reward hacking and infinite loops during training.
- Fatal Errors ( $\mathbb{C}_{\text{fatal}}$ ): Responses with severe and unrecoverable execution failures (e.g., timeouts, repetition loops, excessive tool calls). We assign a harsh penalty $R_{\text{fatal}}=-\gamma_{\text{struct}}-W$ to strictly inhibit these states, where $W>0$ is a correctness weight hyperparameter.
- Format Violations ( $\mathbb{C}_{\text{invalid}}$ ): Responses that are technically executable but structurally flawed (e.g., missing termination tags, no extractable final answer, excessive verbosity). These incur a moderate penalty $R_{\text{invalid}}=-\beta_{\text{struct}}-W$ , where $\gamma_{\text{struct}}>\beta_{\text{struct}}>0$ .
Level 3: Incentives for Valid Reasoning. For responses that pass the format checks ( $y\notin\mathbb{C}_{\text{fatal}}\cup\mathbb{C}_{\text{invalid}}$ ), the reward is a composite of structural quality and logical correctness.
(i) Structural Reward $R_{\text{struct}}(y)$ : Encourages concise and compliant tool usage.
$$
R_{\text{struct}}(y)=\alpha-\lambda_{\text{tag}}\cdot N_{\text{undef}}-\lambda_{\text{call}}\cdot\max(N_{\text{call}}-N_{\text{max}},0) \tag{8}
$$
Here, $\alpha$ is a base bonus, $N_{\text{undef}}$ tracks undefined tags, and the last term penalizes excessive tool invocations beyond a threshold $N_{\text{max}}$ .
(ii) Logical Correctness Reward $R_{\text{logic}}(y)$ : Evaluates the final answer $\hat{a}$ against the ground truth $a^{*}$ .
$$
R_{\text{logic}}(y)=\begin{cases}W-\lambda_{\text{len}}\cdot\Delta_{\text{len}}(\hat{a},a^{*})&\text{if }\hat{a}=a^{*}\\
-W&\text{if }\hat{a}\neq a^{*}\end{cases} \tag{9}
$$
where $\Delta_{\text{len}}$ penalizes length discrepancies to discourage verbose reasoning and promote conciseness. Detailed hyperparameter settings are provided in Appendix C.
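The hierarchy of Eqs. 6-9 can be sketched directly in code. All hyperparameter values below are illustrative placeholders rather than the paper's settings (which are deferred to Appendix C), and a rollout is represented as a plain dictionary:

```python
# Illustrative hyperparameters (placeholders, not the paper's values).
W = 1.0          # correctness weight
GAMMA_S = 0.5    # fatal structural penalty (gamma_struct)
BETA_S = 0.2     # format-violation penalty (beta_struct); GAMMA_S > BETA_S > 0
ALPHA = 0.1      # base structural bonus
L_TAG, L_CALL, L_LEN = 0.05, 0.05, 0.001
N_MAX = 4        # tool-call budget

def reward(y):
    """y: dict describing one rollout; returns the scalar R(y) of Eq. 6."""
    if y["fatal"]:                     # L1: timeouts, loops, runaway tool calls
        return -GAMMA_S - W
    if y["invalid_format"]:            # L2: missing tags / no extractable answer
        return -BETA_S - W
    # L3: valid responses score structure + logical correctness (Eq. 7).
    r_struct = ALPHA - L_TAG * y["n_undef_tags"] \
                     - L_CALL * max(y["n_calls"] - N_MAX, 0)      # Eq. 8
    if y["answer"] == y["gold"]:
        r_logic = W - L_LEN * y["len_delta"]                      # Eq. 9
    else:
        r_logic = -W
    return r_struct + r_logic

ok = {"fatal": False, "invalid_format": False, "n_undef_tags": 0,
      "n_calls": 2, "answer": "42", "gold": "42", "len_delta": 0}
assert reward(ok) > reward(dict(ok, answer="41")) > reward(dict(ok, fatal=True))
```

The strict ordering falls out of the constants: with $\gamma_{\text{struct}}>\beta_{\text{struct}}>0$, every fatal response scores below every format violation, which in turn scores below any valid response.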
Table 1: Comparative evaluation between our proposed formal logic verification (FLV) methods (gray background) and other approaches. Bold values denote the best results. KOR-Bench and BBH contain multiple subfields; we report macro-averaged scores.
| Model | KOR-Bench (Logical) | BBH (Logical) | MATH500 (Math) | AIME24 (Math) | GPQA-D (General) | TheoremQA (General) | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B | | | | | | | |
| Base | 13.2 | 41.9 | 60.2 | 6.5 | 29.3 | 29.1 | 30.0 |
| Qwen2.5-7B-Instruct | 40.2 | 67.0 | 75.0 | 9.4 | 33.8 | 36.6 | 43.7 |
| SimpleRL-Zoo | 34.2 | 59.8 | 74.0 | 14.8 | 24.2 | 43.1 | 41.7 |
| General-Reasoner | 43.9 | 61.9 | 73.4 | 12.7 | 38.9 | 45.3 | 46.0 |
| RLPR | 42.2 | 66.2 | 77.2 | 14.2 | 37.9 | 44.3 | 47.0 |
| Synlogic | 48.1 | 66.5 | 74.6 | 15.4 | 27.8 | 39.2 | 45.3 |
| ZeroTIR | 30.0 | 40.0 | 62.4 | 28.5 | 28.8 | 36.4 | 37.7 |
| SimpleTIR | 37.0 | 62.0 | 82.6 | 41.0 | 22.7 | 51.1 | 49.4 |
| FLV-SFT (Ours) | 48.0 | 68.5 | 77.2 | 20.0 | 32.3 | 53.0 | 49.8 |
| FLV-RL (Ours) | 51.0 | 70.0 | 78.6 | 20.8 | 35.4 | 55.7 | 51.9 |
| Qwen2.5-14B | | | | | | | |
| Base | 37.4 | 52.0 | 65.4 | 3.6 | 32.8 | 33.0 | 37.4 |
| Qwen2.5-14B-Instruct | 51.5 | 72.9 | 77.4 | 12.2 | 41.4 | 41.9 | 49.6 |
| SimpleRL-Zoo | 37.2 | 72.7 | 77.2 | 12.9 | 39.4 | 48.9 | 48.1 |
| General-Reasoner | 41.3 | 71.5 | 78.6 | 17.5 | 43.4 | 55.3 | 51.3 |
| FLV-SFT (Ours) | 54.0 | 77.5 | 79.8 | 21.9 | 40.4 | 60.6 | 55.7 |
| FLV-RL (Ours) | 57.0 | 78.0 | 81.4 | 30.2 | 41.4 | 63.5 | 58.6 |
#### 4.2.2 Optimization Objective
We optimize $\pi_{\theta}$ using GRPO. For each input $x\sim\mathcal{D}_{\text{difficult}}$ , we sample $G$ outputs $\{y_{1},\dots,y_{G}\}$ and compute:
$$
\begin{split}\mathcal{L}_{\text{GRPO}}(\theta)&=\mathbb{E}_{x\sim\mathcal{D}_{\text{difficult}}}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\sum_{t=1}^{|y_{i}|}\bigg(\\
&\qquad\min\big(r_{i,t}\hat{A}_{i},\mathrm{clip}(r_{i,t},1-\epsilon,1+\epsilon)\hat{A}_{i}\big)\\
&\qquad-\beta\log\frac{\pi_{\theta}(y_{i,t}|x,y_{i,<t})}{\pi_{\text{ref}}(y_{i,t}|x,y_{i,<t})}\bigg)\Bigg]\end{split} \tag{10}
$$
where $r_{i,t}=\pi_{\theta}(y_{i,t}|x,y_{i,<t})/\pi_{\text{old}}(y_{i,t}|x,y_{i,<t})$ . The advantage $\hat{A}_{i}$ is group-normalized on $R(y_{i})$ , stabilizing training by emphasizing relative quality over absolute reward.
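The group normalization of $\hat{A}_{i}$ can be sketched in a few lines (a minimal illustration; the actual computation lives inside the RL framework):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward R(y_i) against its own group of G
    samples for the same prompt, so the policy update reflects relative
    quality rather than absolute reward scale. eps guards division by zero
    when all rewards in a group are identical."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt: advantages are zero-mean, and the best
# rollout receives the largest positive advantage.
adv = group_advantages([1.1, -1.5, 0.9, -1.2])
assert abs(sum(adv)) < 1e-6 and adv[0] == max(adv)
```

A degenerate group (all rewards equal) yields all-zero advantages, i.e., no gradient signal, which is one motivation for filtering out prompts the policy already solves or always fails.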
## 5 Experiment
### 5.1 Experimental Setup
Models. We utilize the Qwen2.5-7B and Qwen2.5-14B (Qwen-Team, 2025) as our backbone architectures. These models serve as the initialization point for both the SFT and subsequent Policy Optimization stages.
Training Data. Our training corpus is constructed using three datasets: WebInstruct-Verified (Ma et al., 2025), K&K (Xie et al., 2024), and NuminaMath-TIR (LI et al., 2024). These sources provide a collection of diverse, verifiable reasoning tasks across multiple domains. We employ DeepSeek-R1 (Guo et al., 2025a) for data distillation and difficulty assessment via pass-rate. We utilize GPT-4o (Hurst et al., 2024) as a judge for answer correctness and Claude-Sonnet-4.5 (Anthropic, 2024) to synthesize the interleaved formal logic steps as detailed in Section 4. The RL data is selected based on the pass rate of answers generated by the teacher model DeepSeek-R1, retaining only questions with a pass rate below 50%. The categorical distribution of our curated training data is illustrated in Figure 7.
Evaluation. We conduct a comprehensive evaluation across three distinct reasoning domains to assess models:
- Logical Reasoning: We employ KOR (Ma et al., 2024) to evaluate knowledge-grounded logical reasoning across diverse domains and BBH (Suzgun et al., 2023) for challenging tasks requiring multi-step deduction.
- Mathematical Reasoning: We evaluate performance on MATH-500 (Hendrycks et al., 2024) for competition-level mathematics problems and AIME 2024 for Olympiad-level mathematical reasoning challenges.
- General Reasoning: We utilize GPQA-Diamond (Rein et al., 2024) for graduate-level reasoning in subdomains including physics, chemistry, and biology. Additionally, we use TheoremQA (Chen et al., 2023) to assess graduate-level theorem application across mathematics, physics, Electrical Engineering & Computer Science (EE&CS), and Finance, testing the model's ability to correctly apply and reason with formal theorems.
All evaluations use OpenCompass (Contributors, 2023) with greedy decoding, except AIME24 which reports avg@16 from sampling runs following RLPR (Yu et al., 2025).
Baselines. To validate the effectiveness of our framework, we compare our approach against seven categories of models: (i) Base Models: Qwen2.5-7B/14B (Qwen-Team, 2025), establishing the baseline performance to measure the net gain of our training methodology. (ii) SimpleRL-Zoo (Zeng et al., 2025), a comprehensive collection of mathematics-focused RL models. (iii) General-Reasoner (Ma et al., 2025), a suite of general-domain RL models trained using a model-based verifier. (iv) RLPR-7B (Yu et al., 2025), a general-domain RL model trained via a simplified verifier-free framework. (v) SynLogic-7B (Liu et al., 2025b), a specialized model trained to enhance the logical reasoning capabilities of LLMs. (vi) ZeroTIR (Mai et al., 2025), a tool-integrated reasoning model specifically designed to execute Python code for solving mathematical problems. (vii) SimpleTIR (Xue et al., 2025), a multi-turn tool-integrated reasoning model for mathematical reasoning problems.
Implementation Details. We implement our RL training using the verl framework (Sheng et al., 2025a), following ToRL (Li et al., 2025b). SFT Stage: We use a learning rate of $1\times 10^{-5}$ with cosine scheduling and a global batch size of 32. The model is trained for 3 epochs. RL Stage: We employ a learning rate of $5\times 10^{-7}$ . We generate 8 rollouts per prompt with a temperature of 1.0. To prevent policy divergence, we set the KL coefficient to 0.05 and the clip ratio to 0.3. The training utilizes a batch size of 1024 and a context length of 16,384 tokens. Training proceeds for 120 steps on a cluster of 16 NVIDIA H800 GPUs. To manage computational overhead, we limit the formal verification process to a maximum of 4 iterative rounds.
### 5.2 Main Results
Table 2: We compare the performance of our proposed method (FLV) against natural language baselines across two training stages: SFT and RL (GRPO). Natural-SFT/RL denotes models trained on the same data but without formal logic verification components. FLV-SFT/RL denotes our method incorporating formal logic modules and execution feedback.
| Model | KOR-Bench (Logical) | BBH (Logical) | MATH500 (Math) | AIME24 (Math) | GPQA-D (General) | TheoremQA (General) | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 13.2 | 41.9 | 60.2 | 6.5 | 29.3 | 29.1 | 30.0 |
| Natural-SFT | 30.4 | 55.9 | 56.6 | 8.5 | 27.3 | 39.1 | 36.3 |
| FLV-SFT (Ours) | 48.0 | 68.5 | 77.2 | 20.0 | 32.3 | 53.0 | 49.8 |
| Natural-RL | 35.7 | 55.4 | 54.4 | 4.8 | 30.3 | 41.2 | 37.0 |
| FLV-RL (Ours) | 51.0 | 70.0 | 78.6 | 20.8 | 35.4 | 55.7 | 51.9 |
Table 1 presents the comprehensive evaluation of our Formal Logic Verification (FLV) approach against standard baselines across Qwen2.5-7B and 14B scales.
Formal logic verification-guided methods outperform traditional natural language-based methods. As shown in the results, our proposed FLV framework demonstrates superior performance compared to standard natural language reasoning approaches. Notably, even our supervised fine-tuning stage (FLV-SFT) surpasses all comparative RL baselines on the 7B scale. On Qwen2.5-7B, FLV-SFT achieves an average score of 49.8, outperforming the strongest natural language baseline (RLPR, 47.0) by 2.8 points. This suggests that integrating formal logic verification during the SFT phase provides a more robust reasoning foundation than standard RL training on natural language chains alone. Furthermore, the application of Group Relative Policy Optimization (FLV-RL) yields consistent improvements over the SFT stage. For the 14B model, FLV-RL improves upon FLV-SFT by increasing the average score from 55.7 to 58.6, with significant gains in hard mathematical tasks like AIME 2024 (+8.3%) and general theorem application in TheoremQA (+2.9%). This confirms that our verifier-guided RL effectively refines the policy beyond the supervised baseline.
Formal logic verification unlocks model reasoning potential, achieving SOTA with limited data. Despite utilizing a concise training set (approx. 17k samples total), our approach establishes new state-of-the-art performance among models of similar size, significantly outperforming baselines that typically rely on larger-scale data consumption. (i) On the challenging AIME 2024 benchmark, our FLV-RL-14B model achieves 30.2%, nearly doubling the performance of the General-Reasoner baseline (17.5%) and far exceeding the Base model (3.6%). Similarly, on MATH-500, we achieve 81.4%, surpassing all baselines. (ii) We observe dominant performance on TheoremQA (63.5% on 14B), outperforming the nearest competitor by over 8 points. In logical reasoning (KOR-Bench), our method achieves a 15.7% improvement over the General-Reasoner on the 14B scale (57.0 vs 41.3). While FLV shows a slight weakness on GPQA-Diamond (likely due to benchmark reliability issues discussed in Appendix G), our method consistently excels in tasks requiring rigorous multi-step deduction and symbolic manipulation, validating the hypothesis that formal verification serves as a catalyst for deep reasoning capabilities.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart: Comparison of SimpleTIR and FLV-RL Usage Across Categories
### Overview
This is a horizontal bar chart comparing the percentage usage of two methods, SimpleTIR and FLV-RL, across three categories: Symbolic/Logic, Numerical/Scientific, and Other. The chart uses a diverging bar design, with SimpleTIR represented by blue bars extending to the left and FLV-RL represented by red bars extending to the right. The x-axis represents the percentage, ranging from 0% to 80%.
### Components/Axes
* **Y-axis (Categories):**
* Symbolic/Logic
* Numerical/Scientific
* Other
* **X-axis (Percentage):** Labeled "Percentage (%)", ranging from 0 to 80.
* **Legend:** Located at the top-right of the chart.
* Blue: SimpleTIR
* Red: FLV-RL
* **Annotations:**
* Left arrow with text: "SimpleTIR uses more" (pointing left)
* Right arrow with text: "FLV-RL uses more" (pointing right)
### Detailed Analysis
The chart consists of three sets of paired horizontal bars, one set for each category.
* **Symbolic/Logic:**
* SimpleTIR: The blue bar extends to approximately 42.5%.
* FLV-RL: The red bar extends to approximately 62.5%.
* Trend: FLV-RL usage is significantly higher than SimpleTIR usage in this category.
* **Numerical/Scientific:**
* SimpleTIR: The blue bar extends to approximately 21.8%.
* FLV-RL: The red bar extends to approximately 20.4%.
* Trend: Usage is relatively similar between SimpleTIR and FLV-RL in this category, with SimpleTIR slightly more prevalent.
* **Other:**
* SimpleTIR: The blue bar extends to approximately 35.7%.
* FLV-RL: The red bar extends to approximately 17.1%.
* Trend: SimpleTIR usage is significantly higher than FLV-RL usage in this category.
### Key Observations
* FLV-RL is used more frequently in the "Symbolic/Logic" category.
* SimpleTIR is used more frequently in the "Other" category.
* Usage is nearly equal in the "Numerical/Scientific" category.
* The difference in usage between the two methods is most pronounced in the "Symbolic/Logic" and "Other" categories.
### Interpretation
The data suggests that the choice between SimpleTIR and FLV-RL depends on the type of task being performed. FLV-RL appears to be better suited for tasks involving symbolic or logical reasoning, while SimpleTIR is preferred for other types of tasks. The near-equal usage in the "Numerical/Scientific" category indicates that either method can be effectively used for these types of tasks. The diverging bar chart effectively highlights these differences in usage patterns across the different categories. The annotations clearly indicate the direction of preference for each method within each category. The chart provides a clear visual representation of the strengths of each method in different application areas.
</details>
Figure 4: Python packages type distribution invoked by SimpleTIR (blue) vs. FLV-RL (red) across three domains.
Formal verification instills a shift from calculation to symbolic reasoning, enabling superior generalization. While tool-integrated baselines like SimpleTIR primarily utilize tools as "solvers" for direct computation (achieving 41.0 on AIME24), this paradigm struggles with tasks requiring rigorous logical deduction. In contrast, our FLV framework employs formal methods as a "verifier" to enforce logical consistency. This approach yields dominant performance on logic-heavy benchmarks such as KOR-Bench (51.0 vs. 37.0 for SimpleTIR) and GPQA-Diamond (35.4 vs. 28.8 for ZeroTIR). To understand the mechanism behind this reliability, we analyze the distribution of invoked Python packages in Figure 4. The data reveal a distinct behavioral shift: whereas SimpleTIR relies heavily on generic utility packages (Other), FLV-RL shows a marked surge in the usage of Symbolic/Logic libraries. These formal tools constitute 62.5% of FLV-RL's calls, a 20-point increase over SimpleTIR. Meanwhile, the usage of Numerical/Scientific libraries remains stable ( $\sim$ 21%), indicating that our method's gains are driven specifically by the adoption of symbolic logic engines to verify reasoning processes rather than merely to compute numerical answers. See Appendix 9 for the package categorization principles.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Box Plot: Token Length Comparison
### Overview
The image presents a box plot comparing the token length distributions for three different models: General-Reasoner, SimpleTIR, and FLV-RL. The y-axis represents "Token length," and the x-axis labels the three models. Each model's distribution is visualized using a box plot, showing the median, quartiles, and potential outliers.
### Components/Axes
* **Y-axis:** "Token length" - Scale ranges from 0 to 20000, with gridlines at 5000-unit intervals.
* **X-axis:** Model names: "General-Reasoner", "SimpleTIR", "FLV-RL".
* **Box Plots:** Each model is represented by a box plot with the following components:
* Box: Represents the interquartile range (IQR).
* Line inside the box: Represents the median.
* Whiskers: Extend to the furthest data point within 1.5 times the IQR.
* Points beyond the whiskers: Represent potential outliers.
* **Colors:**
* General-Reasoner: Teal
* SimpleTIR: Light Blue
* FLV-RL: Light Red/Salmon
### Detailed Analysis
**General-Reasoner (Teal):**
The box plot for General-Reasoner is positioned on the left. The box is relatively small, indicating a tight distribution of token lengths.
* Minimum: Approximately 562
* First Quartile (Q1): Approximately 933
* Median: Approximately 1344
* Third Quartile (Q3): Approximately 1344
* Maximum: Approximately 1344 (no outliers visible)
**SimpleTIR (Light Blue):**
The box plot for SimpleTIR is in the center. The box is larger than General-Reasoner's, suggesting a wider spread of token lengths.
* Minimum: Approximately 2828
* First Quartile (Q1): Approximately 4352
* Median: Approximately 6985
* Third Quartile (Q3): Approximately 6985
* Maximum: Approximately 6985 (no outliers visible)
**FLV-RL (Light Red/Salmon):**
The box plot for FLV-RL is on the right. This box is the largest of the three, indicating the most significant variability in token lengths.
* Minimum: Approximately 3478
* First Quartile (Q1): Approximately 6180
* Median: Approximately 6180
* Third Quartile (Q3): Approximately 9862
* Maximum: Approximately 9862 (no outliers visible)
### Key Observations
* The median token length increases from General-Reasoner to SimpleTIR to FLV-RL.
* FLV-RL exhibits the largest spread in token lengths, as indicated by the size of its box and the distance between its quartiles.
* General-Reasoner has the smallest spread in token lengths, suggesting more consistent output lengths.
* There are no visible outliers in any of the box plots.
### Interpretation
The data suggest that the FLV-RL model tends to generate longer responses, with a wider range of lengths, than SimpleTIR and General-Reasoner, while General-Reasoner consistently produces the shortest outputs with the least variability. This could reflect differences in the complexity of the generated content, the models' architectures, or their training data. The increasing median token length from General-Reasoner to FLV-RL suggests progressively more detailed responses, and the absence of outliers indicates that each model's length distribution is relatively stable. The differences in spread could matter for downstream concerns such as memory usage or processing time.
</details>
Figure 5: Token length distribution comparison across General-Reasoner, SimpleTIR, and FLV-RL. The box plots illustrate the median token usage (center line) and interquartile ranges.
Efficiency Analysis. We analyze the computational cost of our approach by comparing the token length distributions of the natural language baseline (General-Reasoner), tool-integrated reasoning (SimpleTIR), and our FLV-RL method (Figure 5). While FLV-RL incurs a moderate computational overhead, we argue that this cost is justified by the substantial performance gains observed across diverse domains. The increased token consumption represents a necessary trade-off for achieving breakthrough generalization and ensuring logical soundness in high-stakes reasoning tasks.
### 5.3 Ablation Studies
To evaluate the individual contributions of our proposed components, we conducted an ablation study examining two critical dimensions: the impact of FLV versus pure natural language reasoning, and the effectiveness of different training paradigms. Results are presented in Table 2.
Impact of formal logic verification. Comparing FLV-based models against natural language baselines trained on identical data reveals substantial improvements. FLV-SFT achieves 49.8% average accuracy versus 36.3% for Natural-SFT, with particularly strong gains on logic-intensive tasks (KOR-Bench: +17.6 points, TheoremQA: +13.9 points). This demonstrates that formal proofs and execution validation fundamentally improve reasoning by grounding outputs in verifiable logic rather than probabilistic patterns.
Impact of multi-stage training. Supervised fine-tuning establishes a strong foundation, improving the average from 30.0% (Base) to 49.8% (FLV-SFT), and policy optimization yields further substantial gains to 51.9% (FLV-RL). Notably, the natural language baseline barely improves with RL (37.0% vs. 36.3%), while FLV-RL substantially outperforms FLV-SFT, indicating that formal verification provides more stable and reliable reward signals for policy optimization.
### 5.4 Verification Paradigm: Balancing Formalism and Computational Fluency
Our initial data construction enforced explicit verification outputs (e.g., proved / disproved) after each logical module. However, this rigid format introduced two critical issues: (i) formal language redundancy, and (ii) suppression of direct calculation. When computational verification was needed, models would bypass direct arithmetic in favor of indirect validation via z3-solver (e.g., asserting A + B == C is proved rather than computing the sum), significantly degrading mathematical performance. To address this, we adopted a flexible verification strategy that decouples calculation from validation: (i) Calculation as inference: Models invoke numerical tools directly during reasoning without mandatory verification keywords. (ii) Logic as validation: Formal verification serves as post-hoc validation rather than a per-step constraint. Figure 6 compares performance across logic, general, and math subsets under both paradigms. The flexible approach substantially improves math scores while preserving logical reasoning capability. Representative cases are detailed in Appendix E.
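The distinction between the two paradigms can be made concrete with a toy sketch, where `verify` stands in for a call to z3-solver (the function names and routing below are illustrative assumptions, not the paper's implementation):

```python
def enforced_step(a, b, claimed_sum, verify):
    # Enforced paradigm: even arithmetic is routed through the solver as an
    # assertion ("A + B == C is proved") instead of being computed directly.
    return "proved" if verify(a + b == claimed_sum) else "disproved"

def flexible_step(a, b):
    # Flexible paradigm, calculation as inference: numerical work is plain
    # computation, with no mandatory verification keyword.
    return a + b

def flexible_validate(claims, verify):
    # Flexible paradigm, logic as validation: the formal check runs post hoc
    # over the accumulated logical claims rather than after every step.
    return all(verify(c) for c in claims)

trivial_verify = bool  # placeholder for a real z3-solver call

assert enforced_step(2, 3, 5, trivial_verify) == "proved"
assert flexible_step(2, 3) == 5
assert flexible_validate([2 + 3 == 5, 5 > 4], trivial_verify)
```

The key design choice is the decoupling itself: the verifier is reserved for claims that genuinely need formal checking, so arithmetic is never forced through the slower, indirect assertion path.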
<details>
<summary>x6.png Details</summary>

### Visual Description
## Bar Chart: Performance Gains of Verification Paradigms
### Overview
The image presents a comparison of two verification paradigms, "Enforced" and "Flexible", across three benchmarks: MATH500, BBH, and GPQA-D. The comparison is based on accuracy, measured in percentage (%). The upper portion of the image illustrates the two paradigms visually.
### Components/Axes
* **Title:** (a) Verification Paradigms, (b) Performance Gains
* **X-axis:** Benchmarks - MATH500, BBH, GPQA-D
* **Y-axis:** Accuracy (%) - Scale ranges from approximately 0% to 80%.
* **Legend:**
* Blue: Enforced
* Red: Flexible (Ours)
* **Diagram Elements:** "Step1", "Step2", "Verify", "calculation" labels within boxes representing the paradigms.
### Detailed Analysis
The chart consists of three sets of paired bar graphs, one for each benchmark.
* **MATH500:**
* Enforced: Accuracy is approximately 60.0%.
* Flexible: Accuracy is approximately 71.0%.
* **BBH:**
* Enforced: Accuracy is approximately 51.3%.
* Flexible: Accuracy is approximately 61.0%.
* **GPQA-D:**
* Enforced: Accuracy is approximately 29.8%.
* Flexible: Accuracy is approximately 31.3%.
The upper section of the image shows two rows representing the "Enforced" and "Flexible" paradigms.
* **Enforced Paradigm:** Consists of "Step1" box, a "Verify" box (with a warning symbol), "Step2" box, and another "Verify" box (with a warning symbol).
* **Flexible Paradigm:** Consists of "Step1" box, a "calculation" box, "Step2" box, and a "Verify" box (with a checkmark symbol).
### Key Observations
* The "Flexible" paradigm consistently outperforms the "Enforced" paradigm across all three benchmarks.
* The largest performance gain is observed in the MATH500 benchmark, with a difference of approximately 11.0% in accuracy.
* The smallest performance gain is observed in the GPQA-D benchmark, with a difference of approximately 1.5% in accuracy.
* The "Enforced" paradigm includes a "Verify" step after each step, while the "Flexible" paradigm includes a "calculation" step instead of a "Verify" step after the first step.
### Interpretation
The data suggests that the "Flexible" verification paradigm is more effective than the "Enforced" paradigm in achieving higher accuracy across the tested benchmarks. The inclusion of a "calculation" step in the "Flexible" paradigm, instead of immediate verification, may allow for more robust and accurate results. The consistent outperformance of the "Flexible" paradigm indicates a potential advantage in its approach to verification. The relatively small gain in GPQA-D suggests that this benchmark may be less sensitive to the differences between the two paradigms, or that other factors are influencing performance. The visual representation of the paradigms highlights the key difference in their approach: immediate verification versus a calculation step followed by verification. The warning symbol on the "Verify" boxes in the "Enforced" paradigm could imply potential issues or limitations in that approach.
</details>
Figure 6: Enforced vs. Flexible Verification Paradigms. (a) Enforced verification imposes rigid checkpoints throughout the reasoning process, while flexible verification enables adaptive utilization of logic verification. (b) Performance gains after switching to flexible reasoning across three representative benchmarks.
## 6 Conclusion
In this work, we addressed the fundamental tension between probabilistic language generation and logical consistency in LLM reasoning by introducing a framework that dynamically integrates formal logic verification into the reasoning process. Through our two-stage training methodology combining FLV-SFT's rigorous data synthesis pipeline with formal logic verification-guided policy optimization, we demonstrated that real-time symbolic feedback can effectively mitigate logical fallacies that plague standard Chain-of-Thought approaches. Empirical evaluation across six diverse benchmarks validates our approach, with our 7B and 14B models achieving average improvements of 10.4% and 14.2% respectively over SOTA baselines, while providing interpretable step-level correctness guarantees. Beyond performance gains, our framework establishes a principled foundation for trustworthy reasoning systems by bridging neural fluency with symbolic rigor, thereby enabling more robust logical inference. This opens pathways toward more reliable AI that scales effectively to complex real-world problems across domains requiring strict logical soundness.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.
## References
- J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin (2024) Large language models for mathematical reasoning: progresses and challenges. arXiv preprint arXiv:2402.00157. Cited by: §1.
- A. Anthropic (2024) Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card. External Links: Link Cited by: §5.1.
- B. Baker, J. Huizinga, L. Gao, Z. Dou, M. Y. Guan, A. Madry, W. Zaremba, J. Pachocki, and D. Farhi (2025) Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926. Cited by: §1.
- C. Cao, M. Li, J. Dai, J. Yang, Z. Zhao, S. Zhang, W. Shi, C. Liu, S. Han, and Y. Guo (2025a) Towards advanced mathematical reasoning for llms via first-order logic theorem proving. arXiv preprint arXiv:2506.17104. Cited by: §2.2.
- C. Cao, H. Zhu, J. Ji, Q. Sun, Z. Zhu, Y. Wu, J. Dai, Y. Yang, S. Han, and Y. Guo (2025b) SafeLawBench: towards safe alignment of large language models. arXiv preprint arXiv:2506.06636. Cited by: §1.
- J. Chen, Q. He, S. Yuan, A. Chen, Z. Cai, W. Dai, H. Yu, Q. Yu, X. Li, J. Chen, et al. (2025a) Enigmata: scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914. Cited by: §1.
- W. Chen, M. Yin, M. Ku, P. Lu, Y. Wan, X. Ma, J. Xu, X. Wang, and T. Xia (2023) Theoremqa: a theorem-driven question answering dataset. arXiv preprint arXiv:2305.12524. Cited by: 3rd item.
- Y. Chen, J. Benton, A. Radhakrishnan, J. Uesato, C. Denison, J. Schulman, A. Somani, P. Hase, M. Wagner, F. Roger, et al. (2025b) Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410. Cited by: §1.
- Z. Chen, J. Yang, T. Xiao, R. Zhou, L. Zhang, X. Xi, X. Shi, W. Wang, and J. Wang (2025c) Reinforcement learning for tool-integrated interleaved thinking towards cross-domain generalization. arXiv preprint arXiv:2510.11184. Cited by: §2.1.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §2.1.
- O. Contributors (2023) OpenCompass: a universal evaluation platform for foundation models. Note: https://github.com/open-compass/opencompass Cited by: §5.1.
- J. Feng, S. Huang, X. Qu, G. Zhang, Y. Qin, B. Zhong, C. Jiang, J. Chi, and W. Zhong (2025a) Retool: reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536. Cited by: §2.1, §2.2.
- Y. Feng, N. Weir, K. Bostrom, S. Bayless, D. Cassel, S. Chaudhary, B. Kiesl-Reiter, and H. Rangwala (2025b) VeriCoT: neuro-symbolic chain-of-thought validation via logical consistency checks. arXiv preprint arXiv:2511.04662. Cited by: §1, §2.2.
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025a) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §5.1.
- J. Guo, Z. Chi, L. Dong, Q. Dong, X. Wu, S. Huang, and F. Wei (2025b) Reward reasoning model. arXiv preprint arXiv:2505.14674. Cited by: §1.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), External Links: Link Cited by: §2.1.
- D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2024) Measuring mathematical problem solving with the MATH dataset. URL: https://arxiv.org/abs/2103.03874. Cited by: 2nd item.
- J. Hu, J. Zhang, Y. Zhao, and T. Ringer (2025) HybridProver: augmenting theorem proving with llm-driven proof synthesis and refinement. arXiv preprint arXiv:2505.15740. Cited by: §2.2.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) Gpt-4o system card. arXiv preprint arXiv:2410.21276. Cited by: §5.1.
- Y. Ji, X. Tian, S. Zhao, H. Wang, S. Chen, Y. Peng, H. Zhao, and X. Li (2025) AM-thinking-v1: advancing the frontier of reasoning at 32b scale. arXiv preprint arXiv:2505.08311. Cited by: §1.
- R. Kamoi, Y. Zhang, N. Zhang, S. S. S. Das, and R. Zhang (2025) Training step-level reasoning verifiers with formal verification tools. arXiv preprint arXiv:2505.15960. Cited by: §1, §2.2.
- M. Khalifa, R. Agarwal, L. Logeswaran, J. Kim, H. Peng, M. Lee, H. Lee, and L. Wang (2025) Process reward models that think. arXiv preprint arXiv:2504.16828. Cited by: §2.1.
- H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi (2022) CodeRL: mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems 35, pp. 21314–21328. Cited by: §2.1.
- J. O. J. Leang, G. Hong, W. Li, and S. B. Cohen (2025) Theorem prover as a judge for synthetic data generation. arXiv preprint arXiv:2502.13137. Cited by: §1.
- C. Li, Z. Tang, Z. Li, M. Xue, K. Bao, T. Ding, R. Sun, B. Wang, X. Wang, J. Lin, et al. (2025a) CoRT: code-integrated reasoning within thinking. arXiv preprint arXiv:2506.09820. Cited by: §2.2.
- J. LI, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y. Fleureau, G. Lample, and S. Polu (2024) NuminaMath. Numina. Note: [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf) Cited by: §2.1, §5.1.
- J. Li and H. T. Ng (2025) The hallucination dilemma: factuality-aware reinforcement learning for large reasoning models. arXiv preprint arXiv:2505.24630. Cited by: §1, §2.2.
- X. Li, H. Zou, and P. Liu (2025b) ToRL: scaling tool-integrated rl. External Links: 2503.23383, Link Cited by: §2.2, §5.1.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let's verify step by step. In The Twelfth International Conference on Learning Representations, Cited by: §2.1, §2.2.
- C. Liu, Y. Yuan, Y. Yin, Y. Xu, X. Xu, Z. Chen, Y. Wang, L. Shang, Q. Liu, and M. Zhang (2025a) Safe: enhancing mathematical reasoning in large language models via retrospective step-aware formal verification. arXiv preprint arXiv:2506.04592. Cited by: §1, §2.2.
- J. Liu, Y. Fan, Z. Jiang, H. Ding, Y. Hu, C. Zhang, Y. Shi, S. Weng, A. Chen, S. Chen, et al. (2025b) SynLogic: synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv preprint arXiv:2505.19641. Cited by: §1, §5.1.
- S. Liu, H. Liu, J. Liu, L. Xiao, S. Gao, C. Lyu, Y. Gu, W. Zhang, D. F. Wong, S. Zhang, et al. (2025c) Compassverifier: a unified and robust verifier for llms evaluation and outcome reward. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 33454–33482. Cited by: §1.
- K. Ma, X. Du, Y. Wang, H. Zhang, Z. Wen, X. Qu, J. Yang, J. Liu, M. Liu, X. Yue, W. Huang, and G. Zhang (2024) KOR-bench: benchmarking language models on knowledge-orthogonal reasoning tasks. External Links: 2410.06526, Link Cited by: 1st item.
- X. Ma, Q. Liu, D. Jiang, G. Zhang, Z. Ma, and W. Chen (2025) General-reasoner: advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652. Cited by: Appendix C, §1, §2.1, §5.1, §5.1.
- X. Mai, H. Xu, W. Wang, Y. Zhang, W. Zhang, et al. (2025) Agentic rl scaling law: spontaneous code execution for mathematical problem solving. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, Cited by: §5.1.
- A. Ospanov, Z. Feng, J. Sun, H. Bai, X. Shen, and F. Farnia (2025) HERMES: towards efficient and verifiable mathematical reasoning in llms. arXiv preprint arXiv:2511.18760. Cited by: §1, §2.2.
- Qwen-Team (2025) Qwen2.5 technical report. External Links: 2412.15115, Link Cited by: §5.1, §5.1.
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, Cited by: 3rd item.
- Z. Ren, Z. Shao, J. Song, H. Xin, H. Wang, W. Zhao, L. Zhang, Z. Fu, Q. Zhu, D. Yang, et al. (2025) Deepseek-prover-v2: advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801. Cited by: §2.2.
- S. She, J. Liu, Y. Liu, J. Chen, X. Huang, and S. Huang (2025) R-prm: reasoning-driven process reward modeling. arXiv preprint arXiv:2503.21295. Cited by: §2.1.
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2025a) Hybridflow: a flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp. 1279–1297. Cited by: §5.1.
- J. Sheng, L. Lyu, J. Jin, T. Xia, A. Gu, J. Zou, and P. Lu (2025b) Solving inequality proofs with large language models. arXiv preprint arXiv:2506.07927. Cited by: §1, §1.
- J. Skalse, N. Howe, D. Krasheninnikov, and D. Krueger (2022) Defining and characterizing reward gaming. Advances in Neural Information Processing Systems 35, pp. 9460–9471. Cited by: §1, §2.1.
- M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. Le, E. Chi, D. Zhou, et al. (2023) Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051. Cited by: 1st item.
- Y. Tian, R. Huang, X. Wang, J. Ma, Z. Huang, Z. Luo, H. Lin, D. Zheng, and L. Du (2025) EvolProver: advancing automated theorem proving by evolving formalized problems via symmetry and difficulty. arXiv preprint arXiv:2510.00732. Cited by: §2.2.
- J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins (2022) Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275. Cited by: §2.1, §2.2.
- H. Wang, M. Unsal, X. Lin, M. Baksys, J. Liu, M. D. Santos, F. Sung, M. Vinyes, Z. Ying, Z. Zhu, et al. (2025) Kimina-prover preview: towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354. Cited by: §2.2.
- Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. (2022) Super-naturalinstructions: generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705. Cited by: §2.1.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837. Cited by: §2.1.
- C. Xie, Y. Huang, C. Zhang, D. Yu, X. Chen, B. Y. Lin, B. Li, B. Ghazi, and R. Kumar (2024) On memorization of large language models in logical reasoning. External Links: 2410.23123, Link Cited by: §5.1.
- Z. Xu, Y. Liu, Y. Yin, M. Zhou, and R. Poovendran (2025) Kodcode: a diverse, challenging, and verifiable synthetic dataset for coding. arXiv preprint arXiv:2503.02951. Cited by: §2.1.
- Z. Xue, L. Zheng, Q. Liu, Y. Li, X. Zheng, Z. Ma, and B. An (2025) Simpletir: end-to-end reinforcement learning for multi-turn tool-integrated reasoning. arXiv preprint arXiv:2509.02479. Cited by: §2.2, §5.1.
- K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and A. Anandkumar (2023) Leandojo: theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems 36, pp. 21573–21612. Cited by: §2.2.
- T. Yu, B. Ji, S. Wang, S. Yao, Z. Wang, G. Cui, L. Yuan, N. Ding, Y. Yao, Z. Liu, et al. (2025) RLPR: extrapolating rlvr to general domains without verifiers. arXiv preprint arXiv:2506.18254. Cited by: §5.1, §5.1.
- S. Zeng (2026) External Links: Link Cited by: Appendix G.
- W. Zeng, Y. Huang, Q. Liu, W. Liu, K. He, Z. Ma, and J. He (2025) Simplerl-zoo: investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892. Cited by: §2.1, §2.2, §5.1.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36, pp. 46595–46623. Cited by: §2.1.
- K. Zhou, C. Liu, X. Zhao, S. Jangam, J. Srinivasa, G. Liu, D. Song, and X. E. Wang (2025) The hidden risks of large reasoning models: a safety assessment of r1. arXiv preprint arXiv:2502.12659. Cited by: §1.
- Y. Zhou, C. Cao, J. Yang, L. Wu, C. He, S. Han, and Y. Guo (2026) LRAS: advanced legal reasoning with agentic search. arXiv preprint arXiv:2601.07296. Cited by: §2.1.
## Limitations
Despite significant improvements in logical reasoning capabilities, our framework faces two primary limitations. First, integrating real-time formal verification introduces computational overhead, increasing RL training time by approximately 2× compared to standard baselines. However, this cost is acceptable given the substantial performance gains (10.4%-14.2% improvement) and superior data efficiency: we achieve comparable results using only a fraction of the training data required by existing methods, so reduced data collection costs offset the increased training time. Second, our data synthesis pipeline faces formalization challenges when translating natural language into verifiable formal representations. While conversion success rates are high in structured domains such as mathematics and logic, ambiguous or commonsense-heavy descriptions may produce mapping errors that generate incorrect verification feedback, limiting generalizability to open-ended reasoning tasks and highlighting the need for more robust auto-formalization techniques.
## Appendix A Reward Calculation Pseudocode
Table 3: Hierarchical reward function for formal logic verification-guided policy optimization
```
Input:  output y, ground-truth answer a*, predicted answer â
Output: total reward R(y)

Hyperparameters:
    W = 3                              (correctness weight)
    γ_struct = 3.0, β_struct = 1.0     (severity penalties)
    α = 1.0                            (base structural score)
    λ_tag = 0.005, τ_tag = 200         (tag penalty coefficients)
    λ_call = 0.5, N_max = 3            (tool call limits)
    λ_len = 0.04, δ_max = 10           (length penalty)

Step 1: Check fatal errors (C_fatal)
    if token-level repetition detected, or execution timeout,
       or tool calls > 2·N_max, or multiple termination tags:
        return R(y) = -γ_struct - W

Step 2: Check invalid format (C_invalid \ C_fatal)
    if solution extraction fails, or solution length > 512 tokens,
       or missing closing tag, or N_max < tool calls ≤ 2·N_max:
        return R(y) = -β_struct - W

Step 3: Compute structural reward R_struct(y)
    N_undef = count of undefined tags
    N_call  = count of tool invocations
    R_struct(y) = α - λ_tag · min(N_undef, τ_tag)
                    - λ_call · max(N_call - N_max, 0)

Step 4: Compute correctness reward R_correct(y)
    f_len(â, a*) = min(|len(â) - len(a*)|, δ_max)
    if â matches a*:
        R_correct(y) = W - λ_len · f_len(â, a*)
    else:
        R_correct(y) = -W

Step 5: Compute total reward
    R(y) = R_struct(y) + R_correct(y)
```
Table 3 provides the complete algorithmic implementation of our multi-component reward function used in FLV-RL training. The pseudocode details the step-by-step computation of format rewards, correctness rewards, and formal verification rewards, including all constraint checks and penalty mechanisms described in Section 4.
The time complexity for calculating the reward $R(y)$ is dominated by the verification of structural constraints and semantic correctness. Let $L$ denote the length of the generated response $y$ in tokens. The initial screening for pathological states ( $\mathbb{C}_{\text{fatal}}$ ) and invalid formats ( $\mathbb{C}_{\text{invalid}}$ ) requires a linear scan of the output tokens to detect repetition loops, count tool invocations ( $N_{\text{call}}$ ), and validate tags, resulting in $O(L)$ complexity. If the response is valid, computing $R_{\text{struct}}(y)$ involves constant-time arithmetic operations after the initial scan. The semantic verification $R_{\text{correct}}(y)$ depends on the evaluation metric; assuming string matching or metric comparison between the extracted answer $\hat{a}$ and ground truth $a^{*}$ , this step operates in $O(|\hat{a}|+|a^{*}|)$ . Therefore, the total time complexity per generation is $O(L)$ , ensuring the reward calculation remains efficient and does not introduce significant computational overhead during training.
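As a concrete sketch, the hierarchical reward above can be implemented in a few lines of Python. This is illustrative only: the `Output` container and its boolean `fatal`/`invalid` flags are assumptions standing in for the upstream checks (repetition detection, timeout, tag validation), and the numeric returns follow the symbolic definitions with the hyperparameters from Appendix C.

```python
from dataclasses import dataclass

# Hyperparameters as specified in Appendix C.
W = 3                       # correctness weight
GAMMA_STRUCT = 3.0          # fatal-error penalty
BETA_STRUCT = 1.0           # invalid-format penalty
ALPHA = 1.0                 # base structural score
LAMBDA_TAG, TAU_TAG = 0.005, 200
LAMBDA_CALL, N_MAX = 0.5, 3
LAMBDA_LEN, DELTA_MAX = 0.04, 10

@dataclass
class Output:
    answer: str          # extracted answer â
    n_undef_tags: int    # count of undefined tags
    n_tool_calls: int    # count of tool invocations
    fatal: bool          # repetition / timeout / >2*N_MAX calls / multiple termination tags
    invalid: bool        # extraction failure / >512-token solution / missing closing tag

def reward(y: Output, gold: str) -> float:
    # Step 1: fatal errors receive the maximum penalty.
    if y.fatal:
        return -GAMMA_STRUCT - W
    # Step 2: invalid but recoverable format violations.
    if y.invalid or N_MAX < y.n_tool_calls <= 2 * N_MAX:
        return -BETA_STRUCT - W
    # Step 3: structural reward with capped tag penalty and call penalty.
    r_struct = (ALPHA
                - LAMBDA_TAG * min(y.n_undef_tags, TAU_TAG)
                - LAMBDA_CALL * max(y.n_tool_calls - N_MAX, 0))
    # Step 4: correctness reward with length regularization on a match.
    f_len = min(abs(len(y.answer) - len(gold)), DELTA_MAX)
    r_correct = W - LAMBDA_LEN * f_len if y.answer == gold else -W
    # Step 5: total reward.
    return r_struct + r_correct
```

A well-formed, correct response with no undefined tags and at most `N_MAX` tool calls thus earns the maximum `ALPHA + W = 4.0`, while a well-formed wrong answer earns `ALPHA - W = -2.0`.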
## Appendix B Dataset Construction Details
Our data construction pipeline systematically processes a reasoning question through multiple stages of generation, logic extraction, formal translation, and verification to create high-quality training data for supervised fine-tuning. The resulting dataset combines natural language reasoning with formally verified logical modules, enabling our models to learn both human-readable reasoning patterns and mathematically sound logical validation.
Table 4 presents a comprehensive analysis of the execution-based validation outcomes across our merged dataset of 9,162 reasoning chains (a subset of the full training data). The match rate distribution reveals that a substantial majority (59.55%) of the generated formal proofs achieve perfect alignment with expected outputs ( $v_{k}^{\text{act}}=v_{k}^{\text{exp}}$ ), successfully passing Stage 1 validation without requiring further verification. An additional 26.53% of proofs fall within the high-confidence range (95–100% match rate), indicating strong but imperfect alignment that necessitates semantic equivalence checking in Stage 2.
Notably, 40.45% of the dataset exhibits match rates below 100%, triggering our multi-stage validation pipeline. Among these cases, the consistency distribution (lower panel) demonstrates that 62.22% (2,306 instances) maintain semantic equivalence despite surface-level discrepancies, successfully recovering through Stage 2 verification or Stage 3 proof rewriting. The remaining 37.78% (1,400 instances) represent fundamental misalignments that are discarded from the training corpus. This stratified validation approach ensures that our final dataset preserves both syntactic precision and semantic coherence, with an overall retention rate of approximately 84.7% (the 5,456 perfectly matched proofs plus the 2,306 recovered instances, out of 9,162) after rigorous quality control.
The distribution pattern further reveals that only 0.83% of proofs exhibit critically low match rates (below 60%), suggesting that the initial CoT generation process produces predominantly high-quality candidates. This validates our teacher modelâs effectiveness while highlighting the necessity of execution-based verification to capture subtle logical inconsistencies that may elude purely language-based evaluation.
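As a quick sanity check, the retention rate can be recomputed directly from the counts reported in Table 4 (a small sketch; the variable names are illustrative):

```python
# Counts from Table 4: 5,456 proofs match perfectly; of the 3,706 imperfect
# ones, 2,306 are recovered as semantically consistent and 1,400 discarded.
total = 9_162
perfect = 5_456
recovered = 2_306
discarded = 1_400

assert perfect + recovered + discarded == total

retained = perfect + recovered          # 7,762 chains kept
retention_rate = retained / total
print(f"{retention_rate:.1%}")          # → 84.7%
```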
Table 4: Match rate and consistency analysis.
| Match Rate Range (%) | Count | Percentage (%) |
| --- | --- | --- |
| 0–20 | 1 | 0.01 |
| 20–40 | 11 | 0.12 |
| 40–60 | 64 | 0.70 |
| 60–70 | 63 | 0.69 |
| 70–80 | 168 | 1.83 |
| 80–85 | 149 | 1.63 |
| 85–90 | 269 | 2.94 |
| 90–95 | 551 | 6.01 |
| 95–99 | 1,402 | 15.30 |
| 99–100 (excl.) | 1,028 | 11.22 |
| 100 | 5,456 | 59.55 |
| Total | 9,162 | 100.00 |
| Consistency Distribution (when match_rate $\neq$ 100%) | | |
| Consistent | Count | Percentage (%) |
| No | 1,400 | 37.78 |
| Yes | 2,306 | 62.22 |
| Total | 3,706 | 100.00 |
<details>
<summary>x7.png Details</summary>

### Visual Description
## Pie Charts: Dataset Composition - SFT vs. RL
### Overview
The image presents two pie charts, visually comparing the composition of two datasets: "SFT Dataset" (Supervised Fine-Tuning) and "RL Dataset" (Reinforcement Learning). The charts display the percentage distribution of various subject categories within each dataset. A legend at the bottom of the image maps colors to specific subject areas and provides the absolute sample counts for each category within each dataset.
### Components/Axes
The image consists of:
* **Two Pie Charts:** One labeled "SFT Dataset (Total: 14,117 samples)" and the other labeled "RL Dataset (Total: 3,525 samples)".
* **Legend:** Located at the bottom of the image, providing a color-coded key for the subject categories. The legend also lists the number of samples for each category within each dataset.
* **Percentage Labels:** Displayed on each pie chart slice, indicating the percentage of samples belonging to that category.
### Detailed Analysis or Content Details
**SFT Dataset (Total: 14,117 samples)**
* **Mathematics:** 35.2% (2728 samples) - Light Green
* **Physics:** 19.3% (1640 samples) - Blue
* **Chemistry:** 11.6% (722 samples) - Orange
* **Economics:** 5.1% (343 samples) - Yellow
* **Puzzle:** 14.0% (1977 samples) - Gold
* **Other:** 4.8% (678 samples) - Grey
* **Business:** 11.1% (682 samples) - Brown
* **Finance:** 14.0% (395 samples) - Pink
* **History:** 5.1% (159 samples) - Red
* **Biology:** 4.8% (159 samples) - Teal
* **Computer Science:** 1.1% (36 samples) - Dark Red
* **Psychology:** 0.3% (41 samples) - Purple
* **Other STEM:** 0.2% (25 samples) - Dark Grey
* **Philosophy:** 0.1% (15 samples) - Dark Blue
* **Engineering:** 0.1% (8 samples) - Black
* **Law:** 0.1% (13 samples) - Cyan
* **Health:** 0.2% (30 samples) - Magenta
**RL Dataset (Total: 3,525 samples)**
* **Mathematics:** 32.5% (809 samples) - Light Green
* **Economics:** 23.0% (199 samples) - Blue
* **Physics:** 5.6% (1147 samples) - Orange
* **Business:** 10.0% (388 samples) - Yellow
* **Chemistry:** 11.0% (512 samples) - Gold
* **Finance:** 14.5% (351 samples) - Grey
* **Other:** 10.0% (20 samples) - Brown
* **Engineering:** 11.1% (21 samples) - Pink
* **Biology:** 1.1% (31 samples) - Red
* **Computer Science:** 0.3% (12 samples) - Teal
* **Health:** 0.1% (5 samples) - Dark Red
* **Philosophy:** 0.2% (6 samples) - Purple
* **History:** 0.2% (7 samples) - Dark Grey
* **Law:** 0.0% (0 samples) - Dark Blue
* **Psychology:** 0.0% (1 samples) - Black
* **Other STEM:** 0.0% (0 samples) - Cyan
### Key Observations
* **Mathematics Dominance:** Mathematics is the most represented category in both datasets, comprising 35.2% of the SFT dataset and 32.5% of the RL dataset.
* **Physics Disparity:** Physics has a significantly larger representation in the SFT dataset (19.3%) compared to the RL dataset (5.6%).
* **Smaller Categories:** Categories like Philosophy, Engineering, Law, Health, and Psychology have very small representations in both datasets, particularly in the RL dataset.
* **Dataset Size Difference:** The SFT dataset is considerably larger than the RL dataset (14,117 vs. 3,525 samples).
* **Economics Increase:** Economics has a much larger percentage in the RL dataset (23.0%) compared to the SFT dataset (5.1%).
### Interpretation
The SFT dataset emphasizes fundamental sciences such as Mathematics and Physics, while the RL dataset shifts weight toward Economics while retaining Mathematics as its largest category. The difference in dataset sizes and the shift in emphasis likely reflect different data collection strategies for the two training stages. The very small sample counts for categories such as Philosophy, Law, and Psychology indicate limited coverage, so generalization to those areas should be interpreted with caution.
</details>
Figure 7: Distribution of data categories in the training sets. Left: categorical breakdown of the SFT dataset (14,117 samples in total). Right: categorical breakdown of the RL dataset (3,525 samples in total). Legends list the exact number of samples for each category.
## Appendix C Hyperparameter Specification
We carefully calibrate the reward function hyperparameters to balance the learning of hybrid reasoning patterns with answer correctness, while preventing pathological behaviors during training.
Correctness vs. Format Balance. We set the correctness weight $W=3$ and base structural score $\alpha=1.0$ to ensure the model balances learning the hybrid reasoning paradigm with exploring correct answers under this new framework. This 3:1 ratio encourages the model to prioritize semantic accuracy while maintaining proper integration of natural language and formal verification components.
Fatal Error Prevention. Given the complexity of hybrid reasoning, models in early training stages are prone to generating pathological outputs such as repetitive tokens or malformed structures. We therefore define such cases as fatal errors with the maximum penalty $\gamma_{\text{struct}}=3.0$ , while setting $\beta_{\text{struct}}=1.0$ for invalid but recoverable format violations. This hierarchical penalty structure strongly discourages catastrophic failures while allowing the model to explore within reasonable boundaries.
Tag Usage Regulation. In hybrid reasoning, the model must learn our defined tag vocabulary (e.g., <code>, <interpreter>). During training, the model may explore alternative tags, which we discourage through $\lambda_{\text{tag}}=0.005$ with a cap at $\tau_{\text{tag}}=200$ . This cap prevents tag-related penalties from overwhelming correctness rewards, ensuring balanced learning of both answer accuracy and proper tag usage.
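The claim that the cap keeps tag penalties from overwhelming correctness rewards is easy to verify numerically (a minimal sketch under the stated hyperparameters; the function name is illustrative):

```python
# Hyperparameters from Appendix C.
LAMBDA_TAG, TAU_TAG, W = 0.005, 200, 3

def tag_penalty(n_undef: int) -> float:
    # Penalty grows linearly with undefined tags, capped at TAU_TAG tags.
    return LAMBDA_TAG * min(n_undef, TAU_TAG)

# Even with an arbitrarily large number of undefined tags, the penalty
# saturates at 0.005 * 200 = 1.0, one third of the correctness weight W = 3,
# so tag exploration alone can never dominate the correctness signal.
assert abs(tag_penalty(10_000) - tag_penalty(TAU_TAG)) < 1e-12
assert tag_penalty(TAU_TAG) < W
```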
Formal Verification Efficiency. Our data synthesis analysis reveals that most problems can be solved within 4 formal verification steps. To optimize inference efficiency and reduce unnecessary formalization, we set $N_{\text{max}}=3$ as the baseline limit with penalty coefficient $\lambda_{\text{call}}=0.5$ . Responses exceeding this threshold incur incremental penalties, while those exceeding $2\times N_{\text{max}}=6$ calls are classified as fatal errors, strongly discouraging excessive tool invocations.
Length Control. Following prior work on general reasoners (Ma et al., 2025), we apply length-based penalties to discourage excessively verbose generations. We set $\lambda_{\text{len}}=0.04$ with a maximum cap $\delta_{\text{max}}=10$ , and enforce a hard limit of 512 tokens for extracted solutions. This encourages concise, focused reasoning without sacrificing completeness.
Model-Based Verifier. Manual review reveals that problems from diverse domains beyond mathematics cannot be verified through rule-based methods. We therefore employ a model-based verifier. Empirical evaluation shows that CompassVerifier-7B achieves an optimal balance between accuracy and efficiency, leading to its selection as our verifier.
## Appendix D Training Dynamics and Behavior Analysis
We analyze key training dynamics across 120 optimization steps in Figure 8.
Reward Evolution. Figure 8(a) shows steady improvement from -0.45 to -0.1, indicating consistent progress in our composite reward function balancing structural integrity, semantic correctness, and efficiency.
Response Length. Figure 8(b) exhibits a U-shaped pattern: initial decrease from 3100 to 2850 tokens as the model eliminates redundancy, followed by stabilization around 3100-3200 tokens, reflecting an optimal balance between completeness and conciseness.
Formal Logic Verification Efficiency. Figure 8(c) demonstrates rapid improvement from 2.3 to 1.9 in the first 40 steps, then gradual stabilization. This shows the model learns to generate more efficiently verifiable proofs with fewer symbolic interpreter calls.
<details>
<summary>x8.png Details</summary>

### Visual Description
\n
## Line Chart: Data Trend Over Steps
### Overview
The image presents a line chart displaying a fluctuating data series plotted against "Step" values. The chart shows a generally rising trend (values moving from about -0.4 toward -0.25) with significant short-term oscillations.
### Components/Axes
* **X-axis:** Labeled "Step", ranging from approximately 0 to 130, with tick marks at intervals of 10.
* **Y-axis:** Ranges from approximately -0.4 to 0.1, with tick marks at intervals of 0.1.
* **Data Series:** A single blue line representing the data trend.
* **Legend:** Located in the bottom-right corner, labeling the line as "Step".
### Detailed Analysis
The blue line begins at approximately Step 0 with a value of -0.4.
* **Initial Trend (Steps 0-20):** The line rises steeply from -0.4 to approximately -0.35, with significant fluctuation within this range.
* **Intermediate Trend (Steps 20-60):** The climb slows, with the line oscillating between approximately -0.35 and -0.2. There are several local minima and maxima.
* **Later Trend (Steps 60-120):** The line continues to oscillate around -0.22 to -0.25, ending at approximately -0.25 at Step 120. The amplitude of the oscillations appears relatively consistent throughout this section.
Approximate data points:
* Step 0: -0.4
* Step 20: -0.35
* Step 40: -0.28
* Step 60: -0.23
* Step 80: -0.25
* Step 100: -0.22
* Step 120: -0.25
### Key Observations
* The data exhibits a clear rising trend (from about -0.4 toward -0.25) over the entire range of steps.
* The oscillations are relatively consistent in amplitude, suggesting a periodic or quasi-periodic component to the data.
* There are no obvious outliers or abrupt changes in the trend.
</details>
(a) Reward
<details>
<summary>x9.png Details</summary>

### Visual Description
\n
## Line Chart: Data Trend Over Steps
### Overview
The image presents a line chart illustrating a data trend over a series of steps. The y-axis represents a numerical value ranging from approximately 2800 to 3300, while the x-axis represents "Step" ranging from 0 to 120. The chart shows a fluctuating line, indicating changes in the data value as the step number increases.
### Components/Axes
* **X-axis:** Labeled "Step", with markers at intervals of 10 from 0 to 120.
* **Y-axis:** Ranges from approximately 2800 to 3300, with no explicit label. The scale is linear.
* **Data Series:** A single blue line representing the data trend.
* **Legend:** None.
### Detailed Analysis
The blue line begins at approximately 3200 at Step 0.
* **Steps 0-20:** The line exhibits a downward trend, decreasing from approximately 3200 to around 2900. There are fluctuations within this range.
* **Steps 20-40:** The line continues to fluctuate, remaining generally between 2900 and 3000, with some minor increases and decreases.
* **Steps 40-60:** A significant upward trend is observed, with the line increasing from approximately 2850 to a peak of around 3250.
* **Steps 60-80:** The line experiences high-frequency fluctuations, oscillating between approximately 3100 and 3300.
* **Steps 80-100:** The line continues to fluctuate, generally remaining between 3050 and 3200.
* **Steps 100-120:** The line shows a slight upward trend, ending at approximately 3180.
Approximate data points:
* Step 0: 3200
* Step 10: 3050
* Step 20: 2900
* Step 30: 2950
* Step 40: 2900
* Step 50: 2850
* Step 60: 3100
* Step 70: 3250
* Step 80: 3200
* Step 90: 3100
* Step 100: 3050
* Step 110: 3100
* Step 120: 3180
### Key Observations
* The data exhibits significant volatility, with frequent fluctuations throughout the entire range of steps.
* There is a clear upward trend between Steps 40 and 60.
* The data appears to stabilize somewhat between Steps 80 and 120, although fluctuations persist.
* The lowest value is observed around Step 50, at approximately 2850.
* The highest value is observed around Step 70, at approximately 3250.
</details>
(b) Response Length
<details>
<summary>x10.png Details</summary>

### Visual Description
\n
## Line Chart: Value Trend Over Steps
### Overview
The image presents a line chart depicting the trend of a single value over a series of steps. The x-axis represents the step number, ranging from 0 to approximately 125. The y-axis represents the value, ranging from approximately 1.9 to 2.3. The chart shows a decreasing trend initially, followed by stabilization with some fluctuations.
### Components/Axes
* **X-axis:** "Step" - Represents the step number. Scale is linear, ranging from 0 to 125.
* **Y-axis:** Value - Represents the measured value. Scale is linear, ranging from approximately 1.9 to 2.3.
* **Data Series:** A single blue line representing the value trend.
* **Legend:** No explicit legend is present, but the single blue line is the sole data series.
### Detailed Analysis
The line chart begins at approximately 2.28 at Step 0. The line initially exhibits a steep downward trend until around Step 20, where the value reaches approximately 2.05. From Step 20 to Step 40, the line continues to decrease, but at a slower rate, reaching a minimum value of approximately 1.95 at Step 40. There is a sharp peak at Step 42, reaching approximately 2.25. After the peak, the line quickly returns to around 1.95. From Step 40 to Step 125, the line fluctuates around a value of approximately 1.98, with minor oscillations.
Here's a breakdown of approximate values at specific steps:
* Step 0: 2.28
* Step 10: 2.15
* Step 20: 2.05
* Step 30: 2.02
* Step 40: 1.95
* Step 42: 2.25
* Step 50: 1.98
* Step 60: 1.97
* Step 70: 1.93
* Step 80: 1.95
* Step 90: 1.96
* Step 100: 1.97
* Step 110: 1.95
* Step 120: 1.98
* Step 125: 1.96
### Key Observations
* The most significant feature is the initial rapid decrease in value.
* The sharp peak at Step 42 is a notable outlier.
* After Step 40, the value stabilizes, fluctuating within a narrow range.
* The fluctuations after Step 40 appear somewhat random, with no clear upward or downward trend.
</details>
(c) Number of Logic Verification
Figure 8: Training dynamics during FLV-RL optimization over 120 steps, showing (a) composite reward improvement, (b) response length evolution, and (c) number of logic verification calls.
## Appendix E Analysis of Verification Overhead in Mathematical Reasoning
Table 5: Failure case showing enforced verification overhead in mathematical reasoning (MATH_500). Ground truth answer is 27, but the model predicted 81.
Table 5 illustrates a critical failure mode induced by enforced formal verification. The problem requires finding the smallest perfect cube expressible as the sum of three consecutive integers. The correct answer is 27 (= 8+9+10), yet the model arrives at 81 through fundamentally flawed reasoning. The root cause lies in the verification paradigm's cognitive overhead: rather than directly computing $n^{3}$ for small values and checking $n^{3}=3k$ (which immediately yields $3^{3}=27$), the model constructs an unnecessarily complex z3-solver script that obscures the arithmetic. Notice how the code attempts to verify $3\times 27^{3}=(3k)^{3}$, a nonsensical constraint that conflates the problem statement (finding a cube equal to $3k$) with an arbitrary symbolic manipulation. The verification framework, instead of catching this error, produces a "DISPROVED" output that the model then rationalizes away ("upon closer inspection, it confirmed that 81 works"), demonstrating how mandatory verification can paradoxically reduce error-detection capability. This case exemplifies why formal verification tools become liabilities in computational contexts: they introduce syntactic complexity (z3 constraint formulation) that distracts from semantic correctness (direct enumeration: $1^{3}=1\neq 3k$, $2^{3}=8\neq 3k$, $3^{3}=27=3\times 9\checkmark$), ultimately degrading performance on problems solvable through elementary arithmetic. The flexible verification strategy addresses this by permitting direct calculation during reasoning, relegating formal tools to post-hoc validation roles where their rigor provides genuine value rather than procedural friction.
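The elementary enumeration described above can be sketched in a few lines (a minimal illustration of the direct approach, not the paper's actual pipeline):

```python
# Direct enumeration, as the text suggests: the sum of three consecutive
# integers (k-1) + k + (k+1) equals 3k, so we want the smallest positive
# perfect cube divisible by 3.
def smallest_cube_sum_of_three_consecutive():
    n = 1
    while True:
        cube = n ** 3          # 1, 8, 27, ...
        if cube % 3 == 0:      # cube = 3k with k = cube // 3
            return cube
        n += 1

result = smallest_cube_sum_of_three_consecutive()
k = result // 3
print(result, "=", k - 1, "+", k, "+", k + 1)  # 27 = 8 + 9 + 10
```

Three iterations suffice, which is exactly why the z3 detour in Table 5 adds overhead without value here.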
## Appendix F Analysis of Package Usage Distribution
Table 6: Categorization of Python packages by problem-solving paradigm.
| Category | Description | Representative Packages | Problem-Solving Paradigm |
| --- | --- | --- | --- |
| Symbolic & Logic | Handles abstract symbols, constraint satisfaction, formal logic, and graph structures. | sympy, z3-solver, networkx, constraint | Reasoning: Abstract deduction & proofs. |
| Numerical & Scientific | Performs high-precision arithmetic, matrix operations, and statistical analysis. | numpy, math, scipy, pandas, fractions | Calculation: Quantitative modeling. |
| Algorithmic & Search | Focuses on combinatorial generation, iteration, and discrete optimization strategies. | itertools, collections, random, heapq | Search: Brute-force & simulation. |
| Domain & Utilities | Tools for specific non-mathematical domains (text, time, web) and system operations. | datetime, re, requests, nltk, bs4 | Knowledge: Information retrieval. |
<details>
<summary>x11.png Details</summary>

### Visual Description
\n
## Bar Chart: Package Usage Comparison
### Overview
This bar chart compares the percentage of package usage for two different systems, SimpleTIR (blue) and FLV-GRPO (red), across five package categories. The y-axis represents the percentage of package usage, while the x-axis represents the package category.
### Components/Axes
* **X-axis Title:** Package Category
* **Y-axis Title:** Percentage of Package Usage (%)
* **Categories:**
* Symbolic & Logic
* Algorithmic & Search
* Numerical & Scientific
* Text & NLP
* Domain & Utils
* **Data Series:**
* SimpleTIR (Blue)
* FLV-GRPO (Red)
* **Legend:** Located in the top-right corner, clearly labeling the two data series with their corresponding colors.
### Detailed Analysis
Let's analyze each category and the corresponding data points:
1. **Symbolic & Logic:**
* SimpleTIR: The blue bar rises to approximately 42.5%.
* FLV-GRPO: The red bar rises to approximately 62.5%.
2. **Algorithmic & Search:**
* SimpleTIR: The blue bar rises to approximately 20.2%.
* FLV-GRPO: The red bar rises to approximately 6.5%.
3. **Numerical & Scientific:**
* SimpleTIR: The blue bar rises to approximately 21.8%.
* FLV-GRPO: The red bar rises to approximately 20.4%.
4. **Text & NLP:**
* SimpleTIR: The blue bar rises to approximately 4.4%.
* FLV-GRPO: The red bar rises to approximately 0%.
5. **Domain & Utils:**
* SimpleTIR: The blue bar rises to approximately 9.9%.
* FLV-GRPO: The red bar rises to approximately 10.0%.
### Key Observations
* FLV-GRPO demonstrates significantly higher usage in the "Symbolic & Logic" category compared to SimpleTIR.
* SimpleTIR shows higher usage in the "Algorithmic & Search" category than FLV-GRPO.
* Usage in "Numerical & Scientific" is relatively similar between the two systems.
* SimpleTIR has some usage in "Text & NLP" while FLV-GRPO has none.
* Usage in "Domain & Utils" is nearly identical between the two systems.
### Interpretation
The data suggests that FLV-GRPO is more heavily reliant on packages related to symbolic and logic operations, while SimpleTIR utilizes packages related to algorithms and search more frequently. The similar usage in "Numerical & Scientific" and "Domain & Utils" indicates a shared need for these types of packages in both systems. The complete absence of "Text & NLP" package usage in FLV-GRPO could indicate that this system does not require or support natural language processing functionalities, or that these functionalities are implemented differently. The differences in package usage likely reflect the distinct design goals and application areas of SimpleTIR and FLV-GRPO. The chart provides a clear comparison of the package dependencies of the two systems, which could be useful for understanding their strengths and weaknesses, and for identifying potential areas for collaboration or integration.
</details>
Figure 9: Comparison of package usage distribution: SimpleTIR vs. FLV-RL
The comparison between SimpleTIR and FLV-RL reveals a fundamental shift in problem-solving paradigms, moving from iterative search to abstract reasoning. As defined in Table 6, the Symbolic & Logic category encompasses tools for constraint satisfaction and formal proofs (e.g., z3-solver, sympy). The chart demonstrates a substantial increase in this category, rising from 42.5% in SimpleTIR to 62.5% in FLV-RL (Figure 9). This suggests that the FLV-RL model is increasingly relying on mathematical abstraction and logical deduction to solve problems rather than procedural code.
Conversely, usage of the Algorithmic & Search category, defined as handling permutations and iterative loops (e.g., itertools), drops precipitously from 20.2% to 6.5%. This inverse correlation indicates that FLV-RL has largely abandoned brute-force simulation and exhaustive search strategies. Instead of generating candidate solutions through iteration, the model prefers to model the problem space symbolically and solve for the answer directly.
Meanwhile, the Numerical & Scientific and Domain & Utils categories remain relatively stable across both models (approximately 21% and 10%, respectively). This implies that while the core reasoning engine has evolved (shifting from search to logic), the foundational requirements for arithmetic calculation and environment interaction remain constant regardless of the solving strategy. Overall, these metrics quantify a qualitative leap: the model has transitioned from a "compute-heavy" search approach to a "reasoning-heavy" symbolic approach.
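To make the paradigm contrast concrete, the following toy problem solved both ways illustrates the shift; the problem and code are our illustration and do not come from either system:

```python
# Hypothetical illustration of the two paradigms on one toy problem:
# find the positive integer x with x**2 + x == 56.
from sympy import symbols, solve

# Search paradigm (itertools-style brute force): iterate over candidates
brute = next(x for x in range(1, 100) if x * x + x == 56)

# Symbolic paradigm (sympy-style): state the constraint and solve directly
x = symbols('x', positive=True, integer=True)
symbolic = solve(x**2 + x - 56, x)

print(brute, symbolic[0])
```

Both routes find x = 7, but the symbolic route scales to problems where enumeration is infeasible, which is consistent with the usage shift shown in Figure 9.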
## Appendix G GPQA Problems
Zeng (2026) identifies GPQA (Diamond) as having an inherent error rate lower bound of 26.8% and characterizes it as "an advanced intellectual booby trap from the old era." The systematic flaws include: (i) Factual errors in questions/answers. (ii) Missing parameters needed for correct solutions. (iii) Transcription mistakes in data. The report concludes that GPQA has transformed "from a ruler for measuring intelligence into a noise generator" that measures how well models fit to logically flawed questions rather than actual reasoning ability.
## Appendix H Case Study: Formal Verification in Economic Reasoning
Table 7: Error Analysis: Baseline model (Qwen 2.5-7B) reasoning trajectory with systematic failures
| Question |
| --- |
| As we move to the northwest (up and to the left) along a linear demand curve, elasticity |
| a. goes from inelastic to elastic and the line's slope is constant |
| b. goes from elastic to inelastic and the line's slope is constant |
| c. is constant and so is the line's slope |
| d. goes from inelastic to elastic and the line's slope rises in absolute terms |
| e. is constant and the line's slope falls in absolute terms |
| Answer Generated by Qwen-2.5-7B |
| The correct answer is: |
| b. goes from elastic to inelastic and the line's slope is constant. |
| Here's the reasoning: |
| Elasticity of Demand: Elasticity of demand measures the responsiveness of the quantity demanded to a change in price. It is calculated as the percentage change in quantity demanded divided by the percentage change in price. Along a linear demand curve, as you move from right to left (decreasing price and increasing quantity), the elasticity of demand changes. |
| Slope of the Demand Curve: The slope of a linear demand curve is constant, meaning the change in price per unit change in quantity is the same throughout the curve. |
| Elasticity and Position on the Demand Curve: |
| At higher prices (and lower quantities), the demand is more elastic because a small change in price leads to a relatively larger change in quantity demanded. |
| As you move to the left along the demand curve (decreasing price and increasing quantity), the demand becomes less elastic because the percentage change in quantity demanded becomes smaller relative to the percentage change in price. |
| Therefore, as you move northwest (up and to the left) along a linear demand curve, the elasticity of demand goes from elastic to inelastic, and the slope of the line remains constant. |
To illustrate how formal logic verification improves economic reasoning, we use a standard microeconomics question that exposes typical failure modes of free-form chain-of-thought and shows how verification-guided training corrects them.
### H.1 Problem and Baseline Error Analysis
Problem.
Consider the multiple-choice question: "For a linear demand curve, as we move northwest along the curve, what happens to price elasticity of demand?" In the conventional $(Q,P)$ diagram (quantity on the horizontal axis, price on the vertical axis), "northwest" means higher price and lower quantity.
Baseline behavior.
The baseline model (Qwen 2.5-7B) selects option (b) "goes from elastic to inelastic" and exhibits three systematic failures (Table 7).
Failure Mode 1: Semantic grounding error (directional mis-mapping).
The core mistake is a mis-grounding of the spatial term "northwest." The correct executable semantics in the $(Q,P)$ plane is:
$$
\text{northwest}(1\rightarrow 2)\;\Rightarrow\;(P_{2}>P_{1})\ \wedge\ (Q_{2}<Q_{1}). \tag{11}
$$
The baseline instead behaves as if "northwest" implied the opposite comparative direction (effectively a southeast move), e.g.,
$$
\text{northwest}(1\rightarrow 2)\;\mapsto\;(P_{2}<P_{1})\ \wedge\ (Q_{2}>Q_{1}), \tag{12}
$$
which flips the economic interpretation and deterministically pushes the subsequent reasoning toward the wrong answer.
Failure Mode 2: Undetected logical inconsistency.
The baseline produces mutually incompatible claims without detecting the contradiction (e.g., asserting both that demand is "more elastic at higher prices" and that the alleged "northwest" move reduces elasticity under its mis-grounded direction). Formally, if a reasoning chain asserts propositions $\{\phi_{i}\}_{i=1}^{n}$ , the chain should satisfy global consistency:
$$
\bigwedge_{i=1}^{n}\phi_{i}\not\equiv\bot. \tag{13}
$$
Pure next-token generation optimizes local likelihood and does not enforce Eq. (13), allowing contradictory statements to coexist.
Failure Mode 3: Conceptual conflation (slope vs. elasticity).
The baseline conflates constant slope with constant elasticity. For a linear demand curve
$$
P=a-bQ,\qquad a>0,\;b>0, \tag{14}
$$
the slope $\frac{dP}{dQ}=-b$ is constant, but point price elasticity (in magnitude) is
$$
\lvert\varepsilon(Q,P)\rvert=\left\lvert\frac{dQ}{dP}\right\rvert\frac{P}{Q}=\frac{1}{b}\cdot\frac{P}{Q}, \tag{15}
$$
which varies with the ratio $\frac{P}{Q}$ . Along the same line, moving northwest increases $P$ and decreases $Q$ , hence increases $\frac{P}{Q}$ and therefore increases $\lvert\varepsilon\rvert$ .
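This can be double-checked symbolically; the following sympy sketch (our illustration, under the linear-demand assumption of Eq. (14)) confirms that the slope is constant while the elasticity magnitude varies along the line:

```python
# Verify: along P = a - b*Q, the slope dP/dQ is the constant -b, while
# |eps| = (1/b) * P/Q has a nonzero derivative in Q, so it varies.
from sympy import symbols, diff, simplify

a, b, Q = symbols('a b Q', positive=True)
P = a - b * Q
eps = (1 / b) * (P / Q)           # elasticity magnitude from Eq. (15)

print(diff(P, Q))                 # -b: constant slope
print(simplify(diff(eps, Q)))     # -a/(b*Q**2): strictly negative, so |eps|
                                  # rises as Q falls (a northwest move)
```

Since the derivative of $\lvert\varepsilon\rvert$ in $Q$ is strictly negative for $a,b,Q>0$, decreasing $Q$ (moving northwest) must increase $\lvert\varepsilon\rvert$.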
### H.2 Verification-Guided Correction
Our framework corrects these errors by interleaving generation with SMT-based verification (Table 8). We summarize four mechanisms.
Mechanism 1: Executable semantic grounding.
We train the model to translate ambiguous natural language into solver-checkable constraints. For this task, the semantic parser is encouraged to map "northwest" to Eq. (11) (rather than Eq. (12)), together with domain assumptions typical in economics:
$$
P_{1}>0,\ Q_{1}>0,\ P_{2}>0,\ Q_{2}>0,\ \text{and}\ P_{i}=a-bQ_{i}\ \text{for}\ i\in\{1,2\}. \tag{16}
$$
Incorrect mappings are rejected because they become inconsistent with the intended move or with the economic domain; the solver provides immediate feedback via $\textsc{sat}/\textsc{unsat}$ .
One simple way to incorporate verification is to weight learning by verification success:
$$
\mathcal{L}_{\text{ground}}=-\mathbb{E}_{x\sim\mathcal{D}}\Bigl[\mathbb{I}\bigl(\text{verify}(f_{\theta}(x))=\textsc{sat}\bigr)\cdot\log P_{\theta}\bigl(f_{\theta}(x)\mid x\bigr)\Bigr], \tag{17}
$$
where $f_{\theta}$ is the model-produced formalization and $\text{verify}(\cdot)$ calls the SMT solver.
Mechanism 2: Global consistency enforcement over a chain.
Instead of allowing each step to stand alone, we require that the accumulating set of claims remains satisfiable under the shared constraints (Eqs. (14)-(16)). Concretely, we can test competing hypotheses such as:
$$
H_{\downarrow}:\lvert\varepsilon_{2}\rvert<\lvert\varepsilon_{1}\rvert \tag{18}
$$
under the "northwest" constraints. The solver returns unsat for $H_{\downarrow}$ (given standard domain assumptions), steering the model away from inconsistent chains and toward the correct alternative:
$$
H_{\uparrow}:\lvert\varepsilon_{2}\rvert>\lvert\varepsilon_{1}\rvert. \tag{19}
$$
Mechanism 3: Counterexample-driven learning (numerical witnesses).
When a hypothesis fails, the solver can provide concrete satisfying assignments for the correct hypothesis, serving as a numerical witness that bridges symbolic proof and intuition. For example, pick $b>0$ and two points on the same demand curve with $P_{2}>P_{1}$ and $Q_{2}<Q_{1}$ , such as $(P_{1},Q_{1})=(6,4)$ and $(P_{2},Q_{2})=(8,2)$ . Then by Eq. (15),
$$
\lvert\varepsilon_{1}\rvert=\frac{1}{b}\cdot\frac{6}{4}=\frac{1.5}{b},\qquad\lvert\varepsilon_{2}\rvert=\frac{1}{b}\cdot\frac{8}{2}=\frac{4.0}{b}, \tag{20}
$$
hence $\lvert\varepsilon_{2}\rvert>\lvert\varepsilon_{1}\rvert$ . Training on triples $(\text{hypothesis},\text{verification result},\text{witness})$ provides richer supervision than final-answer labels alone.
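The witness can be recomputed in a few lines; note the intercept $a=10$ and slope $b=1$ are implied by the two chosen points:

```python
# Recompute the numerical witness of Eq. (20) and check it lies on one line.
P1, Q1 = 6, 4
P2, Q2 = 8, 2
b = (P2 - P1) / (Q1 - Q2)   # slope magnitude from the two points: 1.0
a = P1 + b * Q1             # intercept: 10.0
assert P2 == a - b * Q2     # both points satisfy P = a - b*Q

eps1 = (1 / b) * P1 / Q1    # 1.5
eps2 = (1 / b) * P2 / Q2    # 4.0
print(eps1, eps2, eps2 > eps1)  # 1.5 4.0 True
```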
Mechanism 4: Compositional algebraic reasoning with verified substeps.
We structure the explanation into verifiable primitives: (i) linear constraint $P=a-bQ$ ; (ii) derivative $\frac{dQ}{dP}=-\frac{1}{b}$ ; (iii) elasticity definition $\lvert\varepsilon\rvert=\left\lvert\frac{dQ}{dP}\right\rvert\frac{P}{Q}$ ; (iv) monotonicity: under $P_{2}>P_{1}$ and $Q_{2}<Q_{1}$ , we have $\frac{P_{2}}{Q_{2}}>\frac{P_{1}}{Q_{1}}$ , therefore $\lvert\varepsilon_{2}\rvert>\lvert\varepsilon_{1}\rvert$ . This decomposition yields reusable, solver-checkable reasoning components rather than brittle pattern matching.
Table 8: Formal logic verification-guided reasoning trajectory.
## Appendix I Prompts
This section presents the key prompts used throughout our training pipeline. Table 9 shows the reinforcement learning rollout prompt that guides models to generate formal verification-augmented reasoning during policy optimization. Listing 1 details the prompt used for formal logic verification-guided reasoning dataset construction.
Table 9: Prompt for Model Training
| Solve the following problem step by step. You can selectively use Python with z3 (for logic) and sympy (for calculations) to verify reasoning. Python code will run in an external sandbox, returning output as <interpreter>output</interpreter>. The python code should be complete scripts, including necessary imports. |
| --- |
| Revise reasoning if sandbox returns "disproved" or fix code if execution errors occur. |
| Code Format: |
| Each code snippet is wrapped with |
| <code> |
| ```python |
| code snippet |
| ``` |
| </code> |
| Response must end exactly as: |
| <answer> |
| [Summary of all reasoning steps] |
| \\boxed{[Final answer]} |
| </answer> |
| [Question] |
You are a helpful AI assistant. Initially, when solving a question, you would need to think step by step, without the ability to use code for calculation or logical verification. Now, you have enhanced capabilities to write code for both computational tasks and logical reasoning verification. The code will be executed by a sandbox, and the result can be returned to enhance your reasoning process.
**Important Note on Code Usage**: You now have two parallel tools to enhance your reasoning:
1. **Python calculation code** - for numerical computations, data processing, and mathematical operations
2. **Z3 logical verification** - for verifying logical reasoning, constraints, and formal proofs using the Z3 theorem prover
These are complementary tools serving different purposes. Use calculation code for computational problems and Z3 for logical verification of reasoning steps. **Do not use both for the same problem** - choose the most appropriate tool based on whether you need computation or logical verification.
The thinking process can have multiple code snippets. Each code snippet is wrapped with:
<code>
```python
code snippet
```
</code>, and should be executable. The returned result is wrapped with <interpreter> execution results </interpreter>.
Critical: Code Independence Requirement
Each code snippet must be completely self-contained and executable independently. This means:
- Each code block should include all necessary imports
- Each code block should define all variables it uses (do not rely on variables from previous code blocks)
- Each code block should be able to run successfully if executed in isolation
- If you need values from previous calculations, redefine or recalculate them in the new code block
- Think of each <code> block as being executed in a fresh Python environment
**Guidelines for Z3 Usage:**
Z3 verification should ONLY be used when it provides genuine formal verification value:
Do NOT use Z3 for:
- Verifying simple arithmetic calculations (e.g., 2 + 2 = 4, or 1.0 * (-2.0 - 4.0) = -6.0)
- Checking calculations with concrete numbers that Python already computed
- Adding concrete values as constraints and then verifying them (this is circular reasoning)
- Repeating what numerical computation already verified
DO use Z3 for:
- Proving general mathematical properties or identities that hold for ALL values (using symbolic variables)
- Verifying complex logical relationships with multiple interrelated constraints
- Checking satisfiability of constraint systems or finding whether solutions exist
- Proving inequalities or optimization conditions symbolically
- Verifying that no counterexamples exist for a general claim
- Formal verification of logical reasoning steps that involve symbolic relationships
- Key principle: Use symbolic variables (not concrete numbers) to prove general statements. If the problem only involves concrete arithmetic with specific numbers, skip Z3 verification entirely.
Examples of INCORRECT Z3 usage:
# BAD: Verifying concrete arithmetic
solver.add(m == 1.0)
solver.add(v_i == 4.0)
solver.add(delta_p == -6.0)
solver.check()  # This just checks if -6.0 == -6.0, no value added
Examples of CORRECT Z3 usage:
# GOOD: Proving general algebraic equivalence
from z3 import * # Remember: each code block needs its imports
m, v1, v2 = Reals('m v1 v2')
solver = Solver()
solver.add(m > 0)  # General constraint, not specific value
# Prove: Delta_p = m * (v2 - v1) is equivalent to Delta_p = m * v2 - m * v1 for ALL values
solver.add(m * (v2 - v1) != m * v2 - m * v1)
result = solver.check()
print("Verification result:", result)
assert result == unsat  # Proves no counterexample exists
**Guidelines for Python Calculation Code Usage:**
When NOT to use code:
- Simple arithmetic that can be done mentally
- Basic algebra or formula substitution
- Straightforward unit conversions
- Verifying obvious mathematical identities
- Problems where all steps are elementary calculations
Code Usage Limit:
- For problems solvable with basic math: Use code AT MOST 1-2 times (or not at all)
- For complex computational problems: Use code AT MOST 3-4 times
- Each code block should serve a distinct, necessary purpose
- **Never use multiple code blocks to verify the same calculation in different ways**
Goal:
Modify the original thinking process to make it more accurate by:
- Replacing manual calculation steps with Python code snippets and their execution results
- Adding Z3 logical verification when it provides genuine formal verification beyond simple arithmetic
- Keeping the core reasoning logic intact, including any unsuccessful attempts
- Adding code only where it provides genuine value
- Ensuring each code snippet serves a unique, necessary purpose
- Use Python code or Z3 verification for a combined total of no more than 4 times
- Wrap the revised thinking process within <revised_thinking_process> and </revised_thinking_process>.
User Question:
{question}
Original Thinking Process (without code interpreter's support):
<original_thinking_process> {original_response} </original_thinking_process>
Details:
1. Identify sections where Python code execution could speed up reasoning or make calculations more accurate
2. Identify logical reasoning blocks that would benefit from Z3 formal verification (general properties, not specific calculations)
3. Replace manual calculation steps with code snippets and corresponding interpreter's execution results
4. Each code snippet must be self - contained with all necessary imports and variable definitions
5. Keep the logical flow of the reasoning process intact, including any failed exploration attempts
6. Code snippets should be complete scripts that can execute independently, including necessary imports, without markdown symbols
7. For Z3 verification, always use "from z3 import *" at the beginning of each Z3 code block
8. Outputs in code snippets must explicitly call the print function
9. Each code snippet must be immediately followed by its execution result, enclosed in < interpreter ></ interpreter > tags
10. Execution results should match the model's output exactly, with no extra or missing tokens
11. Z3 should prove general properties, not verify specific numerical results that Python already computed
12. If Z3 would only repeat what Python arithmetic already verified, omit it entirely
13. Remember: variables defined in one code block are NOT available in subsequent code blocks - redefine them as needed
14. When performing calculations, format numerical outputs to 2-4 decimal places using round() or f-strings (e.g., print(f"{result:.2f}")) to avoid displaying unnecessary floating-point digits. Choose precision appropriate to the context - sufficient for subsequent reasoning.
Revised Thinking Process (With independent selective Python computation blocks and Z3 formal verification):
Listing 1: Prompt for formal logic verification-guided reasoning chain generation