# Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
**Authors**: Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Zihan Dong, Marco Pavone, Ligong Han, Zhang-Wei Hong
## Abstract
Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9% on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.
‡ Equal advising. Correspondence to: Can Jin <can.jin@rutgers.edu>, Tong Che <tongc@nvidia.com>.
## 1 Introduction
Recently, test-time scaling (TTS) has been explored as an effective method to enhance the reasoning performance of large language models (LLMs) [52, 20, 62, 49, 85, 74, 30, 28, 27]. Specifically, reinforcement learning (RL) methods [52, 20, 60] and search strategies such as Monte Carlo Tree Search (MCTS), beam search, and Best-of-N sampling have been adopted to support TTS on complex reasoning benchmarks [85, 83, 82, 62, 76]. Notably, OpenAI’s o-series models [52] and DeepSeek-R1 [20] demonstrate that large-scale RL training can lengthen and refine the chains of thought (CoT) produced at inference time. The RL training of LLMs is generally guided by outcome reward models (ORMs) [9, 20, 78] and process reward models (PRMs) [35, 37, 68, 69, 60], which provide supervisory signals to improve model performance.
Although DeepSeek-R1 achieves strong TTS performance using only ORMs or rule-based rewards during RL training [20], sparse outcome rewards often degrade training stability and sample efficiency [12, 37, 79, 6]. Search-based TTS methods that use static PRMs trained on labeled step-wise datasets can guide test-time search and improve reasoning performance, but building fine-grained PRMs that score each intermediate reasoning step requires extensive human annotation or LLM-generated pseudo-labels [69, 37, 12]. Furthermore, distributional shifts between a fixed PRM and the continually updated policy can lead to reward hacking [47, 16, 3], thereby undermining the stability and effectiveness of the policy model. Reward hacking also limits the effectiveness of PRMs in search-based TTS, where the PRM serves as the verifier during test-time search.
In this paper, we investigate how to effectively combine RL-based and search-based TTS for complex reasoning. Specifically, to reduce the cost of training high-quality PRMs and to alleviate the reward-hacking risks that static PRMs trained on separate datasets pose during test-time search, we propose AIRL-S, a framework that integrates adversarial inverse reinforcement learning (AIRL) [14, 13] with group relative policy optimization (GRPO) [12, 20] to support long CoT reasoning in LLMs through both RL-based and search-based TTS. During training, AIRL learns a step-wise PRM from the reference rollouts. The policy model is then updated using the combined objectives of the dense rewards in AIRL and the binary outcome rewards in GRPO. During inference, the PRM naturally serves as a verifier that guides the search procedure, extending the reasoning of the policy model. AIRL-S enables the training of generalizable PRMs without requiring any labeled process reward data, thereby significantly reducing the cost of constructing dense reward models and mitigating reward hacking under distributional shift. Furthermore, the PRM is theoretically invariant to environment dynamics [14], allowing it to be reused during inference across different datasets and policy models in search-based TTS.
We evaluate the effectiveness of the LLM policy and the PRM trained with AIRL-S on eight standard reasoning benchmarks spanning mathematics, science, and code generation. The policy model achieves an average performance improvement of 9% over the base model and matches GPT-4o across these tasks. When combined with search-based TTS methods, the PRM surpasses PRMs trained on labeled process data across multiple policy models and datasets. We further pair the PRM with widely used test-time search algorithms to demonstrate its compatibility and effectiveness under varied TTS configurations. Overall, AIRL-S provides an effective and cost-efficient approach for scaling test-time computation in LLMs for complex reasoning tasks.
## 2 Related Works
#### Inverse Reinforcement Learning.
Inverse reinforcement learning (IRL) [1, 58, 14, 13, 40, 41] aims to recover reward functions from expert demonstrations, enabling subsequent policy training through standard RL methods. Classic IRL methods include maximum-margin IRL [1, 58] and probabilistic maximum-entropy IRL [87, 86]. Adversarial IRL (AIRL) [14] reformulated IRL into a GAN-style adversarial game [18], improving generalization and theoretical grounding. Subsequent advances have further streamlined IRL: IQ-Learn optimizes inverse soft- $Q$ functions without explicitly modeling rewards [17]; Inverse Preference Learning uses offline pairwise preferences to avoid explicit reward modeling [21]; "IRL without RL" reframes imitation as supervised regression over trajectory data [65]. Our method leverages AIRL’s disentanglement of reward and policy, extending it to RL for large language models (LLMs) and producing a generalizable reward function suitable for guiding test-time search.
#### RL for LLMs.
Reinforcement learning from human feedback (RLHF) is standard for aligning LLM outputs with user intent [8, 54]. Nonetheless, many open-source reasoning improvement efforts still rely on imitation learning from curated Chain-of-Thought (CoT) datasets [80, 81, 84, 49, 29, 42, 43, 66]. Recent large-scale RL methods using sparse outcome rewards, such as OpenAI o1 [53] and DeepSeek-R1 [20], have achieved significant gains. Specialized RL frameworks for mathematical reasoning, such as Math-Shepherd [69] and DeepSeek-Math-7B-RL [12], utilize dense supervision or extensive RL training to match larger models. Concurrently, PRIME [11] employs "free process rewards" [61] to implicitly derive a token-level process reward from log-likelihood ratios between two LLMs. PRIME is closely related to our work, sharing the fundamental insight of utilizing an implicitly learned reward function from RL training. However, PRIME differs crucially by:
- Employing per-token rewards derived from log-likelihood ratios, which the reward-guided generation literature (discrete GANs, human preference modeling, generation quality evaluation, etc.) [50, 57, 39] suggests is much less effective than our holistic step-wise discriminators.
- Producing a policy-dependent reward function unsuitable for training new policies or guiding external search procedures.
In contrast, our AIRL-based framework yields an actor-independent reward function capable of both optimal policy recovery (as shown theoretically in AIRL [14]) and direct use as a PRM for guiding search algorithms across different LLMs.
#### Test-Time Scaling.
Test-time scaling (TTS) methods enhance reasoning capabilities of fixed LLMs by allocating additional inference-time computation. Parallel TTS approaches aggregate independent samples to improve outcomes [4, 25, 70, 83]. Methods such as Self-Consistency [70], Best-of-N sampling [5, 62], and Beam Search [76, 62] utilize diversity-driven strategies for output selection. Monte-Carlo Tree Search (MCTS) integrates lookahead search guided by learned PRMs, achieving strong reasoning performance [83, 85, 82, 75, 55, 73]. Sequential TTS refines outputs iteratively based on previous attempts [49, 62, 44]. Our work integrates the AIRL-trained PRM directly into popular TTS methods, including MCTS, Beam Search, and Best-of-N sampling, demonstrating superior performance compared to static PRMs trained from labeled data.
## 3 Method
Our objective is to learn a generalizable step-wise PRM that benefits both RL training and test-time search, thereby improving the reasoning accuracy of LLMs. We adopt AIRL and GRPO to train the PRM and the policy model jointly. The learned PRM then guides the search procedure of the policy LLM during inference, yielding additional gains in performance. The detailed pseudo-code for RL training in AIRL-S is presented in Algorithm 1.
### 3.1 Problem Setup
Let $\mathcal{Q}$ be a dataset of questions and $q\in\mathcal{Q}$ a specific question. From the reference rollouts, we obtain a CoT,
$$
C(q)=\{C_{1},C_{2},\ldots,C_{T}\},
$$
where each $C_{i}$ is a reasoning step that leads to the final answer. Our aim is to learn a PRM $r_{\phi}$ that assigns a reward to every step and provides a dense training signal for optimizing the policy model $\pi_{\theta}$ via RL. At inference time, an online-adapted $r_{\phi}$ steers the search process of $\pi_{\theta}$ , enhancing its reasoning ability on new questions.
### 3.2 Data Generation and Replay Buffer
At the start of training, we obtain the reference rollouts for each question $q\in\mathcal{Q}$ by either sampling multiple chains of thought (CoTs) from the current policy $\pi_{\theta}$ or reusing existing CoTs. We assign a binary outcome reward that indicates whether a CoT yields the correct final answer. CoTs with a positive reward are stored in a replay buffer $\mathcal{B}$ and serve as the reference rollouts for training the AIRL reward model. To maintain the reliability and diversity of $\mathcal{B}$ , we periodically remove low-quality entries and add newly discovered high-quality CoTs during training.
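The buffer bookkeeping above can be sketched as follows; the `ReplayBuffer` class, its method names, and the scalar reward convention are illustrative assumptions, not the paper's released implementation:

```python
import random


class ReplayBuffer:
    """Sketch of the reference-rollout buffer B: it retains only CoTs
    whose binary outcome reward is positive (i.e., correct final answer)."""

    def __init__(self):
        self.cots = []  # (question, cot) pairs with correct final answers

    def add_rollouts(self, question, cots, outcome_rewards):
        # Binary outcome reward: positive iff the CoT reaches the correct answer.
        for cot, reward in zip(cots, outcome_rewards):
            if reward > 0:
                self.cots.append((question, cot))

    def sample(self, batch_size):
        # Reference rollouts drawn for the AIRL discriminator update.
        return random.sample(self.cots, min(batch_size, len(self.cots)))
```

Pruning low-quality entries would be one more method over `self.cots`; the essential invariant is that the buffer only ever holds correct CoTs.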
### 3.3 Learning PRM via AIRL
To avoid costly step-wise labels utilized in training static PRMs, we leverage the generative adversarial network guided reward learning in AIRL to train a discriminator that distinguishes the reference rollouts from the policy outputs [14, 18, 13, 50]. In AIRL, the discriminator $D_{\phi}$ is defined over state-action pairs. For LLM reasoning, we represent
- state: the question $q$ together with the preceding reasoning steps $\{C_{1},\dots,C_{i-1}\}$ , and
- action: the current step $C_{i}$ .
The discriminator is:
$$
D_{\phi}(C_{i}\mid q,C_{<i})=\frac{\exp\{f_{\phi}(q,C_{\leq i})\}}{\exp\{f_{\phi}(q,C_{\leq i})\}+\pi_{\theta}(C_{i}\mid q,C_{<i})},
$$
where $f_{\phi}$ is a learned scoring function.
The step-wise reward of $C_{i}$ is then
$$
r_{\phi}(C_{i}\mid q,C_{<i})=\log\frac{D_{\phi}(C_{i}\mid q,C_{<i})}{1-D_{\phi}(C_{i}\mid q,C_{<i})}. \tag{1}
$$
This $r_{\phi}$ serves as the PRM.
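Since $D_{\phi}$ is a logistic function of $f_{\phi}-\log\pi_{\theta}$, Equation (1) collapses to the difference between the learned score and the policy log-likelihood of the step. A minimal scalar sketch (both function names are illustrative):

```python
import math


def airl_step_reward(f_phi: float, log_pi: float) -> float:
    """Step-wise PRM reward r_phi from Equation (1).

    With D = exp(f) / (exp(f) + pi_theta), the logit log(D / (1 - D))
    simplifies to f_phi(q, C_<=i) - log pi_theta(C_i | q, C_<i).
    """
    return f_phi - log_pi


def airl_step_reward_explicit(f_phi: float, log_pi: float) -> float:
    """Same reward via the explicit discriminator, for comparison."""
    d = math.exp(f_phi) / (math.exp(f_phi) + math.exp(log_pi))
    return math.log(d / (1.0 - d))
```

The closed form makes clear why the reward penalizes steps the policy already assigns high probability: the discriminator only awards positive reward when the learned score exceeds the policy log-likelihood.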
We train the discriminator by minimizing
$$
{\cal L}_{\text{AIRL}}=\sum_{i=1}^{T}\Bigl[-\mathbb{E}_{q\sim\mathcal{Q},\,C\sim\pi_{e}(\cdot\mid q)}\!\bigl[\log D_{\phi}(C_{i}\mid q,C_{<i})\bigr]-\mathbb{E}_{q\sim\mathcal{Q},\,C\sim\pi_{\theta}(\cdot\mid q)}\!\bigl[\log\!\bigl(1-D_{\phi}(C_{i}\mid q,C_{<i})\bigr)\bigr]\Bigr], \tag{2}
$$
where $\pi_{e}$ denotes the reference rollout distribution drawn from the replay buffer $\mathcal{B}$ and $\pi_{\theta}$ is the current policy. Updating the discriminator can be seen as updating the reward function $r_{\phi}$ .
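A toy sketch of the discriminator loss in Equation (2), operating on precomputed scalar quantities per step (`airl_discriminator_loss` and the tuple layout are illustrative assumptions; a real implementation would batch these terms in a deep-learning framework):

```python
import math


def airl_discriminator_loss(expert_terms, policy_terms):
    """Toy version of the AIRL discriminator loss in Equation (2).

    Each element is an (f_phi, log_pi) pair for one reasoning step:
    f_phi  = score f_phi(q, C_<=i), log_pi = log pi_theta(C_i | q, C_<i).
    expert_terms come from replay-buffer CoTs, policy_terms from rollouts.
    """
    def D(f_phi, log_pi):
        # D_phi = exp(f) / (exp(f) + pi_theta): the logistic discriminator
        return math.exp(f_phi) / (math.exp(f_phi) + math.exp(log_pi))

    loss = 0.0
    for f, lp in expert_terms:   # -log D_phi on reference (expert) steps
        loss -= math.log(D(f, lp))
    for f, lp in policy_terms:   # -log(1 - D_phi) on policy steps
        loss -= math.log(1.0 - D(f, lp))
    return loss
```

When $f_{\phi}$ matches the policy log-likelihood on both sets, $D_{\phi}=0.5$ everywhere and the loss is maximal; the loss drops as the discriminator learns to score reference steps above policy steps.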
Algorithm 1 AIRL-S
Input Initial LLM policy $\pi_{\theta_{\text{init}}}$ ; question set $\mathcal{Q}$ ; Total iterations $E$ ; initial replay buffer $\mathcal{B}$
Initialize policy model $\pi_{\theta}$ , reference model $\pi_{\text{ref}}$ , old policy $\pi_{\text{old}}$ , and PRM $r_{\phi}$ with $\pi_{\theta_{\text{init}}}$
Update replay buffer $\mathcal{B}$ by collecting correct rollouts for $q\in\mathcal{Q}$ using $\pi_{\theta}$
for iteration = 1, …, $E$ do
Sample a batch of questions $\mathcal{B}_{i}$ from $\mathcal{B}$
Generate a group of policy rollouts: $\{C^{1},...,C^{G}\}\sim\pi_{\theta}(\cdot|q)$ for $q\in\mathcal{B}_{i}$
Update the PRM according to AIRL loss in Equation (2)
Update the policy $\pi_{\theta}$ according to the composite objectives in Equation (6)
Update the old policy $\pi_{\text{old}}$ using $\pi_{\theta}$
Update the replay buffer by adding the correct rollouts to $\mathcal{B}$
end for
Output Optimized policy model $\pi_{\theta}$ and PRM $r_{\phi}$
### 3.4 Policy Training with Combined RL Objectives
The policy model is optimized via RL. In AIRL [14], the objective of the policy model is to maximize the discriminator-derived rewards, thereby "fooling" the discriminator into classifying its rollouts as reference rollouts. The AIRL objective is
$$
{\cal J}_{\mathrm{AIRL}}(\theta)=\mathbb{E}_{q\sim\mathcal{Q},\,C\sim\pi_{\theta}(\cdot\mid q)}\Biggl[\sum_{i=1}^{|C|}\min\Bigl(\frac{\pi_{\theta}(C_{i}\mid q,C_{<i})}{\pi_{\mathrm{old}}(C_{i}\mid q,C_{<i})}A_{i},\mathrm{clip}\bigl(\tfrac{\pi_{\theta}(C_{i}\mid q,C_{<i})}{\pi_{\mathrm{old}}(C_{i}\mid q,C_{<i})},\,1-\epsilon,\,1+\epsilon\bigr)A_{i}\Bigr)\Biggr], \tag{3}
$$
where $C$ is a chain of thought (CoT) produced by $\pi_{\theta}$ , $|C|$ is its length, $\pi_{\text{old}}$ is the sampling policy, and $A_{i}$ is the advantage at step $i$ , estimated with a REINFORCE-style method [72].
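The clipped per-step surrogate inside Equation (3) can be sketched on scalar inputs as follows (the function name and scalar interface are illustrative; in practice this is computed over batched tensors):

```python
import math


def clipped_surrogate(log_pi_new: float, log_pi_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """Per-step clipped surrogate from Equation (3), on scalars.

    ratio = pi_theta / pi_old for one step; taking the min with the
    clipped branch keeps the update pessimistic outside [1-eps, 1+eps].
    """
    ratio = math.exp(log_pi_new - log_pi_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(ratio, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)
```

Note the asymmetry: a large ratio with a positive advantage is capped at $1+\epsilon$, but with a negative advantage the unclipped (more pessimistic) branch wins.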
DeepSeek-R1 achieves strong reasoning performance by pairing binary outcome rewards with group relative policy optimization (GRPO) [20]. For a group of $G$ CoTs $\{C^{k}\}_{k=1}^{G}$ , we adopt the same outcome-reward-based GRPO, whose objective is
$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\;\mathbb{E}_{q\sim\mathcal{Q},\,\{C^{k}\}_{k=1}^{G}\sim\pi_{\mathrm{old}}(\cdot\mid q)}\\
\Biggl[\frac{1}{G}\sum_{k=1}^{G}\bigl(\min\Bigl(\frac{\pi_{\theta}(C^{k}\mid q)}{\pi_{\mathrm{old}}(C^{k}\mid q)}A^{k},\mathrm{clip}\bigl(\tfrac{\pi_{\theta}(C^{k}\mid q)}{\pi_{\mathrm{old}}(C^{k}\mid q)},\,1-\epsilon,\,1+\epsilon\bigr)A^{k}\Bigr)-\beta\,\mathrm{D}_{\mathrm{KL}}\!\bigl(\pi_{\theta}\|\pi_{\mathrm{ref}}\bigr)\Bigr)\Biggr], \tag{4}
$$
$$
\mathrm{D}_{\mathrm{KL}}\!\bigl(\pi_{\theta}\|\pi_{\mathrm{ref}}\bigr)=\frac{\pi_{\mathrm{ref}}(C^{k}\mid q)}{\pi_{\theta}(C^{k}\mid q)}-\log\!\frac{\pi_{\mathrm{ref}}(C^{k}\mid q)}{\pi_{\theta}(C^{k}\mid q)}-1, \tag{5}
$$
where $\pi_{\text{ref}}$ is a frozen reference model, and $\epsilon$ and $\beta$ are hyper-parameters. The advantage $A^{k}$ for each CoT in the group is computed from the binary outcome rewards.
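This KL penalty is the non-negative "k3" estimator used by GRPO: $\pi_{\mathrm{ref}}/\pi_{\theta}-\log(\pi_{\mathrm{ref}}/\pi_{\theta})-1$, evaluated on samples from $\pi_{\theta}$. A minimal scalar sketch (the function name is illustrative):

```python
import math


def kl_k3(log_pi_theta: float, log_pi_ref: float) -> float:
    """Non-negative "k3" KL estimator for D_KL(pi_theta || pi_ref):
    ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta.

    Since x - log(x) - 1 >= 0 with equality only at x = 1, each sample
    contributes a non-negative penalty, zero iff the probabilities match.
    """
    log_ratio = log_pi_ref - log_pi_theta
    return math.exp(log_ratio) - log_ratio - 1.0
```

The estimator is preferred over the naive $-\log(\pi_{\mathrm{ref}}/\pi_{\theta})$ sample because it is always non-negative while remaining unbiased in expectation.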
We define a composite objective that combines the AIRL objective and GRPO objective to incorporate both the intermediate step rewards and the outcome rewards, and update the policy model to maximize the combined RL objectives:
$$
{\cal J}(\theta)=\lambda\,{\cal J}_{\mathrm{AIRL}}(\theta)+(1-\lambda)\,{\cal J}_{\mathrm{GRPO}}(\theta), \tag{6}
$$
where $\lambda$ is a hyperparameter to balance the outcome rewards and process rewards.
The training alternates between (i) updating the discriminator $D_{\phi}$ to distinguish reference rollouts from policy rollouts, and (ii) optimizing the policy by maximizing the composite objectives of AIRL and GRPO in Equation (6).
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparison of RL-Based and Search-Based Test-Time Scaling (TTS) Methods
### Overview
The image is a technical diagram illustrating two distinct paradigms for Test-Time Scaling (TTS) in AI models, specifically for solving reasoning tasks (indicated by "Question q"). The top section details a Reinforcement Learning (RL) approach, while the bottom section details three Search-based approaches. The diagram uses flowcharts, tree structures, and a legend to explain the processes, components, and data flow for each method.
### Components/Axes
The diagram is divided into two primary sections by a horizontal dashed line.
**Top Section: RL Base TTS**
* **Main Flow:** `Question q` → `Policy π_θ` → `Policy Rollouts (Sampled CoTs)` → `AIRL Discriminator (Step-wise reward r_φ)` → `Outcome Reward J_GRPO` & `Step-wise Reward J_AIRL` → `Policy Update`.
* **Parallel Input:** `Question q` also leads to `Reference CoTs` (depicted as a document icon).
* **Feedback Loop:** The `Policy Update` feeds back into the `Policy π_θ`.
* **Key Labels:** `Policy π_θ`, `Policy Rollouts`, `Sampled CoTs`, `Reference CoTs`, `AIRL Discriminator`, `Step-wise reward r_φ`, `Outcome Reward J_GRPO`, `Step-wise Reward J_AIRL`, `Policy Update`.
**Bottom Section: Search Base TTS**
This section is further divided into three columns, each representing a different search strategy applied to `Question q`.
1. **Left Column: Best-of-N**
* **Process:** `Question q` branches into multiple paths (red and blue lines) leading to several `Full Solution` nodes (black circles).
* **Text Description:** "Generate N solutions and use PRM to select the best."
2. **Middle Column: Beam Search**
* **Process:** `Question q` leads to a tree structure where paths (red and blue lines) branch through multiple layers of `Intermediate Step` nodes (white circles). The tree is pruned at each level.
* **Text Description:** "PRM ranks and retains top-N steps per decision."
3. **Right Column: MCTS (Monte Carlo Tree Search)**
* **Process:** A four-phase cyclic process depicted with four sub-diagrams connected by green arrows:
* **Selection:** "Select nodes by UCT score." (Shows a tree with a path highlighted in blue).
* **Expansion:** "Expand the tree by generating steps." (Shows a new node added to the tree).
* **Simulation:** "Simulate value by extending nodes." (Shows a dotted line extending from a leaf node).
* **Backpropagation:** "Backpropagate to update the tree." (Shows arrows moving back up the tree).
* **Key Labels:** `Selection`, `Expansion`, `Simulation`, `Backpropagation`, `UCT score`.
**Legend (Bottom-Left Corner):**
* `--- Apply PRM` (Dashed box)
* `○ Rejected Step` (Orange circle)
* `○ Selected Step` (Blue circle)
* `○ Intermediate Step` (White circle)
* `● Full Solution` (Black circle)
### Detailed Analysis
**RL Base TTS Flow:**
The process begins with a question (`q`). A policy model (`π_θ`) generates multiple reasoning paths, referred to as "Sampled CoTs" (Chains of Thought), through policy rollouts. These are evaluated by an "AIRL Discriminator" which provides a step-wise reward signal (`r_φ`). This reward is used to compute two objective functions: an outcome-based reward (`J_GRPO`) and a step-wise reward (`J_AIRL`). These rewards drive a "Policy Update," creating a closed-loop learning system. A set of "Reference CoTs" is also provided as a potential input or benchmark.
**Search Base TTS Flow:**
All three methods start with the same `Question q` but employ different search algorithms to find a solution.
* **Best-of-N:** A parallel generation approach. Multiple complete solutions are generated independently. A Process Reward Model (PRM) is then applied (indicated by dashed boxes around the final nodes) to score and select the single best full solution.
* **Beam Search:** A sequential, breadth-first search with pruning. At each decision step (layer of the tree), the PRM ranks all possible next steps. Only the top-N most promising steps (blue "Selected Step" nodes) are retained for further expansion, while others (orange "Rejected Step" nodes) are discarded.
* **MCTS:** A tree search algorithm that balances exploration and exploitation. It iteratively performs four steps: 1) **Selection** of promising leaf nodes using the UCT (Upper Confidence bounds for Trees) formula, 2) **Expansion** of the tree by adding new child nodes (steps), 3) **Simulation** (or rollout) from the new node to estimate its value, and 4) **Backpropagation** of the simulation result up the tree to update node values.
### Key Observations
1. **Paradigm Contrast:** The diagram explicitly contrasts a *learning-based* approach (RL Base TTS, which updates a policy model) with *inference-time search* approaches (Search Base TTS, which uses a fixed model but searches for better outputs).
2. **Role of PRM:** The Process Reward Model (PRM) is a critical component in all three search-based methods, used for scoring and selecting steps or final solutions. Its application is visually marked by dashed boxes in the Best-of-N and Beam Search diagrams.
3. **Granularity of Reward:** The RL method uses both outcome-level (`J_GRPO`) and step-level (`J_AIRL`) rewards. The search methods inherently operate at a step-level granularity through their ranking and selection mechanisms.
4. **Visual Coding:** The legend is essential for interpreting the search diagrams. Blue lines/nodes represent the selected, high-reward path, while red lines/orange nodes represent rejected or lower-reward paths. The flow of the MCTS cycle is clearly indicated by large green arrows.
### Interpretation
This diagram serves as a conceptual framework for understanding how different TTS strategies operate. It suggests that improving AI reasoning at inference time can be approached either by **training a better policy** (RL Base) that internalizes the reward signal, or by **employing a more sophisticated search** (Search Base) around a fixed policy to find optimal reasoning paths.
The **RL Base TTS** represents an "end-to-end" learning paradigm where the model's parameters are directly optimized based on reward signals derived from both the final outcome and intermediate steps. This is akin to learning a skill through practice and feedback.
The **Search Base TTS** methods represent a "plan-and-execute" paradigm. They treat the base model as a black-box function generator and use algorithmic search (Best-of-N, Beam Search, MCTS) to navigate the vast space of possible reasoning chains. The PRM acts as a heuristic guide, similar to a value function in classic search problems. MCTS is notably the most complex, incorporating elements of both exploration (trying new steps) and exploitation (refining known good paths).
The key implication is a trade-off: RL-based methods may offer faster inference once trained but require costly training and careful reward shaping. Search-based methods can be applied to any existing model without retraining but incur higher computational costs at inference time due to the search process. The diagram effectively maps out the architectural choices and core mechanisms involved in this active area of AI research.
</details>
Figure 1: Overview of AIRL-S. During training, AIRL-S uses the AIRL discriminator to learn a PRM and optimizes the policy with both dense rewards from AIRL and outcome rewards from GRPO. At test time, the trained policy and PRM jointly guide downstream search algorithms.
### 3.5 PRM-guided Test-Time Search
#### Test-Time Search.
At inference time, search-based TTS explores multiple CoTs rather than generating a single pass. Approaches such as Best-of-N [62], beam search [76], and Monte Carlo Tree Search (MCTS) [85] improve reasoning by leveraging outcome or process reward models to guide exploration. Because our reward function $r_{\phi}$ is learned and adapted on the policy LLM during RL, we can seamlessly reuse it as a PRM to steer the search procedure. When the policy proposes several actions, either intermediate steps or complete solutions, we score each candidate with $r_{\phi}$ and retain those with the highest scores according to the chosen search strategy. Figure 1 illustrates how $r_{\phi}$ integrates with Best-of-N, beam search, and MCTS; implementation details are provided in Appendix A.
- Best-of-N. For a question $q$ , we sample $N$ full solutions from $\pi_{\theta}(\cdot\mid q)$ and select the one with the highest PRM aggregation score (defined below).
- Beam Search. With beam size $N$ and beam width $M$ , we extend each beam node by $M$ candidate steps, rank those steps by $r_{\phi}$ , and keep the top $N$ nodes. The process continues until we obtain $N$ complete solutions.
- MCTS. Using expansion width $M$ , we select a node by the upper confidence bound for trees (UCT) criterion, which combines PRM rewards with visit counts [63, 31]. We then generate $M$ child steps, simulate rollouts using $r_{\phi}$ to estimate node values, and back-propagate the rewards. The search returns $N$ full solutions.
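The beam-search variant above can be sketched as follows, with hypothetical `propose` (the policy LLM's step generator) and `prm_score` (the learned $r_{\phi}$) callables; both names and the `<eos>` completion marker are illustrative assumptions:

```python
def beam_search(question, propose, prm_score, N=4, M=4, max_depth=8):
    """PRM-guided beam search sketch.

    propose(question, steps, M) -> up to M candidate next steps (str),
    where a step ending in "<eos>" marks a complete solution.
    prm_score(question, steps) -> PRM reward r_phi of the last step.
    Returns up to N complete solutions (lists of steps).
    """
    beams, finished = [[]], []
    for _ in range(max_depth):
        if not beams:
            break
        candidates = []
        for steps in beams:
            for step in propose(question, steps, M):
                new = steps + [step]
                candidates.append((prm_score(question, new), new))
        # Rank all expansions by PRM reward, keep the top N beams.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, steps in candidates:
            if steps[-1].endswith("<eos>"):
                finished.append(steps)
                if len(finished) >= N:
                    return finished
            else:
                beams.append(steps)
                if len(beams) >= N:
                    break
    return finished or beams
```

Best-of-N is the degenerate case where `propose` emits full solutions in one shot; MCTS replaces the fixed top-N pruning with UCT-based selection and backpropagated values.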
#### PRM Aggregation.
After search, we obtain $N$ candidate solutions $\{C^{k}\}_{k=1}^{N}$ for each question $q$ . Following Snell et al. [62], for every solution $C^{k}$ we compute a step-wise aggregation score
$$
s\!\bigl(C^{k}\bigr)=\min_{i}r_{\phi}\!\bigl(C^{k}_{i}\bigr),
$$
namely, the smallest PRM reward among its steps. Let $A^{k}$ be the final answer produced by $C^{k}$ . For inter-answer aggregation, we sum the step-wise scores of all solutions that yield the same distinct answer $a$ :
$$
S(a)=\sum_{k:\,A^{k}=a}\,s\!\bigl(C^{k}\bigr).
$$
The answer with the largest $S(a)$ is selected as the model’s prediction; we term this method PRM-Min-Sum. Further implementation details and the definition of other PRM aggregation methods appear in Appendix A.
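PRM-Min-Sum can be sketched in a few lines (the function name and input layout are illustrative):

```python
from collections import defaultdict


def prm_min_sum(solutions):
    """PRM-Min-Sum aggregation.

    solutions: list of (final_answer, step_rewards) pairs, one per
    candidate CoT, where step_rewards are the PRM rewards of its steps.
    Scores each solution by its minimum step reward, sums the scores
    per distinct final answer, and returns the best-scoring answer.
    """
    totals = defaultdict(float)
    for answer, step_rewards in solutions:
        totals[answer] += min(step_rewards)   # s(C^k) = min_i r_phi(C^k_i)
    return max(totals, key=totals.get)        # argmax_a S(a)
```

The min inside penalizes any single weak step, while the outer sum rewards answers reached by many reliable CoTs, combining verification with self-consistency-style voting.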
## 4 Experiments
To evaluate the effectiveness of the unified RL-based and search-based TTS technique in AIRL-S, we conduct extensive experiments to (i) compare the performance of our policy model to current state-of-the-art (SoTA) models of comparable size on eight reasoning tasks, (ii) compare our PRM to PRMs trained on labeled datasets and to implicit PRMs, (iii) combine our policy model and PRM with various search-based TTS techniques, and (iv) conduct additional experiments to investigate the effect of the combined AIRL and GRPO objectives in RL training and the impact of various answer aggregation techniques in search-based TTS.
### 4.1 Experimental Details
#### LLMs.
Policy Baselines: We select SoTA open-source and API models as baselines for our policy model. The open-source LLMs encompass models specifically designed for or excelling in reasoning through RL training, such as Math-Shepherd-Mistral-7B-RL [71], DeepSeek-Math-7B-RL [60], Qwen2.5-Math-7B-Instruct [24], and Eurus-2-7B-PRIME [11]. We also include Qwen2.5-7B-Instruct [56] and Phi-4-14B [2] as general-purpose open-source baselines. We further include s1.1-7B [49] as a baseline trained through direct imitation learning on DeepSeek-R1 demonstrations instead of AIRL. For API models, we select DeepSeek-R1 [20], DeepSeek-V3 (2025-03-24) [38], and GPT-4o (2024-11-20) [51] as baselines. PRM Baselines: We select Math-Shepherd-Mistral-7B-PRM [69] and Llama3.1-8B-PRM-Deepseek-Data [77] as PRM baselines trained on labeled step-wise data. We choose the recent EurusPRM-Stage2 [79] as an implicit PRM baseline trained through ORM. More details of the baseline models are provided in Appendix B.
#### Tasks.
Mathematical/Scientific Reasoning: We evaluate on AIME2024 [45], AMC [46], MATH500 [23], and GPQA-Diamond [59] to validate the effectiveness of AIRL-S in mathematical or scientific reasoning tasks. Coding: We further evaluate the reasoning capabilities of our model in coding tasks on four benchmarks, including HumanEval [7], MBPP [19], LeetCode [10], and LiveCodeBench (v4) [26]. Details of all the tasks are introduced in Appendix B.
#### Training.
Our full training dataset contains 160K questions, mostly sampled from NuminaMATH [33]. We use responses self-generated by the LLM policy as the reference rollouts; the rollouts are generated using rollout prompts, with further details in Appendix B. We initialize both $\pi_{\theta}$ and $f_{\phi}$ from Qwen2.5-7B-Instruct, replacing the head of $f_{\phi}$ with a scalar-value prediction head. Training is conducted on 8 NVIDIA A100 GPUs with PyTorch FSDP. We use a learning rate of 2e-6 for the reward model and 5e-7 for the policy model; the maximum output length is set to 8,192 and the training batch size to 1,024. For each question, we sample 8 outputs using the rollout prompts. The KL coefficient is set to 0.001, $\lambda$ is set to 0.5, and we train for 2 epochs in total.
#### Evaluation.
For results where search-based TTS is not applied, we report zero-shot Accuracy@1 for math and science tasks, and Pass@1 for coding tasks using a temperature of 0 for reproducibility. For results where search-based TTS is applied, we use a temperature of 0.7 to enable step exploration, tree width $M$ is set to 4, and the performance of the PRM aggregated solution is reported. The main results are averaged over three runs.
### 4.2 Main Results
#### Effectiveness of the Policy Model.
To comprehensively validate the performance of our policy model Qwen2.5-7B-AIRL-S, we compare it to ten SoTA LLMs (open- and closed-source) on eight reasoning tasks. The results in Table 1 lead to the following observations: (i) AIRL-S effectively enhances the reasoning performance of the base model. When trained on Qwen2.5-7B-Instruct, our policy achieves an average improvement of 9% across eight benchmarks and a 13% improvement on mathematical and scientific reasoning tasks. (ii) Our policy exhibits strong performance compared to existing LLMs trained through RL. AIRL-S outperforms Math-Shepherd-Mistral-7B-RL, DeepSeek-Math-7B-RL, and Qwen2.5-Math-7B-Instruct (trained with PRMs from labeled step-wise datasets) as well as Eurus-2-7B-PRIME (trained with an implicit PRM derived from its policy). Compared to these models, AIRL-S consistently achieves higher reasoning accuracy on all tasks by learning a policy-independent step-wise reward function through AIRL. (iii) Our approach surpasses direct imitation learning on reference rollouts. Compared to s1.1-7B, trained via supervised fine-tuning on DeepSeek-R1 demonstrations, AIRL-S yields an average improvement of 15% by learning a reward function from the reference rollouts and maximizing its rewards rather than imitating the rollouts directly. (iv) Finally, AIRL-S outperforms the recently released Phi-4-14B on average despite having half its parameter count, and it matches the performance of the general-purpose model GPT-4o.
Table 1: Comparison of our policy model with ten SoTA open- and closed-source LLMs across eight reasoning benchmarks. Boldface indicates the top result within each model category; green highlighting denotes the best result among local models. Our policy achieves the highest average performance among local models and matches GPT-4o overall. (GPQA: GPQA-Diamond; LCB-v4: LiveCodeBench-v4)
| Model | AIME2024 | AMC | MATH500 | GPQA | HumanEval | Leetcode | LCB-v4 | MBPP | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| API Models | | | | | | | | | |
| DeepSeek-R1 | 79.8 | 85.5 | 96.5 | 71.2 | 97.6 | 90.4 | 71.1 | 95.8 | 86.0 |
| DeepSeek-V3 | 36.7 | 81.9 | 90.2 | 61.1 | 93.3 | 88.3 | 67.0 | 88.9 | 75.9 |
| GPT-4o | 13.3 | 50.6 | 65.8 | 43.9 | 90.2 | 60.6 | 44.2 | 87.3 | 57.0 |
| Local Models (RL with PRMs) | | | | | | | | | |
| Math-Shepherd-Mistral-7B-RL | 0.0 | 7.2 | 28.6 | 9.6 | 32.3 | 5.6 | 3.9 | 51.1 | 17.3 |
| DeepSeekMath-7B-RL | 3.3 | 18.1 | 48.6 | 23.2 | 58.5 | 11.2 | 6.2 | 73.1 | 30.3 |
| Qwen2.5-Math-7B-Instruct | 13.3 | 50.6 | 79.6 | 29.3 | 57.9 | 11.7 | 9.3 | 46.0 | 37.2 |
| Eurus-2-7B-PRIME | 20.0 | 56.6 | 79.2 | 33.8 | 70.7 | 31.1 | 24.3 | 70.1 | 48.2 |
| Local Models (Other Baselines) | | | | | | | | | |
| s1.1-7B | 16.7 | 38.6 | 72.6 | 35.4 | 76.8 | 11.1 | 9.3 | 76.7 | 42.2 |
| Qwen2.5-7B-Instruct | 16.7 | 33.7 | 72.0 | 32.5 | 81.7 | 47.4 | 28.0 | 79.4 | 48.9 |
| Phi-4-14B | 13.3 | 44.6 | 78.6 | 55.6 | 84.1 | 45.6 | 31.2 | 74.3 | 53.4 |
| Qwen2.5-7B-AIRL-S (Ours) | 26.7 | 59.0 | 80.2 | 40.2 | 85.1 | 54.4 | 31.3 | 83.3 | 57.5 |
#### Effectiveness of the PRM.
We assess the generalizability of our PRM, Qwen2.5-AIRL-S-PRM, by comparing it to PRMs trained on labeled step-wise datasets. We evaluate four PRMs using Best-of-N search with 64 rollouts on AIME2024, AMC, and MATH500 across four generative LLMs. The average performance of each LLM–PRM pair is shown in Figure 2, and detailed results are provided in Appendix C. Our observations are: (i) Qwen2.5-AIRL-S-PRM improves Best-of-N performance for all LLMs and datasets, outperforming Math-Shepherd-Mistral-7B-PRM and EurusPRM-Stage2 by 2.4% and 1.4%, respectively. (ii) Combining Qwen2.5-7B-AIRL-S with Qwen2.5-AIRL-S-PRM yields the best performance among all LLM–PRM combinations, with an 11% gain over Qwen2.5-7B-Instruct with Math-Shepherd-Mistral-7B-PRM. These results demonstrate that AIRL-S effectively unifies RL-based and search-based TTS and that our PRM generalizes across different models and datasets.
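To make the Best-of-N procedure concrete, the following is a minimal sketch. Here `generate_rollout` and `prm_score_steps` are hypothetical stand-ins for the policy LLM and the PRM, and taking the minimum step reward as the rollout score is one common aggregation choice, not necessarily the exact one used in the experiments.

```python
# Best-of-N search with a PRM verifier: a minimal sketch (assumes n >= 1).
from typing import Callable, List

def best_of_n(
    prompt: str,
    n: int,
    generate_rollout: Callable[[str], List[str]],              # one solution as a list of steps
    prm_score_steps: Callable[[str, List[str]], List[float]],  # per-step PRM rewards
) -> List[str]:
    """Sample n complete solutions and return the one with the highest
    aggregated PRM score (here: minimum step reward)."""
    best_steps, best_score = None, float("-inf")
    for _ in range(n):
        steps = generate_rollout(prompt)
        step_rewards = prm_score_steps(prompt, steps)
        score = min(step_rewards)  # a chain is only as strong as its weakest step
        if score > best_score:
            best_steps, best_score = steps, score
    return best_steps
```

Swapping `min` for the last step's reward, or for a mean, yields the other per-rollout scoring variants discussed later.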
<details>
<summary>x2.png Details</summary>

### Visual Description
## Bar Chart: Average Accuracy of each LLM and PRM combination using Best-of-N
### Overview
This is a grouped bar chart comparing the performance of four different Large Language Models (LLMs) when paired with five different Process Reward Models (PRMs) or evaluation methods. The performance metric is average accuracy percentage, measured using a "Best-of-N" sampling strategy. The chart demonstrates how the choice of PRM significantly impacts the final accuracy score for each base LLM.
### Components/Axes
* **Chart Title:** "Average Accuracy of each LLM and PRM combination using Best-of-N"
* **Y-Axis:** Labeled "Accuracy (%)". The scale runs from 40 to 65, with major gridlines at intervals of 5% (40, 45, 50, 55, 60, 65).
* **X-Axis:** Lists four distinct LLM models, which form the primary groups:
1. `Qwen2.5-7B-Instruct`
2. `Eurus-2-7B-PRIME`
3. `Phi-4-14B`
4. `Qwen2.5-7B-AIRL-S(Our LLM)`
* **Legend:** Located in the top-left corner of the plot area. It defines five data series (PRM/evaluation methods), each associated with a specific color:
* **Pink:** `Accuracy@1`
* **Light Beige:** `Math-Shepherd-Mistral-7B-PRM`
* **Light Gray:** `EurusPRM-Stage2`
* **Light Green:** `Llama3.1-8B-PRM-Deepseek-Data`
* **Dark Gray:** `Qwen2.5-AIRL-S-PRM(Ours PRM)`
### Detailed Analysis
The chart displays five bars for each of the four LLM groups. The values are annotated on top of each bar.
**1. Group: Qwen2.5-7B-Instruct**
* **Trend:** Accuracy increases progressively from the baseline `Accuracy@1` to the advanced PRMs.
* **Data Points:**
* Accuracy@1 (Pink): **40.8%**
* Math-Shepherd-Mistral-7B-PRM (Light Beige): **51.1%**
* EurusPRM-Stage2 (Light Gray): **52.6%**
* Llama3.1-8B-PRM-Deepseek-Data (Light Green): **53.2%**
* Qwen2.5-AIRL-S-PRM (Dark Gray): **53.8%**
**2. Group: Eurus-2-7B-PRIME**
* **Trend:** Similar upward trend. The gap between the baseline and the best PRM is smaller than in the first group.
* **Data Points:**
* Accuracy@1 (Pink): **51.9%**
* Math-Shepherd-Mistral-7B-PRM (Light Beige): **56.3%**
* EurusPRM-Stage2 (Light Gray): **56.1%** *(Note: Slightly lower than the previous bar)*
* Llama3.1-8B-PRM-Deepseek-Data (Light Green): **57.3%**
* Qwen2.5-AIRL-S-PRM (Dark Gray): **57.6%**
**3. Group: Phi-4-14B**
* **Trend:** A clear, steady increase in accuracy across the PRM sequence.
* **Data Points:**
* Accuracy@1 (Pink): **45.5%**
* Math-Shepherd-Mistral-7B-PRM (Light Beige): **53.7%**
* EurusPRM-Stage2 (Light Gray): **54.5%**
* Llama3.1-8B-PRM-Deepseek-Data (Light Green): **55.5%**
* Qwen2.5-AIRL-S-PRM (Dark Gray): **56.1%**
**4. Group: Qwen2.5-7B-AIRL-S(Our LLM)**
* **Trend:** This group shows the highest overall accuracies. The trend is upward, with a notable jump to the final PRM.
* **Data Points:**
* Accuracy@1 (Pink): **55.3%**
* Math-Shepherd-Mistral-7B-PRM (Light Beige): **59.8%**
* EurusPRM-Stage2 (Light Gray): **60.2%**
* Llama3.1-8B-PRM-Deepseek-Data (Light Green): **59.3%** *(Note: Slight dip compared to previous bar)*
* Qwen2.5-AIRL-S-PRM (Dark Gray): **61.3%**
### Key Observations
1. **Consistent PRM Hierarchy:** In almost every LLM group, the `Accuracy@1` (pink) bar is the lowest, and the `Qwen2.5-AIRL-S-PRM` (dark gray) bar is the highest. This pattern holds for three out of four groups, with the `Eurus-2-7B-PRIME` group being a very close exception.
2. **Performance of "Our" Models:** The chart highlights two "Ours" components: the LLM `Qwen2.5-7B-AIRL-S` and the PRM `Qwen2.5-AIRL-S-PRM`. Their combination yields the highest overall accuracy on the chart (**61.3%**).
3. **Baseline vs. PRM Boost:** The improvement from using any PRM over the `Accuracy@1` baseline is substantial, ranging from approximately +13 to +18 percentage points across all LLMs.
4. **Minor Anomalies:** There are two instances where the strict ascending order is broken:
* In the `Eurus-2-7B-PRIME` group, `EurusPRM-Stage2` (56.1%) is marginally lower than `Math-Shepherd-Mistral-7B-PRM` (56.3%).
* In the `Qwen2.5-7B-AIRL-S` group, `Llama3.1-8B-PRM-Deepseek-Data` (59.3%) is lower than both `Math-Shepherd` (59.8%) and `EurusPRM` (60.2%).
### Interpretation
This chart provides strong evidence for the efficacy of Process Reward Models (PRMs) in improving the mathematical reasoning accuracy of LLMs when using a Best-of-N sampling strategy. The data suggests that the selection of PRM is a critical hyperparameter, often leading to greater performance gains than the difference between some of the base LLMs themselves.
The consistent superiority of the `Qwen2.5-AIRL-S-PRM` across different LLM backbones indicates it is a robust and high-performing reward model. The fact that the authors' own LLM (`Qwen2.5-7B-AIRL-S`) paired with their own PRM achieves the top result suggests a successful co-design or fine-tuning strategy tailored for this task.
The minor dips in performance for certain PRMs within specific LLM groups (e.g., `EurusPRM` on `Eurus-2-7B-PRIME`) hint at potential compatibility issues or that a PRM's effectiveness may not be perfectly universal, possibly depending on the underlying data distribution or model architecture it was trained to evaluate. Overall, the chart makes a clear case for investing in specialized PRMs to unlock higher performance from LLMs in reasoning tasks.
</details>
Figure 2: Average performance of four PRMs applied to four generative LLMs using Best-of-N with 64 rollouts on AIME2024, AMC, and MATH500. Our PRM (Qwen2.5-AIRL-S-PRM) consistently delivers the highest test-time search performance across all models and datasets.
#### Effectiveness on Different Search-Based TTS Methods.
We evaluate our PRM within three search-based TTS methods (MCTS, beam search, and Best-of-N), using Qwen2.5-7B-AIRL-S as the generative LLM and Qwen2.5-AIRL-S-PRM as the verifier. We include Self-Consistency [70] as a sanity-check baseline: it generates multiple complete solutions and uses majority vote to select the final answer, without any PRM or ORM. Results on AMC and MATH500 in Figure 3 show that: (i) Our PRM effectively guides the search, selection, and aggregation processes for all search algorithms. It improves performance across all search-based methods as the number of rollouts increases and outperforms Self-Consistency under the same rollout budget. This result indicates that reward-guided step generation and selection produce more accurate intermediate decisions than majority voting on final answers. (ii) The PRM is especially effective for value-based methods like MCTS. By estimating step values during tree expansion and selecting nodes by UCT score, MCTS benefits more from the PRM’s guidance, leading to more accurate intermediate steps.
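The UCT selection mentioned above can be sketched as follows. The `Node` structure and the convention of accumulating PRM step scores into `value_sum` are our assumptions for illustration; the paper's actual MCTS implementation may differ in its value backup and exploration constant.

```python
import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One reasoning step in the search tree; value_sum accumulates PRM scores
    backed up through this node, and visits counts how often it was selected."""
    step_text: str
    visits: int = 0
    value_sum: float = 0.0
    children: List["Node"] = field(default_factory=list)

    @property
    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def uct_select(parent: Node, c: float = 1.41) -> Node:
    """Pick the child maximizing Q + c * sqrt(ln(N_parent) / N_child);
    unvisited children are explored first."""
    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")
        return child.q + c * math.sqrt(math.log(parent.visits) / child.visits)
    return max(parent.children, key=uct)
```

With a PRM supplying the step values, high-Q children correspond to reasoning steps the reward model judges sound, which is exactly where the MCTS gains in Figure 3 come from.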
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Test Time Search Performance on AMC
### Overview
The image displays a line chart comparing the performance of four different search algorithms or strategies on the AMC (likely a benchmark dataset or task) as a function of computational budget, measured in "Generation Rollouts." The chart plots Accuracy (%) against the number of rollouts on a logarithmic scale (base 2).
### Components/Axes
* **Chart Title:** "Test Time Search Performance on AMC" (centered at the top).
* **Y-Axis:** Labeled "Accuracy (%)". The scale runs from 58 to 68, with major tick marks every 2 units (58, 60, 62, 64, 66, 68).
* **X-Axis:** Labeled "Generation Rollouts". The scale is logarithmic, with tick marks at powers of two: 2⁰, 2¹, 2², 2³, 2⁴, 2⁵, 2⁶.
* **Legend:** Positioned in the bottom-right quadrant of the chart area. It contains four entries, each with a colored line segment and a marker:
* **Blue line with circle markers:** "Self-Consistency"
* **Orange line with circle markers:** "Best-of-N"
* **Green line with circle markers:** "Beam Search"
* **Red line with circle markers:** "MCTS"
### Detailed Analysis
The chart tracks four data series. All series begin at the same point (59% accuracy at 2⁰ rollouts). Below is the extracted data and trend for each series.
**1. Self-Consistency (Blue Line)**
* **Trend:** Shows a steady, near-linear increase in accuracy as rollouts increase, but plateaus at the highest rollout counts.
* **Data Points (Approximate):**
* 2⁰: 59.0%
* 2¹: 59.0%
* 2²: 60.2%
* 2³: 61.4%
* 2⁴: 62.7%
* 2⁵: 63.9%
* 2⁶: 63.9%
**2. Best-of-N (Orange Line)**
* **Trend:** Exhibits a volatile but overall strong upward trend. It shows a significant jump between 2¹ and 2², a peak at 2³, a dip at 2⁴, and then a strong recovery to reach the highest final accuracy.
* **Data Points (Approximate):**
* 2⁰: 59.0%
* 2¹: 59.0%
* 2²: 62.7%
* 2³: 65.1%
* 2⁴: 63.9%
* 2⁵: 65.1%
* 2⁶: 67.5%
**3. Beam Search (Green Line)**
* **Trend:** Demonstrates a consistent upward trend that plateaus in the middle range (2³ to 2⁴) before a final increase.
* **Data Points (Approximate):**
* 2⁰: 59.0%
* 2¹: 59.0%
* 2²: 63.9%
* 2³: 65.1%
* 2⁴: 65.1%
* 2⁵: 66.3%
* 2⁶: 66.3%
**4. MCTS (Red Line)**
* **Trend:** Shows the most consistent and steepest upward trajectory, with no dips. It matches the highest accuracy at the final data point.
* **Data Points (Approximate):**
* 2⁰: 59.0%
* 2¹: 60.2%
* 2²: 63.9%
* 2³: 65.1%
* 2⁴: 66.3%
* 2⁵: 66.3%
* 2⁶: 67.5%
### Key Observations
1. **Common Starting Point:** All four methods begin at an identical accuracy of 59% with a single rollout (2⁰).
2. **Performance Hierarchy at Scale:** At the maximum rollout count (2⁶ = 64), MCTS and Best-of-N are tied for the highest performance (~67.5%), followed by Beam Search (~66.3%), with Self-Consistency performing the lowest (~63.9%).
3. **Divergence Point:** The performance of the methods begins to diverge significantly after 2¹ rollouts.
4. **Plateau Behavior:** Self-Consistency plateaus after 2⁵ rollouts. Beam Search shows a plateau between 2³ and 2⁴. MCTS and Best-of-N show no sign of plateauing within the measured range.
5. **Volatility:** Best-of-N is the only method to show a decrease in accuracy (at 2⁴) before recovering.
### Interpretation
This chart demonstrates the scaling behavior of different search strategies with respect to computational budget (rollouts). The key insight is that more sophisticated search methods (MCTS, Best-of-N) exhibit better scaling laws—they continue to gain significant accuracy as more computation is allocated, whereas simpler methods like Self-Consistency see diminishing returns earlier.
The data suggests that for the AMC task, investing in increased rollouts is most beneficial when using MCTS or Best-of-N strategies. The dip in Best-of-N at 2⁴ could indicate sensitivity to a specific rollout count or a potential instability in the method that is overcome with even more samples. The identical starting point implies that with minimal computation, the choice of search strategy is irrelevant; their value is unlocked only when sufficient computational resources are available for exploration. The final tie between MCTS and Best-of-N at 2⁶ raises the question of whether one might eventually outperform the other with even larger budgets, or if they have converged to a similar performance ceiling for this task.
</details>
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: Test Time Search Performance on MATH500
### Overview
This is a line chart comparing the performance of four different search strategies on the MATH500 benchmark. The chart plots accuracy percentage against the number of generation rollouts, which is presented on a logarithmic scale (base 2). All four strategies show improved accuracy as the number of rollouts increases, with performance gains beginning to plateau after approximately 8 (2³) rollouts.
### Components/Axes
* **Chart Title:** "Test Time Search Performance on MATH500"
* **Y-Axis:**
* **Label:** "Accuracy (%)"
* **Scale:** Linear, ranging from 80 to 88, with major tick marks every 2 units (80, 82, 84, 86, 88).
* **X-Axis:**
* **Label:** "Generation Rollouts"
* **Scale:** Logarithmic (base 2), with markers at 2⁰ (1), 2¹ (2), 2² (4), 2³ (8), 2⁴ (16), 2⁵ (32), and 2⁶ (64).
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains four entries, each with a colored line, a marker symbol, and a text label.
1. **Blue line with circle markers:** "Self-Consistency"
2. **Orange line with circle markers:** "Best-of-N"
3. **Green line with circle markers:** "Beam Search"
4. **Red line with circle markers:** "MCTS"
### Detailed Analysis
The chart displays four data series, each representing a search strategy. All lines originate from the same approximate starting point at 2⁰ rollouts and trend upward with a diminishing rate of increase.
**Trend Verification:** All four lines slope upward from left to right, indicating that accuracy improves with more generation rollouts. The rate of improvement is steepest between 2⁰ and 2³ rollouts and becomes more gradual thereafter.
**Data Point Extraction (Approximate Values):**
* **At 2⁰ (1 rollout):** All four strategies start at approximately **80.2%** accuracy.
* **At 2¹ (2 rollouts):**
* MCTS (Red): ~81.8%
* Best-of-N (Orange): ~81.7%
* Beam Search (Green): ~81.6%
* Self-Consistency (Blue): ~81.0%
* **At 2² (4 rollouts):**
* MCTS (Red): ~84.5%
* Best-of-N (Orange): ~84.3%
* Beam Search (Green): ~84.0%
* Self-Consistency (Blue): ~83.4%
* **At 2³ (8 rollouts):**
* MCTS (Red): ~85.6%
* Best-of-N (Orange): ~85.5%
* Beam Search (Green): ~85.0%
* Self-Consistency (Blue): ~84.0%
* **At 2⁴ (16 rollouts):**
* MCTS (Red): ~86.0%
* Best-of-N (Orange): ~85.9%
* Beam Search (Green): ~85.4%
* Self-Consistency (Blue): ~84.4%
* **At 2⁵ (32 rollouts):**
* MCTS (Red): ~86.4%
* Best-of-N (Orange): ~86.3%
* Beam Search (Green): ~86.0%
* Self-Consistency (Blue): ~84.6%
* **At 2⁶ (64 rollouts):**
* MCTS (Red): ~86.8%
* Best-of-N (Orange): ~86.4%
* Beam Search (Green): ~86.2%
* Self-Consistency (Blue): ~84.8%
### Key Observations
1. **Performance Hierarchy:** A consistent ranking is maintained across all rollout counts greater than 1. From highest to lowest accuracy: **MCTS > Best-of-N > Beam Search > Self-Consistency**.
2. **Diminishing Returns:** The most significant accuracy gains for all methods occur within the first 8 (2³) rollouts. After this point, the curves flatten, indicating that additional rollouts yield progressively smaller improvements.
3. **Convergence of Top Methods:** The performance gap between MCTS (red) and Best-of-N (orange) is very small, especially at higher rollout counts. They are nearly indistinguishable at 2³ and 2⁴ rollouts.
4. **Self-Consistency Lag:** The Self-Consistency method (blue) consistently underperforms the other three strategies by a margin of approximately 1.5-2.0 percentage points at higher rollout counts (2⁴ and above).
### Interpretation
This chart demonstrates the effectiveness of different test-time compute strategies for improving mathematical reasoning performance. The data suggests that:
* **Allocating more compute (rollouts) improves accuracy**, but with clear diminishing returns. The "sweet spot" for efficiency appears to be around 8-16 rollouts, after which the cost of additional compute may not justify the marginal accuracy gain.
* **Monte Carlo Tree Search (MCTS) is the most effective strategy** among those compared, consistently achieving the highest accuracy. This implies that its method of exploring and evaluating multiple reasoning paths is superior for this task.
* **Best-of-N and Beam Search are highly competitive** and perform nearly as well as MCTS, especially when given sufficient rollouts. They represent strong, potentially less complex alternatives.
* **Self-Consistency, while beneficial, is the least effective** of the four methods shown. This indicates that simply aggregating multiple independent samples (its likely mechanism) is less powerful than the more structured search employed by the other techniques.
The chart provides a clear empirical comparison for researchers or engineers selecting a search strategy for language models on mathematical tasks, highlighting the trade-off between computational cost (rollouts) and performance (accuracy).
</details>
Figure 3: Comparison of test-time search performance with our PRM applied to MCTS, Beam Search, and Best-of-N across varying rollout counts. Our PRM consistently improves performance for all search techniques.
### 4.3 Ablation Study
#### Impact of Our PRM on RL Training.
To assess the contribution of our reward function to RL training, we compare AIRL-S to GRPO trained with outcome rewards only (no PRM), while keeping all other settings identical. The results in Figure 4 show that adding our PRM improves both training and validation accuracy. Furthermore, as training progresses, AIRL-S enables longer response generation at test time, demonstrating its ability to scale inference computation.
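The section does not restate here how the dense PRM signal enters the GRPO update, so the following is only one plausible sketch (our assumption, not the authors' exact formulation): blend each rollout's sparse outcome reward with its mean PRM step reward, then compute group-relative advantages in the usual GRPO way.

```python
import statistics
from typing import List

def grpo_advantages(
    outcome_rewards: List[float],         # sparse 0/1 correctness per rollout
    prm_step_rewards: List[List[float]],  # dense per-step PRM rewards per rollout
    prm_weight: float = 0.5,              # illustrative blending weight
) -> List[float]:
    """Group-relative advantages: blend outcome reward with mean PRM step
    reward per rollout, then normalize across the group of rollouts."""
    blended = [
        o + prm_weight * (sum(s) / len(s) if s else 0.0)
        for o, s in zip(outcome_rewards, prm_step_rewards)
    ]
    mu = statistics.mean(blended)
    sigma = statistics.pstdev(blended) or 1.0  # guard against a zero-variance group
    return [(b - mu) / sigma for b in blended]
```

Under this reading, the PRM turns the all-or-nothing outcome signal into a graded one, which is consistent with the smoother, longer-lived improvement of the AIRL-S curve in Figure 4.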
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Chart: Training Accuracy Comparison
### Overview
The image displays a line chart titled "Training Accuracy," comparing the performance of two machine learning training methods over 200+ steps. The chart plots accuracy values on the y-axis against training steps on the x-axis. Each method is represented by a solid line (mean performance) and a semi-transparent shaded region (likely representing variance or confidence intervals).
### Components/Axes
* **Chart Title:** "Training Accuracy" (centered at the top).
* **Y-Axis:**
* **Label:** "Accuracy" (rotated vertically on the left side).
* **Scale:** Linear scale ranging from 0.32 to 0.42.
* **Tick Marks:** Major ticks at 0.32, 0.34, 0.36, 0.38, 0.40, 0.42.
* **X-Axis:**
* **Label:** "Step" (centered at the bottom).
* **Scale:** Linear scale from 0 to just beyond 200.
* **Tick Marks:** Major ticks at 0, 50, 100, 150, 200.
* **Legend:** Located in the bottom-right quadrant of the chart area.
* **Blue Line:** Labeled "GRPO (w/o PRM)".
* **Red Line:** Labeled "AIRL-S (w. PRM)".
* **Data Series:**
1. **Blue Line (GRPO w/o PRM):** A solid blue line with a light blue shaded region around it.
2. **Red Line (AIRL-S w. PRM):** A solid red line with a light red shaded region around it.
### Detailed Analysis
**Trend Verification & Data Points:**
* **GRPO (w/o PRM) - Blue Line:**
* **Visual Trend:** The line shows a rapid initial increase from step 0 to approximately step 50, after which the rate of improvement slows significantly, entering a noisy plateau phase from step 100 onward.
* **Approximate Data Points:**
* Step 0: ~0.32
* Step 50: ~0.36
* Step 100: ~0.37
* Step 150: ~0.375
* Step 200: ~0.38
* **Uncertainty (Shaded Region):** The blue shaded region is relatively narrow initially but widens considerably after step 50, indicating increased variance in performance. The width suggests the accuracy for this method fluctuates within a band of approximately ±0.01 to ±0.015 around the mean line during the plateau phase.
* **AIRL-S (w. PRM) - Red Line:**
* **Visual Trend:** The line exhibits a strong, sustained upward trend throughout the entire training period shown. The slope is steepest initially and remains positive, though slightly less steep, after step 100. It consistently stays above the blue line after the first ~20 steps.
* **Approximate Data Points:**
* Step 0: ~0.32 (similar starting point to blue line)
* Step 50: ~0.385
* Step 100: ~0.405
* Step 150: ~0.415
* Step 200: ~0.42
* **Uncertainty (Shaded Region):** The red shaded region is also present and appears to widen as training progresses, similar to the blue region. Its width suggests a variance of approximately ±0.01 to ±0.02 around the mean red line, particularly in the later steps.
**Spatial Grounding:** The legend is positioned in the bottom-right, clearly associating the blue color with "GRPO (w/o PRM)" and the red color with "AIRL-S (w. PRM)". The lines and their corresponding shaded regions maintain these color assignments throughout the chart.
### Key Observations
1. **Performance Gap:** A clear and growing performance gap emerges early in training. By step 50, AIRL-S (w. PRM) is already about 0.025 accuracy points higher than GRPO (w/o PRM). This gap widens to approximately 0.04 points by step 200.
2. **Learning Dynamics:** GRPO (w/o PRM) appears to converge or plateau around an accuracy of 0.37-0.38 after step 100. In contrast, AIRL-S (w. PRM) shows no clear signs of plateauing within the 200-step window and continues to improve.
3. **Initial Conditions:** Both methods start at nearly the identical accuracy level (~0.32) at step 0.
4. **Noise/Variance:** Both training processes exhibit significant step-to-step noise, as evidenced by the jaggedness of the mean lines and the width of the shaded confidence bands. The variance appears comparable between the two methods.
### Interpretation
The chart presents a comparative analysis of two training algorithms, likely in the domain of reinforcement learning or iterative model optimization, given the "Step" axis and the acronyms (GRPO, AIRL-S, PRM).
* **What the data suggests:** The method "AIRL-S (w. PRM)" demonstrates superior learning efficiency and final performance compared to "GRPO (w/o PRM)" on this specific task, as measured by training accuracy. The inclusion of the PRM (process reward model) appears to be a critical factor enabling sustained learning and higher asymptotic performance.
* **Relationship between elements:** The direct comparison on the same axes controls for task and evaluation metrics, isolating the effect of the algorithmic difference (AIRL-S vs. GRPO) and the presence/absence of PRM. The shared starting point reinforces that the divergence is due to the training process, not initial model states.
* **Notable patterns/anomalies:** The most significant pattern is the divergence in learning trajectories. The plateau of the blue line suggests it may have reached a local optimum or a limit imposed by its algorithmic structure. The continued rise of the red line indicates that AIRL-S with PRM either has a better optimization landscape, avoids premature convergence, or incorporates a mechanism (possibly the PRM) that facilitates ongoing improvement. The high variance in both signals is typical of many stochastic training processes but does not obscure the clear trend difference.
</details>
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Chart: Validation Accuracy Comparison
### Overview
The image is a line chart titled "Validation Accuracy" that compares the performance of two different methods over a series of training steps. The chart plots accuracy values on the y-axis against training steps on the x-axis, showing how each method's validation accuracy evolves.
### Components/Axes
* **Chart Title:** "Validation Accuracy" (centered at the top).
* **Y-Axis:** Labeled "Accuracy". The scale runs from approximately 0.38 to 0.44, with major tick marks at 0.38, 0.40, 0.42, and 0.44.
* **X-Axis:** Labeled "Step". The scale runs from 0 to 200, with major tick marks at 0, 50, 100, 150, and 200.
* **Legend:** Located in the bottom-right quadrant of the chart area.
* A blue line segment is labeled **"GRPO (w/o PRM)"**.
* A red line segment is labeled **"AIRL-S (w. PRM)"**.
* **Data Series:** Two lines with circular markers at data points.
* **Blue Line (GRPO w/o PRM):** Represents one method.
* **Red Line (AIRL-S w. PRM):** Represents the other method.
### Detailed Analysis
**Data Series 1: GRPO (w/o PRM) - Blue Line**
* **Trend:** The line shows an overall upward trend, indicating improving accuracy over steps. The increase is steepest between steps 0 and 100, after which it plateaus with minor fluctuations.
* **Approximate Data Points:**
| Step | GRPO (w/o PRM) Accuracy | AIRL-S (w. PRM) Accuracy |
| :--- | :--- | :--- |
| 0 | ~0.375 | ~0.375 |
| 25 | ~0.385 | ~0.389 |
| 50 | ~0.397 | ~0.407 |
| 75 | ~0.406 | ~0.417 |
| 100 | ~0.414 | ~0.425 |
| 125 | ~0.412 | ~0.428 |
| 150 | ~0.416 | ~0.425 |
| 175 | ~0.411 | ~0.432 |
| 200 | ~0.415 | ~0.435 |
| 225 | ~0.417 | ~0.437 |
| 250 | ~0.416 | ~0.438 |
**Data Series 2: AIRL-S (w. PRM) - Red Line**
* **Trend:** This line also shows a strong upward trend, consistently achieving higher accuracy than the blue line at every measured step after the start. It rises sharply until around step 150, after which the rate of increase slows, approaching a plateau near 0.44.
### Key Observations
1. **Performance Gap:** The red line (AIRL-S w. PRM) maintains a clear and consistent performance advantage over the blue line (GRPO w/o PRM) from approximately step 25 onward. The gap widens significantly between steps 50 and 150.
2. **Convergence:** Both methods show signs of convergence (plateauing) in the later steps (150-250), but the red line converges at a higher accuracy level (~0.438) compared to the blue line (~0.416).
3. **Initial Similarity:** Both methods start at nearly the same accuracy (~0.375) at step 0.
4. **Volatility:** The blue line exhibits slightly more volatility (e.g., dips at steps 125 and 175) compared to the smoother ascent of the red line.
### Interpretation
This chart demonstrates a comparative experiment between two algorithms or training methods, likely in a machine learning or reinforcement learning context. The key finding is that the method labeled **AIRL-S (w. PRM)** significantly outperforms **GRPO (w/o PRM)** in terms of validation accuracy over 250 training steps.
The inclusion of the PRM (process reward model) in the AIRL-S method appears to be a critical factor for its superior performance. The data suggests that AIRL-S with PRM not only learns faster (steeper initial slope) but also achieves a higher final performance ceiling. The plateauing of both curves indicates that further training steps beyond 250 may yield diminishing returns for both methods under the current conditions. The experiment strongly supports the efficacy of the AIRL-S approach with the PRM component for this specific task.
</details>
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Average Response Length
### Overview
The image displays a line chart comparing the average response length (in tokens or characters, unit unspecified) over training steps for two different reinforcement learning methods. The chart shows the progression of response length as training advances, with both methods exhibiting an initial dip followed by a sustained increase.
### Components/Axes
* **Chart Title:** "Average Response Length" (centered at the top).
* **Y-Axis:** Labeled "Response Length". The scale runs from 1100 to 1500, with major gridlines at intervals of 100 (1100, 1200, 1300, 1400, 1500).
* **X-Axis:** Labeled "Step". The scale runs from 0 to over 200, with major tick marks and labels at 0, 50, 100, 150, and 200.
* **Legend:** Located in the bottom-right quadrant of the chart area.
* A blue line is labeled **"GRPO (w/o PRM)"**.
* A red line is labeled **"AIRL-S (w. PRM)"**.
* **Data Series:** Two lines with shaded regions (likely representing confidence intervals or standard deviation).
* **Blue Line (GRPO w/o PRM):** Represents the method without a Process Reward Model.
* **Red Line (AIRL-S w. PRM):** Represents the method with a Process Reward Model.
### Detailed Analysis
**Trend Verification:**
* **Both Lines:** Exhibit a similar macro-trend: an initial decline from step 0 to approximately step 50, followed by a strong, sustained upward trend until the end of the plotted steps (~240).
* **Blue Line (GRPO):** Starts at ~1170. Dips to a minimum of ~1080 at step 50. Rises steadily, crossing 1300 around step 110 and 1400 around step 160. Shows high volatility after step 150, with sharp peaks and troughs between ~1320 and ~1460.
* **Red Line (AIRL-S):** Starts higher at ~1230. Dips to a minimum of ~1110 at step 50. Rises more steeply than the blue line initially, crossing 1300 around step 90 and 1400 around step 140. Maintains a generally higher value than the blue line from step ~60 onward. Also shows high volatility in later steps, with peaks reaching near 1480.
**Key Data Points (Approximate):**
* **Step 0:** GRPO ~1170, AIRL-S ~1230.
* **Step 50 (Trough):** GRPO ~1080, AIRL-S ~1110.
* **Step 100:** GRPO ~1200, AIRL-S ~1280.
* **Step 150:** GRPO ~1400, AIRL-S ~1380 (lines intersect around here).
* **Step 200:** GRPO ~1420, AIRL-S ~1440.
* **Final Steps (~240):** GRPO ~1380, AIRL-S ~1450.
**Spatial Grounding & Confidence Intervals:**
The shaded blue and red regions around each line indicate variance. The variance appears to increase for both methods as the step count and response length increase, particularly after step 150, where the shaded bands become wider and the lines more jagged.
### Key Observations
1. **Initial Dip:** Both methods cause a decrease in average response length during the first 50 steps of training.
2. **Sustained Growth:** After step 50, both methods drive a significant and continuous increase in response length for the remainder of the training shown.
3. **Method Comparison:** The AIRL-S (with PRM) method generally results in longer average responses than GRPO (without PRM) after the initial training phase (post step ~60). The gap between them is most pronounced between steps 60-120.
4. **Increased Volatility:** In the later stages of training (steps >150), both methods exhibit high-frequency oscillations in average response length, suggesting less stability in the learned policy's output length.
### Interpretation
This chart demonstrates the effect of two different reinforcement learning algorithms on the verbosity of a model's generated responses during training.
* **The initial dip** suggests an early phase where the models might be optimizing for other factors (like accuracy or reward) at the expense of length, or are exploring a more concise output space.
* **The strong upward trend** indicates that both algorithms successfully incentivize longer responses over time. This could be because longer responses are correlated with higher rewards in the training environment (e.g., more detailed answers are preferred).
* **The superiority of AIRL-S (w. PRM)** implies that incorporating a Process Reward Model provides a better or more stable learning signal for increasing response length compared to the GRPO baseline without it. The PRM may offer more granular feedback that encourages elaboration.
* **The late-stage volatility** is a critical observation. It indicates that while the models learn to produce longer responses, the consistency of that length degrades. This could be a sign of over-optimization, policy instability, or that the reward function does not strongly penalize variance in length once a certain threshold is passed.
In summary, the data suggests that using AIRL-S with a PRM is more effective for training a model to generate longer responses than GRPO without a PRM, though both methods lead to increased length and eventual instability in output length consistency.
</details>
Figure 4: Comparison of AIRL-S and GRPO trained with outcome rewards only. AIRL-S improves training and validation performance and enables longer response generation at test time.
#### Impact of PRM Aggregation Techniques.
As described in Section 3.5 and Appendix A, different PRM aggregation techniques can select the final answer from multiple rollouts. We evaluate these techniques using our policy and PRM with Best-of-N search on MATH500; results for the five aggregation techniques (defined in Appendix A) appear in Figure 5. PRM-Min-Sum and PRM-Last-Sum achieve the highest accuracy, showing that step-level rewards enable effective answer aggregation. In contrast, PRM-Min-Max and PRM-Last-Max achieve the worst performance, demonstrating the importance of inter-answer aggregation via PRM-weighted voting (Sum Score) to integrate the answers from multiple rollouts.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Performance of Different PRM Aggregation Techniques
### Overview
This is a line chart comparing the performance (accuracy) of five different "PRM Aggregation Techniques" as the number of "Generation Rollouts" increases. The chart demonstrates how each technique's accuracy scales with more computational rollouts, with some techniques showing significantly better scaling than others.
### Components/Axes
* **Chart Title:** "Performance of Different PRM Aggregation Techniques"
* **Y-Axis:** Label: "Accuracy (%)". Scale: Linear, ranging from 80 to 88, with major gridlines at intervals of 2%.
* **X-Axis:** Label: "Generation Rollouts". Scale: Logarithmic (base 2), with markers at 2⁰ (1), 2¹ (2), 2² (4), 2³ (8), 2⁴ (16), 2⁵ (32), and 2⁶ (64).
* **Legend:** Positioned in the top-left corner of the plot area. It contains five entries, each with a colored line, marker symbol, and technique name:
1. **Majority Vote:** Blue line with circular markers.
2. **PRM-Last-Max:** Orange line with circular markers.
3. **PRM-Last-Sum:** Green line with circular markers.
4. **PRM-Min-Max:** Red line with circular markers.
5. **PRM-Min-Sum:** Purple line with circular markers.
### Detailed Analysis
The chart plots accuracy (%) against the number of generation rollouts for each technique. All techniques start at approximately the same accuracy (~80.2%) at 2⁰ rollouts.
**Data Series Trends and Approximate Values:**
1. **PRM-Min-Sum (Purple Line):**
* **Trend:** Shows the strongest and most consistent upward slope, achieving the highest final accuracy.
* **Values:** 2⁰: ~80.2%, 2¹: ~81.8%, 2²: ~84.5%, 2³: ~85.6%, 2⁴: ~86.0%, 2⁵: ~86.2%, 2⁶: ~86.4%.
2. **PRM-Last-Sum (Green Line):**
* **Trend:** Follows a very similar strong upward trajectory to PRM-Min-Sum, closely trailing it.
* **Values:** 2⁰: ~80.2%, 2¹: ~81.7%, 2²: ~84.4%, 2³: ~85.2%, 2⁴: ~85.8%, 2⁵: ~86.1%, 2⁶: ~86.2%.
3. **Majority Vote (Blue Line):**
* **Trend:** Shows a strong upward trend initially, but its rate of improvement slows after 2³ rollouts, and it plateaus or slightly declines at the highest rollout count.
* **Values:** 2⁰: ~80.2%, 2¹: ~81.4%, 2²: ~83.0%, 2³: ~84.6%, 2⁴: ~85.4%, 2⁵: ~86.0%, 2⁶: ~85.8%.
4. **PRM-Min-Max (Red Line):**
* **Trend:** Shows a moderate, steady upward trend, but at a significantly lower slope than the top three techniques.
* **Values:** 2⁰: ~80.2%, 2¹: ~80.6%, 2²: ~81.2%, 2³: ~81.0%, 2⁴: ~81.8%, 2⁵: ~82.0%, 2⁶: ~82.2%.
5. **PRM-Last-Max (Orange Line):**
* **Trend:** Shows the weakest performance. It initially dips, then recovers slowly, remaining the lowest-performing technique throughout.
* **Values:** 2⁰: ~80.2%, 2¹: ~80.0%, 2²: ~80.8%, 2³: ~81.4%, 2⁴: ~81.2%, 2⁵: ~81.6%, 2⁶: ~81.8%.
### Key Observations
* **Performance Tiers:** The techniques clearly separate into two performance tiers. The top tier (PRM-Min-Sum, PRM-Last-Sum, Majority Vote) reaches accuracies between ~85.8% and ~86.4%. The bottom tier (PRM-Min-Max, PRM-Last-Max) only reaches ~81.8% to ~82.2%.
* **Convergence at Start:** All five techniques begin at nearly identical accuracy (~80.2%) with a single rollout (2⁰).
* **Plateauing:** The Majority Vote technique's performance plateaus and slightly decreases between 2⁵ and 2⁶ rollouts, while the top two "Sum" techniques continue a slight upward trend.
* **"Sum" vs. "Max":** Techniques using "Sum" aggregation (PRM-Min-Sum, PRM-Last-Sum) significantly outperform their "Max" counterparts (PRM-Min-Max, PRM-Last-Max).
* **"Min" vs. "Last":** For "Sum" aggregation, "Min" (purple) slightly outperforms "Last" (green). For "Max" aggregation, "Min" (red) consistently outperforms "Last" (orange).
### Interpretation
The data suggests that the choice of aggregation technique is critical for scaling performance with increased computational budget (rollouts). The "Sum" based aggregation methods (PRM-Min-Sum and PRM-Last-Sum) demonstrate superior scalability, effectively converting additional rollouts into higher accuracy. The "Majority Vote" method is also effective but shows signs of diminishing returns at the highest rollout count tested.
The significant performance gap between "Sum" and "Max" techniques indicates that simply taking the maximum value from rollouts is a much less effective strategy than summing them, likely because summation better aggregates information across multiple attempts. The "Min" variants generally outperforming "Last" variants suggests that considering the minimum value (perhaps representing a worst-case or most conservative estimate) across rollouts is more robust than only considering the final one.
The chart provides strong evidence that for this task, **PRM-Min-Sum** is the most effective and scalable aggregation technique among those compared, with **PRM-Last-Sum** as a very close second. The **PRM-Last-Max** technique is the least effective.
</details>
Figure 5: Performance of different PRM aggregation techniques evaluated on MATH500.
Table 2: Comparison of our policy model and PRM with PRIME [11] trained in the same setting. AIRL-S consistently outperforms PRIME both with and without test-time search. AIME, MATH, and BoN denote AIME2024, MATH500, and Best-of-N, respectively.
| Method | AIME | AMC | MATH | Avg. |
| --- | --- | --- | --- | --- |
| PRIME | 16.7 | 55.4 | 78.4 | 50.2 |
| PRIME w. BoN | 20.0 | 62.7 | 84.2 | 55.6 |
| AIRL-S | 26.7 | 59.0 | 80.2 | 55.3 |
| AIRL-S w. BoN | 30.0 | 67.5 | 86.4 | 61.3 |
#### Comparison with Concurrent Implicit PRM Methods.
We compare AIRL-S with PRIME [11], which trains a policy-dependent dense reward function based on [79]. Both methods are trained on Qwen2.5-7B-Instruct under identical settings. At inference, we apply Best-of-N search with 64 rollouts using each method's PRM. Results on AIME2024, AMC, and MATH500 in Table 2 show that our policy-and-PRM combination outperforms PRIME, further confirming that AIRL-S unifies RL-based and search-based TTS.
## 5 Conclusion
In this paper, we present AIRL-S, a framework that unifies RL-based and search-based TTS for LLMs. AIRL-S integrates AIRL with GRPO to learn a step-wise PRM from reference rollouts without any labeled data. This PRM guides both RL training, improving stability and sample efficiency, and inference-time search, extending reasoning chains and mitigating reward hacking. We evaluate AIRL-S on eight benchmarks covering mathematics, scientific reasoning, and code generation. The results show that AIRL-S significantly improves the performance of base models and outperforms existing RL-trained and imitation-based methods. The PRM generalizes across different LLMs, datasets, and search algorithms, consistently enhancing test-time search performance. These findings validate that unifying RL-based and search-based TTS offers an efficient and robust approach for complex reasoning.
## References
- [1] P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. Cited by: §2.
- [2] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. (2024) Phi-4 technical report. arXiv preprint arXiv:2412.08905. Cited by: Appendix D, §4.1.
- [3] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. (2022) Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862. Cited by: §1.
- [4] B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini (2024) Large language monkeys: scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787. Cited by: §2.
- [5] D. Carraro and D. Bridge (2024) Enhancing recommendation diversity by re-ranking with large language models. ACM Transactions on Recommender Systems. Cited by: §2.
- [6] A. J. Chan, H. Sun, S. Holt, and M. Van Der Schaar (2024) Dense reward for free in reinforcement learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning, pp. 6136–6154. Cited by: §1.
- [7] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Pinto, J. Kaplan, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. External Links: Link Cited by: Table 5, §4.1.
- [8] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei (2017) Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 4299–4307. External Links: Link Cited by: §2.
- [9] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §1.
- [10] T. Coignion, C. Quinton, and R. Rouvoy (2024) A performance study of llm-generated code on leetcode. arXiv preprint arXiv:2407.21579. External Links: Link Cited by: Table 5, §4.1.
- [11] G. Cui, L. Yuan, Z. Wang, et al. (2025) Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456. External Links: Link Cited by: Table 3, §2, §4.1, §4.3, Table 2.
- [12] DeepSeek-AI (2024) DeepSeek-math-7b-rl: reinforcement learning enhanced math reasoning model. Note: GitHub repository External Links: Link Cited by: §1, §1, §2.
- [13] C. Finn, P. Christiano, P. Abbeel, and S. Levine (2016) A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852. Cited by: §1, §2, §3.3.
- [14] J. Fu, K. Luo, and S. Levine (2018) Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §2, §2, §3.3, §3.4.
- [15] B. Gao, F. Song, Z. Yang, Z. Cai, Y. Miao, Q. Dong, L. Li, C. Ma, L. Chen, R. Xu, Z. Tang, B. Wang, D. Zan, S. Quan, G. Zhang, L. Sha, Y. Zhang, X. Ren, T. Liu, and B. Chang (2024) Omni-math: a universal olympiad level mathematic benchmark for large language models. External Links: 2410.07985, Link Cited by: §B.3, Table 4.
- [16] L. Gao, J. Schulman, and J. Hilton (2023) Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866. Cited by: §1.
- [17] D. Garg, S. Chakraborty, C. Cundy, J. Song, and S. Ermon (2021) IQ-learn: inverse soft-$q$ learning for imitation. In NeurIPS 2021 Spotlight, External Links: Link Cited by: §2.
- [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2020) Generative adversarial networks. Communications of the ACM 63 (11), pp. 139–144. Cited by: §2, §3.3.
- [19] Google Research, J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. (2021) MBPP-sanitized. Note: Hugging Face DatasetsSanitized split of the MBPP benchmark External Links: Link Cited by: Table 5, §4.1.
- [20] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: Table 3, §1, §1, §1, §2, §3.4, §4.1.
- [21] J. Hejna and D. Sadigh (2023) Inverse preference learning: preference-based rl without a reward function. Vol. 36, pp. 18806–18827. Cited by: §2.
- [22] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt (2021) Measuring coding challenge competence with APPS. CoRR abs/2105.09938. External Links: Link, 2105.09938 Cited by: §B.3, §B.4, Table 4.
- [23] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. In Advances in Neural Information Processing Systems (NeurIPS), External Links: Link Cited by: Table 5, §4.1.
- [24] B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. (2024) Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186. Cited by: Table 3, §4.1.
- [25] R. Irvine, D. Boubert, V. Raina, A. Liusie, Z. Zhu, V. Mudupalli, A. Korshuk, Z. Liu, F. Cremer, V. Assassi, et al. (2023) Rewarding chatbots for real-world engagement with millions of users. arXiv preprint arXiv:2303.06135. Cited by: §2.
- [26] I. Jain, F. Zhang, W. Liu, et al. (2025) LiveCodeBench: holistic and contamination-free evaluation of llms for code. In International Conference on Learning Representations (ICLR) 2025, External Links: Link Cited by: Table 5, §4.1.
- [27] C. Jin, T. Che, H. Peng, Y. Li, D. Metaxas, and M. Pavone (2024) Learning from teaching regularization: generalizable correlations should be easy to imitate. Advances in Neural Information Processing Systems 37, pp. 966–994. Cited by: §1.
- [28] C. Jin, H. Peng, Q. Zhang, Y. Tang, D. N. Metaxas, and T. Che (2025) Two heads are better than one: test-time scaling of multi-agent collaborative reasoning. arXiv preprint arXiv:2504.09772. Cited by: §1.
- [29] M. Jin, W. Luo, S. Cheng, X. Wang, W. Hua, R. Tang, W. Y. Wang, and Y. Zhang (2024) Disentangling memory and reasoning ability in large language models. arXiv preprint arXiv:2411.13504. Cited by: §2.
- [30] M. Jin, Q. Yu, D. Shu, H. Zhao, W. Hua, Y. Meng, Y. Zhang, and M. Du (2024) The impact of reasoning step length on large language models. In Findings of the Association for Computational Linguistics ACL 2024, pp. 1830–1842. Cited by: §1.
- [31] L. Kocsis and C. Szepesvári (2006) Bandit based monte-carlo planning. In European conference on machine learning, pp. 282–293. Cited by: 3rd item.
- [32] J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y. Fleureau, G. Lample, and S. Polu (2024) NuminaMath. Numina. Note: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT Cited by: §B.3, §B.4, Table 4.
- [33] J. LI, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. C. Huang, K. Rasul, L. Yu, A. Jiang, Z. Shen, Z. Qin, B. Dong, L. Zhou, Y. Fleureau, G. Lample, and S. Polu (2024) NuminaMath. Numina. Note: [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf) Cited by: §4.1.
- [34] R. Li, J. Fu, B. Zhang, T. Huang, Z. Sun, C. Lyu, G. Liu, Z. Jin, and G. Li (2023) TACO: topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852. Cited by: §B.3, Table 4.
- [35] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J. Lou, and W. Chen (2023) Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5315–5333. Cited by: §1.
- [36] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals (2022) Competition-level code generation with alphacode. Science 378 (6624), pp. 1092–1097. External Links: Document, Link Cited by: §B.3, §B.4, Table 4.
- [37] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In The Twelfth International Conference on Learning Representations, Cited by: §1, §1.
- [38] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437. Cited by: Table 3, §4.1.
- [39] F. Liu et al. (2020) Learning to summarize from human feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 583–592. Cited by: item (a).
- [40] S. Liu and M. Zhu (2022) Distributed inverse constrained reinforcement learning for multi-agent systems. Advances in Neural Information Processing Systems 35, pp. 33444–33456. Cited by: §2.
- [41] S. Liu and M. Zhu (2024) In-trajectory inverse reinforcement learning: learn incrementally before an ongoing trajectory terminates. Advances in Neural Information Processing Systems 37, pp. 117164–117209. Cited by: §2.
- [42] Y. Liu, H. Gao, S. Zhai, X. Jun, T. Wu, Z. Xue, Y. Chen, K. Kawaguchi, J. Zhang, and B. Hooi (2025) GuardReasoner: towards reasoning-based llm safeguards. arXiv preprint arXiv:2501.18492. Cited by: §2.
- [43] Y. Liu, S. Zhai, M. Du, Y. Chen, T. Cao, H. Gao, C. Wang, X. Li, K. Wang, J. Fang, J. Zhang, and B. Hooi (2025) GuardReasoner-vl: safeguarding vlms via reinforced reasoning. arXiv preprint arXiv:2505.11049. Cited by: §2.
- [44] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023) Self-refine: iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36, pp. 46534–46594. Cited by: §2.
- [45] Mathematical Association of America (2024) 2024 american invitational mathematics examination (aime i). Technical report Mathematical Association of America. External Links: Link Cited by: Table 5, §4.1.
- [46] Mathematical Association of America (2025) American Mathematics Competitions (AMC). Note: https://www.maa.org/student-programs/amc Accessed: 2025-05-05 Cited by: Table 5, §4.1.
- [47] Y. Miao, S. Zhang, L. Ding, R. Bao, L. Zhang, and D. Tao (2024) Inform: mitigating reward hacking in rlhf via information-theoretic reward modeling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, Cited by: §1.
- [48] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto (2025) S1: simple test-time scaling. External Links: 2501.19393, Link Cited by: §B.3, Table 4.
- [49] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto (2025) S1: simple test-time scaling. arXiv preprint arXiv:2501.19393. Cited by: Table 3, §1, §2, §2, §4.1.
- [50] W. Nie, N. Narodytska, and A. Patel (2018) Relgan: relational generative adversarial networks for text generation. In International conference on learning representations, Cited by: item (a), §3.3.
- [51] OpenAI (2024-11-20) GPT-4o system card. Technical report OpenAI. External Links: Link Cited by: Table 3, §4.1.
- [52] OpenAI (2024-09) Learning to reason with llms. External Links: Link Cited by: §1.
- [53] OpenAI (2024-12) OpenAI o1 system card. Technical report OpenAI. External Links: Link Cited by: §2.
- [54] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelvin, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe (2022) Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. External Links: Link Cited by: §2.
- [55] S. Park, X. Liu, Y. Gong, and E. Choi (2024) Ensembling large language models with process reward-guided tree search for better complex reasoning. Cited by: §2.
- [56] Qwen Team, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, et al. (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. External Links: Link Cited by: Table 3, §4.1.
- [57] A. Rashid, R. Wu, J. Grosse, A. Kristiadi, and P. Poupart (2024) A critical look at tokenwise reward-guided text generation. arXiv preprint arXiv:2406.07780. Cited by: item (a).
- [58] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich (2006) Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pp. 729–736. Cited by: §2.
- [59] D. Rein, B. Li, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) GPQA: a graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022. External Links: Link Cited by: Table 5, §4.1.
- [60] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: Table 3, §1, §4.1.
- [61] C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Free process rewards without process labels. arXiv preprint arXiv:2412.01981. External Links: Link Cited by: §2.
- [62] C. V. Snell, J. Lee, K. Xu, and A. Kumar (2025) Scaling test-time compute optimally can be more effective than scaling LLM parameters. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §A.1, §1, §2, §3.5, §3.5.
- [63] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger (2012) Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE transactions on information theory 58 (5), pp. 3250–3265. Cited by: 3rd item.
- [64] M. Studio (2024) Codeforces–python submissions. Note: https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions Cited by: §B.3, §B.4, Table 4.
- [65] G. Swamy, D. Wu, S. Choudhury, D. Bagnell, and S. Wu (2023) Inverse reinforcement learning without reinforcement learning. pp. 33299–33318. Cited by: §2.
- [66] J. Tan, K. Zhao, R. Li, J. X. Yu, C. Piao, H. Cheng, H. Meng, D. Zhao, and Y. Rong (2025) Can large language models be query optimizer for relational databases?. arXiv preprint arXiv:2502.05562. Cited by: §2.
- [67] G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al. (2025) Gemma 3 technical report. arXiv preprint arXiv:2503.19786. Cited by: Appendix D.
- [68] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins (2022) Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275. Cited by: §1.
- [69] P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui (2024) Math-shepherd: verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439. Cited by: §1, §1, §2, §4.1.
- [70] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2022) Self-consistency improves chain of thought reasoning in language models. Cited by: §2, §4.2.
- [71] Z. Wang, X. Liu, H. Chen, S. Ermon, et al. (2023) Math-shepherd: verify and reinforce llms step-by-step. Note: arXiv preprint arXiv:2312.08935 External Links: Link Cited by: Table 3, §4.1.
- [72] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, pp. 229–256. Cited by: §3.4.
- [73] Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang (2024) Inference scaling laws: an empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724. Cited by: §2.
- [74] Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang (2025) Inference scaling laws: an empirical analysis of compute-optimal inference for LLM problem-solving. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §1.
- [75] Y. Xie, A. Goyal, W. Zheng, M. Kan, T. P. Lillicrap, K. Kawaguchi, and M. Shieh (2024) Monte carlo tree search boosts reasoning via iterative preference learning. arXiv preprint arXiv:2405.00451. Cited by: §2.
- [76] Y. Xie, K. Kawaguchi, Y. Zhao, J. X. Zhao, M. Kan, J. He, and M. Xie (2023) Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems 36, pp. 41618–41650. Cited by: §1, §2, §3.5.
- [77] W. Xiong, H. Zhang, N. Jiang, and T. Zhang (2024) An implementation of generative prm. GitHub. Note: https://github.com/RLHFlow/RLHF-Reward-Modeling Cited by: §4.1.
- [78] F. Yu, A. Gao, and B. Wang (2023) Outcome-supervised verifiers for planning in mathematical reasoning. arXiv preprint arXiv:2311.09724 2 (6). Cited by: §1.
- [79] L. Yuan, W. Li, H. Chen, G. Cui, N. Ding, K. Zhang, B. Zhou, Z. Liu, and H. Peng (2024) Free process rewards without process labels. arXiv preprint arXiv:2412.01981. Cited by: §1, §4.1, §4.3.
- [80] S. Yuan, Z. Chen, Z. Xi, J. Ye, Z. Du, and J. Chen (2025) Agent-r: training language model agents to reflect via iterative self-training. arXiv preprint arXiv:2501.11425. Cited by: §2.
- [81] X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen (2023) Mammoth: building math generalist models through hybrid instruction tuning. Cited by: §2.
- [82] D. Zhang, J. Wu, J. Lei, T. Che, J. Li, T. Xie, X. Huang, S. Zhang, M. Pavone, Y. Li, et al. (2024) Llama-berry: pairwise optimization for o1-like olympiad-level mathematical reasoning. arXiv preprint arXiv:2410.02884. Cited by: §1, §2.
- [83] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum, and C. Gan (2023) Planning with large language models for code generation. arXiv preprint arXiv:2303.05510. Cited by: §1, §2.
- [84] X. Zhang, C. Du, T. Pang, Q. Liu, W. Gao, and M. Lin (2024) Chain of preference optimization: improving chain-of-thought reasoning in llms. Advances in Neural Information Processing Systems 37, pp. 333–356. Cited by: §2.
- [85] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y. Wang (2024) Language agent tree search unifies reasoning, acting, and planning in language models. In Proceedings of the 41st International Conference on Machine Learning, pp. 62138–62160. Cited by: §1, §2, §3.5.
- [86] B. D. Ziebart, J. A. Bagnell, and A. K. Dey (2010) Modeling interaction via the principle of maximum causal entropy. Cited by: §2.
- [87] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI ’08), pp. 1433–1438. External Links: Link Cited by: §2.
## Appendix A Details for Search-based TTS
### A.1 PRM Aggregation Techniques
#### Answer Aggregation.
After search, we obtain $N$ candidate reasoning chains for each question $q$ . Following Snell et al. [62], we compute a step-wise aggregation score for each chain by either taking the minimum PRM reward across its steps or using the reward of its final step. For inter-answer aggregation, we either select the answer whose chain has the highest step-wise score (Maximum Score) or sum the step-wise scores of all chains that produce the same answer and choose the answer with the highest total (Sum Score).
#### Step-wise Aggregation.
For each reasoning chain, we compute its score by one of two methods:
- Minimum Step Reward: Use the minimum PRM reward among all steps in the chain.
- Last Step Reward: Use the PRM reward at the chain’s final step.
#### Inter-answer Aggregation.
After computing step-wise scores, we aggregate across chains that yield the same final answer using one of two methods:
- Maximum Score: Select the answer of the chain with the highest step-wise score.
- Sum Score: Sum the step-wise scores of all chains with the same answer and select the answer with the highest total.
These choices yield four PRM aggregation methods:
- PRM-Min-Max: Minimum Step Reward + Maximum Score.
- PRM-Min-Sum: Minimum Step Reward + Sum Score.
- PRM-Last-Max: Last Step Reward + Maximum Score.
- PRM-Last-Sum: Last Step Reward + Sum Score.
In addition, we include Majority Vote, which selects the final answer that appears most frequently among all reasoning chains.
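A minimal sketch of these aggregation rules, assuming search returns each candidate chain as a hypothetical `(answer, step_rewards)` pair:

```python
from collections import Counter, defaultdict

def aggregate(chains, step_agg="min", answer_agg="sum"):
    """Select a final answer from N candidate reasoning chains.

    chains: list of (answer, step_rewards) pairs, where step_rewards
    holds the PRM reward of every step in that chain (assumed format).
    step_agg: "min" (Minimum Step Reward) or "last" (Last Step Reward).
    answer_agg: "max" (Maximum Score) or "sum" (Sum Score).
    """
    # Step-wise aggregation: collapse each chain to a single score.
    scored = [(ans, min(r) if step_agg == "min" else r[-1])
              for ans, r in chains]
    if answer_agg == "max":
        # Maximum Score: take the answer of the best single chain.
        return max(scored, key=lambda x: x[1])[0]
    # Sum Score: PRM-weighted vote over chains sharing an answer.
    totals = defaultdict(float)
    for ans, score in scored:
        totals[ans] += score
    return max(totals, key=totals.get)

def majority_vote(chains):
    """Pick the most frequent final answer, ignoring PRM scores."""
    return Counter(ans for ans, _ in chains).most_common(1)[0][0]
```

Note that PRM-Min-Max and PRM-Min-Sum can disagree: a single high-scoring chain wins under Maximum Score, while several mid-scoring chains supporting the same answer can win under Sum Score.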
### A.2 Test-Time Search Techniques
#### Best-of-N Sampling.
We sample $N$ complete chains of thought (CoTs) from the policy $\pi_{\theta}(\cdot\mid q)$ . By default, we aggregate candidates using the PRM-Min-Sum method described in Section 3.5.
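Best-of-N with PRM-Min-Sum aggregation can be sketched as below; `generate` and `prm_rewards` are placeholder callables standing in for the policy and the learned PRM, not the paper's actual interfaces:

```python
from collections import defaultdict

def best_of_n(question, generate, prm_rewards, n=64):
    """Sample N complete CoTs and pick the answer by PRM-Min-Sum.

    generate(question) -> (answer, steps): one sampled chain of thought.
    prm_rewards(question, steps) -> per-step PRM rewards (assumed API).
    """
    totals = defaultdict(float)
    for _ in range(n):
        answer, steps = generate(question)
        # Minimum Step Reward per chain, summed per distinct answer.
        totals[answer] += min(prm_rewards(question, steps))
    return max(totals, key=totals.get)
```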
#### Beam Search.
We adopt the BFS-V implementation with beam size $N$ and beam width $M$ :
1. Generate $N$ initial proposals from $\pi_{\theta}$ .
1. Score each proposal using the PRM’s step-wise reward.
1. Retain the top $\tfrac{N}{M}$ proposals.
1. For each surviving proposal, sample $M$ next-step candidates and return to Step 2.
Repeat until $N$ complete solutions are generated. By default, we aggregate candidates with PRM-Min-Sum (Section 3.5).
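The four steps above can be sketched as follows; `propose`, `extend`, `is_complete`, and `prm_score` are assumed interfaces to the policy and PRM, not the paper's implementation:

```python
def beam_search(question, propose, extend, is_complete, prm_score, n=8, m=2):
    """BFS-V beam search with beam size n and beam width m.

    propose(question, k)       -> k initial proposals (partial chains).
    extend(chain, k)           -> k one-step continuations of a chain.
    is_complete(chain)         -> True once the chain ends in an answer.
    prm_score(question, chain) -> step-wise PRM reward for the chain.
    """
    beams = propose(question, n)  # Step 1: N initial proposals
    complete = []
    while beams and len(complete) < n:
        # Steps 2-3: score proposals with the PRM, retain the top n/m.
        beams = sorted(beams, key=lambda c: prm_score(question, c),
                       reverse=True)[: max(1, n // m)]
        # Step 4: sample m next-step candidates per surviving proposal.
        next_beams = []
        for chain in beams:
            for cand in extend(chain, m):
                (complete if is_complete(cand) else next_beams).append(cand)
        beams = next_beams
    return complete[:n]
```

The completed chains are then aggregated with PRM-Min-Sum as described in Section A.1.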
#### Monte Carlo Tree Search (MCTS).
MCTS combines exploration via UCT with PRM-based value estimation:
1. Selection. From the root, traverse by choosing the child $n$ that maximizes
$$
\mathrm{UCT}(n)=\mu_{n}+\alpha\sqrt{\frac{\ln N_{\mathrm{parent}}}{N_{n}}},
$$
where $\mu_{n}$ is the cumulative PRM reward of all future steps, $N_{\mathrm{parent}}$ and $N_{n}$ are the visit counts of the parent and current node, respectively, and $\alpha=1.25$ by default.
1. Expansion. At a leaf node, generate $M$ new steps from $\pi_{\theta}$ .
1. Simulation. Roll out each new node to completion (temperature 0) and use the terminal PRM reward as the node’s value.
1. Backpropagation. Update $\mu_{n}$ , $N_{\mathrm{parent}}$ , and $N_{n}$ along the selected path.
Terminate when $N$ complete solutions are obtained. By default, we use PRM-Min-Sum to aggregate candidates (Section 3.5).
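The UCT selection rule in Step 1 follows directly from the formula above; here `children` is a hypothetical list of `(mu, visit_count)` statistics kept for the children of one parent node:

```python
import math

def uct(mu, n_parent, n_node, alpha=1.25):
    """UCT score: exploitation term mu plus an exploration bonus
    alpha * sqrt(ln(N_parent) / N_n) that shrinks with revisits."""
    if n_node == 0:
        return math.inf  # unvisited children are expanded first
    return mu + alpha * math.sqrt(math.log(n_parent) / n_node)

def select_child(children, alpha=1.25):
    """Return the index of the child maximizing UCT.
    children: list of (mu, visit_count) pairs (assumed bookkeeping)."""
    n_parent = sum(n for _, n in children) + 1
    return max(range(len(children)),
               key=lambda i: uct(children[i][0], n_parent,
                                 children[i][1], alpha))
```

A lightly visited child can win selection even with a lower value estimate, which is what drives exploration during the tree search.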
#### Self-Consistency.
We sample $N$ CoTs from $\pi_{\theta}$ without PRM guidance and aggregate answers by majority vote (see Appendix A). Self-Consistency exploits output diversity but does not use any PRM feedback.
## Appendix B Implementation Details
### B.1 Prompts
By default, we use the same CoT prompts for both training and evaluation. The CoT prompts used in our experiments are shown in the table below.
| Task | Prompt |
| --- | --- |
| Mathematical/Scientific Reasoning | Please reason step by step, and put your final answer within \\boxed{Your answer}. |
| Coding | Write Python code to solve the problem. Please reason step by step and present the code in '''python YOUR CODE''' at the end. |
### B.2 Models
To clarify the context of our empirical study, Table 3 summarizes the LLMs used in our experiments, detailing their backbone architectures and key training procedures.
Table 3: Overview of LLMs used in our experiments.
| Model | Backbone | Training Procedure |
| --- | --- | --- |
| Eurus-2-7B-PRIME | Qwen2.5-Math-7B | Two-stage training: (i) supervised fine-tuning on the Eurus instruction corpus; (ii) online RL with PRIME implicit process rewards using GSM8K, MATH, and synthetic chain-of-thought data [11]. |
| DeepSeek-Math-7B-RL | DeepSeekMath-Base-7B | SFT on DeepSeekMath-Instruct (800K math CoT pairs) followed by GRPO reinforcement learning [60]. |
| Math-Shepherd-Mistral-7B-RL | Mistral-7B | PRM-guided step-wise PPO training on MetaMATH and MATH Level-5 problems [71]. |
| Qwen-2.5-Math-7B-Instruct | Qwen-2.5 | Supervised fine-tuning on 200 K math instructions plus RLHF preference optimization [24]. |
| DeepSeek-R1 | - | Four-stage pipeline: pre-training on 1.2T tokens, SFT, RLHF, and DPO. Public endpoint deepseek-ai/DeepSeek-R1. We use this model for inference based on the togather API [20]. |
| DeepSeek-V3 | - | Continued bilingual web-scale pre-training on DeepSeek-R1, followed by alignment stages; version tag 0324. We use this model for inference via the Together API [38]. |
| s1.1-7B | Qwen2.5-7B-Instruct | Few-shot SFT and low-resource RL on $\approx$ 1K GPT-4–generated reasoning tasks (s1K-1.1) [49]. |
| Qwen-2.5-7B-Instruct | - | 2M general instruction pairs plus RLHF; supports 131K-token context [56]. |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | Distillation of DeepSeek-R1 pseudo-labels into Qwen-7B for reduced inference cost. |
| GPT-4o | - | OpenAI RLHF pipeline; accessed via chat/completions with model gpt-4o-2024-11-20. We use this model for inference based on the OpenAI API [51]. |
### B.3 Training Datasets
Our training datasets consist of math questions sampled from NuminaMath [32], openaimath [48], and Omni-MATH [15], and code questions sampled from APPS [22], CodeContests [36], TACO [34], and Codeforces [64]. Detailed information about our training datasets is given in Table 4.
Table 4: Training dataset source in our experiments.
| Dataset | # Questions | Introduction |
| --- | --- | --- |
| Mathematical Reasoning | | |
| NuminaMath | 125K | Olympiad-level competition problems with chain-of-thought solutions; the metric is exact-match accuracy [32]. |
| Openaimath | 12K | Competition-style questions released by OpenAI, spanning algebra, combinatorics, geometry, etc.; metric is exact-match accuracy [48]. |
| Omni-MATH | 4K | Universal Olympiad-level problems covering 8 subjects and 6 difficulty tiers; evaluated via exact-match correctness [15]. |
| Coding | | |
| APPS | 3K | Real-world programming challenges of varying difficulty; graded by hidden unit tests— metric is Pass@1 [22]. |
| CodeContests | 3K | Codeforces-style contest tasks curated for AlphaCode; metric is Pass@1 [36]. |
| TACO | 10K | Algorithmic coding problems grouped by topic and difficulty; metric is Pass@1 [34]. |
| Codeforces | 3K | A cleaned subset of Python submissions from Codeforces transformed into standalone tasks; metric is Pass@1 [64]. |
### B.4 Validation Datasets
Our validation dataset is sampled from NuminaMath [32], APPS [22], CodeContests [36], and Codeforces [64] with a total of 2K mathematical and coding questions.
### B.5 Evaluation Tasks
For completeness, Table 5 enumerates the evaluation tasks, covering mathematical/scientific reasoning and code generation, on which we evaluate all systems throughout this paper.
Table 5: Introduction of evaluation tasks used in our experiments.
| Dataset | Introduction |
| --- | --- |
| Mathematical/Scientific Reasoning | |
| AIME2024 | American Invitational Mathematics Examination 2024. Two papers (AIME I & II) with 15 questions each, for a total of 30 short-answer problems. We evaluate exact-match answer accuracy and average across three random seeds for stability [45]. |
| AMC | 83 multiple-choice questions drawn from AMC12 2022 and AMC12 2023. Models must output the correct option; the metric is accuracy [46]. |
| MATH500 | A subset of the MATH benchmark containing 500 competition problems spanning seven topics and five difficulty levels. We report accuracy [23]. |
| GPQA-Diamond | High-quality slice of GPQA with 198 graduate-level multiple-choice questions in biology, chemistry and physics [59]. We report accuracy. |
| Coding Tasks | |
| HumanEval | Classic Python code-generation set with 164 functions, each graded by unit tests. We report Pass@1 [7]. |
| MBPP | Sanitized version of Mostly Basic Python Problems comprising 427 hand-verified tasks; Pass@1 is reported [19]. |
| LeetCode | 180 LeetCode tasks including 45 Easy, 91 Medium and 44 Hard; we convert to Python and grade with official test cases, reporting Pass@1 [10]. |
| LiveCodeBench | Continually updated contest benchmark. We use release v4 with 776 tasks; the metric is Pass@1 [26]. |
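Several of the coding benchmarks above report Pass@1. When more than one sample per task is drawn, the standard unbiased estimator popularized by the HumanEval evaluation protocol can be used; the paper does not state its sampling budget, so the following is a generic sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct.

    Equals 1 - C(n-c, k) / C(n, k): the probability that a uniformly
    random size-k subset of the n samples contains at least one
    correct sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```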
## Appendix C Additional Results
The detailed performance of each LLM-PRM combination using Best-of-N with 64 rollouts on AIME2024, AMC, and MATH500 is shown in Figures 6, 7, and 8, respectively. Our policy model combined with our PRM consistently achieves the best performance across all datasets and LLM-PRM combinations, demonstrating the effectiveness of AIRL-S in unifying RL-based and search-based TTS.
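Best-of-N selection with a PRM amounts to scoring each complete rollout and keeping the argmax; a minimal sketch, where the `prm_score` interface is an assumption:

```python
def best_of_n(rollouts, prm_score):
    """Select the rollout with the highest PRM score.

    rollouts: list of candidate solutions (e.g. 64 sampled CoTs);
    prm_score: callable mapping a rollout to a scalar reward.
    """
    return max(rollouts, key=prm_score)
```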
Accuracy (%) of each LLM-PRM combination on AIME2024:

| LLM | Accuracy@1 | Math-Shepherd-Mistral-7B-PRM | EurusPRM-Stage2 | Llama3.1-8B-PRM-Deepseek-Data | Qwen2.5-AIRL-S-PRM (Ours) |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 16.7 | 20.0 | 23.3 | 23.3 | 23.3 |
| Eurus-2-7B-PRIME | 20.0 | 23.3 | 20.0 | 23.3 | 23.3 |
| Phi-4-14B | 13.3 | 16.7 | 20.0 | 20.0 | 20.0 |
| Qwen2.5-7B-AIRL-S (Ours) | 26.7 | 30.0 | 30.0 | 26.7 | 30.0 |
Figure 6: Performance of applying four PRMs on four different generation LLMs using Best-of-N with 64 rollouts on AIME2024
Accuracy (%) of each LLM-PRM combination on AMC:

| LLM | Accuracy@1 | Math-Shepherd-Mistral-7B-PRM | EurusPRM-Stage2 | Llama3.1-8B-PRM-Deepseek-Data | Qwen2.5-AIRL-S-PRM (Ours) |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 33.7 | 53.0 | 54.2 | 55.4 | 56.6 |
| Eurus-2-7B-PRIME | 56.6 | 61.4 | 63.9 | 63.9 | 65.1 |
| Phi-4-14B | 44.6 | 60.2 | 59.0 | 61.4 | 62.6 |
| Qwen2.5-7B-AIRL-S (Ours) | 59.0 | 63.9 | 65.1 | 65.1 | 67.5 |
Figure 7: Performance of applying four PRMs on four different generation LLMs using Best-of-N with 64 rollouts on AMC
Accuracy (%) of each LLM-PRM combination on MATH500:

| LLM | Accuracy@1 | Math-Shepherd-Mistral-7B-PRM | EurusPRM-Stage2 | Llama3.1-8B-PRM-Deepseek-Data | Qwen2.5-AIRL-S-PRM (Ours) |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 72.0 | 80.4 | 80.2 | 80.8 | 81.6 |
| Eurus-2-7B-PRIME | 79.2 | 84.1 | 84.3 | 84.6 | 84.4 |
| Phi-4-14B | 78.6 | 84.2 | 84.6 | 85.2 | 85.6 |
| Qwen2.5-7B-AIRL-S (Ours) | 80.2 | 85.6 | 85.4 | 86.0 | 86.4 |
Figure 8: Performance of applying four PRMs on four different generation LLMs using Best-of-N with 64 rollouts on MATH500
## Appendix D Limitations
While our experiments demonstrate that AIRL-S effectively unifies RL-based and search-based TTS across multiple test-time search algorithms using a single policy model and PRM, we have not tested its scalability to other base model architectures such as Phi-4 [2] and Gemma-3 [67]. We also have not studied how training dataset size affects AIRL-S's performance. Future work will evaluate AIRL-S on larger language models and more extensive datasets, including tasks with more difficult reasoning questions.
## Appendix E Broader Impacts
Large-scale reward-model-guided reinforcement learning (RL) and test-time search (TTS) can amplify the reasoning power of language models, lowering the barrier to advanced problem solving. Positive impacts include: (i) Scientific acceleration: open-sourced PRMs provide a drop-in verifier for symbolic mathematics and formal-methods research, potentially shortening proof discovery cycles; (ii) Educational equity: a modest-sized model fine-tuned with our AIRL-S pipeline attains near-GPT-4o performance, enabling free or low-cost tutoring tools in under-resourced regions; (iii) Software reliability: the PRM can rank candidate patches during code repair, reducing latent defects in safety-critical systems.
Negative impacts and misuse scenarios. Disinformation: stronger mathematical/verbal reasoning could automate the generation of persuasive but false technical content. Fairness & bias: reward models inherit biases present in demonstrations; these biases can be amplified by MCTS selection. Dual-use: chain-level reward shaping may help adversaries explore large search spaces (e.g., exploit generation, surveillance analytics). Labour displacement: automating competitive-programming-style tasks might displace junior software-engineering roles.