# GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
## Abstract
Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM's generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.
## 1 Introduction
The advent of Multimodal Large Language Models (MLLMs) has marked a significant milestone in artificial intelligence, demonstrating remarkable capabilities in integrating and understanding visual and textual information (Caffagni et al. 2024; Yan et al. 2024c; Yan and Lee 2024; Huo et al. 2024; Zheng et al. 2024b). While these models excel at general-purpose tasks such as image captioning and visual question answering, they often falter when confronted with complex, multi-step reasoning challenges, particularly within specialized domains like mathematics (Wang et al. 2024; Yan et al. 2024a, 2025a; Ahn et al. 2024). Solving multimodal mathematical problems requires not only accurate perception of visual elements (e.g., geometric figures, function graphs) but also a rigorous, step-by-step logical deduction process (Shi et al. 2024; Zhuang et al. 2025; Yan et al. 2025b). Minor errors in either image interpretation or logical inference can cascade, leading to entirely incorrect final answers.
<details>
<summary>figures/comparison.png Details</summary>

### Visual Description
## Diagram: Comparison of Reward Models for Multimodal Math Reasoning
### Overview
The image is a technical diagram comparing three different reward model architectures for evaluating and improving multimodal mathematical reasoning. It is divided into three horizontal panels, labeled (a), (b), and (c), each illustrating a distinct model: the Outcome Reward Model (ORM), the Multimodal Process Reward Model (PRM), and the proposed GM-PRM (Ours). The diagram uses a flowchart style with boxes, arrows, and icons to depict the process flow, inputs, outputs, and feedback mechanisms for each model.
### Components/Axes
The diagram is structured into three distinct sections, each with its own title and flow.
**Panel (a): Outcome Reward Model**
* **Title:** `(a) Outcome Reward Model`
* **Input:** A box labeled `Question` containing a chalkboard icon with the equation `E=mc²`.
* **Process:** An arrow labeled `Output` points to a box labeled `Answer`.
* **Evaluation:** An arrow points from the `Answer` box to a blue robot icon labeled `ORM`.
* **Feedback:** A dashed orange arrow labeled `Reward` points from the `ORM` robot back to the `Answer` box.
* **Annotation:** Red text in the top-right corner states: `ONLY Reward Final Ans`.
**Panel (b): Multimodal Process Reward Model**
* **Title:** `(b) Multimodal Process Reward Model`
* **Input:** A box labeled `Multimodal Math Qns` containing icons of a math problem sheet and a pencil.
* **Process:** A sequence of boxes connected by arrows: `Step 1` -> `Step 2` -> `...` -> `Step T` -> `Answer`.
* **Evaluation:** An arrow points from the `Answer` box to a blue robot icon labeled `PRM`.
* **Feedback:** Multiple dashed orange arrows labeled `Reward` point from the `PRM` robot back to each intermediate step (`Step 1`, `Step 2`, `Step T`) and the final `Answer`.
* **Annotations:** Two gray boxes on the right list limitations: `Limited Explainability` and `No Correction Mechanism`.
**Panel (c): GM-PRM (Ours)**
* **Title:** `(c) GM-PRM (Ours)`
* **Input:** A box labeled `Multimodal Math Qns` (identical to panel b).
* **Process:** A sequence of boxes: `Step 1` -> `Step 2` -> `...` -> `Step T` -> `Answer`.
* **Key Modification:** `Step 2` is highlighted in purple. A purple arrow labeled `1st incorrect step` points from `Step 2` to a subsequent purple arrow labeled `After Correction` that points back into the process flow between `Step 2` and the `...` box.
* **Evaluation:** An arrow points from the `Answer` box to a purple robot icon labeled `GM-PRM`. The arrow is labeled `Refined BoN`.
* **Feedback:** Dashed orange arrows labeled `Reward` point from the `GM-PRM` robot back to each step and the answer.
* **Analysis Module:** A large purple dashed box at the bottom is labeled `Analysis & Judgement`. Inside, three connected purple boxes are labeled: `Step Intent`, `Image Alignment`, and `Reasoning Logic`.
* **Annotation:** A gray box on the right states: `Refined & Corrected Version`.
### Detailed Analysis
The diagram presents a clear evolution of model sophistication:
1. **Outcome Reward Model (ORM):** This is the simplest model. It takes a question, generates an answer, and the ORM provides a reward signal based **only** on the final answer's correctness. There is no evaluation of the reasoning process.
2. **Multimodal Process Reward Model (PRM):** This model introduces step-wise evaluation. It breaks down the solution into discrete steps (`Step 1` to `Step T`). The PRM provides reward signals for each intermediate step and the final answer. However, the diagram notes two critical flaws: it offers **limited explainability** for its rewards and has **no mechanism to correct** an identified incorrect step.
3. **GM-PRM (Ours):** This is the proposed, advanced model. It builds upon the PRM framework but introduces two major enhancements:
* **Correction Mechanism:** It can identify the `1st incorrect step` (shown as `Step 2` in purple) and initiate a correction process (`After Correction` arrow), leading to a `Refined BoN` (Best-of-N) answer.
* **Analysis & Judgement Module:** A dedicated component performs deep analysis based on three criteria: `Step Intent` (understanding the goal of the step), `Image Alignment` (ensuring the step correctly uses visual information), and `Reasoning Logic` (validating the logical flow). This module directly informs the GM-PRM's reward and correction process, addressing the "limited explainability" issue of the standard PRM. The resulting model is described as a `Refined & Corrected Version`.
### Key Observations
* **Visual Coding:** The diagram uses color consistently to denote the proposed model's components. Purple is used for the GM-PRM robot, the incorrect step, the correction flow, and the entire Analysis & Judgement module, visually distinguishing it from the blue ORM/PRM robots and orange reward arrows.
* **Flow Complexity:** The process flow increases in complexity from (a) to (c). Panel (a) is a simple loop, (b) adds parallel reward loops, and (c) adds a corrective branch and a parallel analysis subsystem.
* **Spatial Layout:** The three models are stacked vertically for direct comparison. The "Analysis & Judgement" module in (c) is placed at the bottom, acting as a foundational support for the GM-PRM process above it.
* **Iconography:** Simple icons (chalkboard, math sheet, pencil, robot) are used to represent concepts, making the diagram accessible. The robot's expression changes from a simple smile (ORM, PRM) to a more focused, determined look (GM-PRM), subtly implying greater capability.
### Interpretation
This diagram argues for a paradigm shift in reward modeling for multimodal reasoning tasks. It posits that evaluating only the final outcome (ORM) is insufficient. While evaluating each process step (PRM) is better, it remains a passive evaluator that cannot explain its judgments or intervene when errors occur.
The **GM-PRM** is presented as a solution that transforms the reward model from a passive judge into an active tutor. By integrating a structured `Analysis & Judgement` module that scrutinizes intent, visual grounding, and logic, it gains the explainability the PRM lacks. More importantly, by incorporating a `Correction Mechanism`, it can actively repair flawed reasoning chains. This suggests the GM-PRM is designed not just to score performance, but to **improve** the reasoning process itself, leading to more reliable and refined outputs. The diagram effectively communicates that the key innovation is the closed-loop system of analysis, judgment, reward, and correction.
</details>
Figure 1: Comparison among ORM (a), PRM (b), and our proposed GM-PRM (c) for multimodal math reasoning.
To mitigate these reasoning deficiencies, Process Reward Models (PRMs) have emerged as a promising paradigm (Gao et al. 2024; Zhong et al. 2025). Unlike outcome-based models that only reward correct final answers (shown in Figure 1 (a)), PRMs provide fine-grained supervision by evaluating the correctness of each intermediate step in a reasoning chain (Zheng et al. 2024a; Lambert et al. 2024; Yan et al. 2024b), as shown in Figure 1 (b). This approach has proven effective in the language domain (Zeng et al. 2025; Yuan et al. 2024; Zhang et al. 2025a). However, extending PRMs to the multimodal context presents unique challenges (Miao et al. 2025; Du et al. 2025; Li et al. 2025b; Cao et al. 2025). Existing multimodal PRMs often function as binary classifiers, assigning a simple correct/incorrect label to each step, which offers limited explanatory power. Furthermore, they typically possess the ability to identify errors but lack the mechanism to correct them, leaving the reasoning process fundamentally broken. This limitation constrains their utility, especially within mechanisms like Best-of-N (BoN) sampling, which remain passive selection processes over a static set of potentially flawed solutions.
In this work, we introduce a novel **G**enerative **M**ultimodal **P**rocess **R**eward **M**odel (GM-PRM) to address these limitations, as illustrated in Figure 1 (c). Our model transcends the role of a simple verifier and acts as an active reasoning collaborator. Instead of merely outputting a scalar score, our GM-PRM leverages its generative capabilities to produce a detailed, interpretable analysis for each reasoning step. This analysis deconstructs the step into three critical aspects: its fundamental step intent, the correctness of its image alignment, and the soundness of its reasoning logic. More importantly, our model is trained not only to identify errors but also to generate a refined, corrected version of the first incorrect step it encounters.
This unique corrective capability enables us to propose a new test-time inference strategy: the Refined Best-of-N (Refined-BoN) process. This dynamic framework moves beyond passive selection by actively enhancing the quality of candidate solutions. When our GM-PRM identifies a flawed step within a generated solution, it intervenes by providing a corrected step, which is then used to guide the policy model in generating a new, more promising reasoning trajectory. This iterative refinement process significantly improves the diversity and correctness of the solution pool, leading to a substantial boost in the policy model's problem-solving performance. Furthermore, we demonstrate that this powerful capability can be achieved with remarkable data efficiency, requiring a significantly smaller training dataset than previous approaches. Our primary contributions are as follows:
- We develop a generative multimodal PRM called GM-PRM that provides fine-grained, interpretable feedback for mathematical reasoning. It analyzes each step's purpose, image alignment, and logical validity, moving beyond simple binary classification to offer deeper insight into the model's thought process.
- We introduce a novel Refined-BoN framework that leverages the PRM's generative power to actively correct errors at test time. It enhances the policy model's ability to find correct solutions by iteratively improving flawed reasoning paths.
- We demonstrate the effectiveness and data efficiency of GM-PRM, achieving state-of-the-art results on multiple multimodal math benchmarks. Our approach requires only a 20K-sample dataset, highlighting the quality of our data curation and the power of our generative supervision strategy.
## 2 Related Work
#### Process Reward Models (PRMs)
PRMs have been proposed to evaluate model reasoning at a fine-grained, step-wise level. In practice, annotating and obtaining a high-quality training dataset incurs a high cost. PRM800K (Lightman et al. 2023) is the first process supervision dataset completely annotated by humans. To mitigate annotation costs, Math-Shepherd (Wang et al. 2023) proposes Monte Carlo (MC) estimation, while OmegaPRM (Luo et al. 2024) leverages Monte Carlo Tree Search (MCTS) to automatically evaluate each reasoning step; both utilize the generation capabilities of Large Language Models (LLMs). Subsequent research has enhanced the effectiveness of PRMs through various methods, including VersaPRM (Zeng et al. 2025), Implicit PRM (Yuan et al. 2024), OpenPRM (Zhang et al. 2025a), PQM (Li and Li 2024), PAV (Setlur et al. 2024), and others. In addition, GenRM (Zhao et al. 2025) utilizes the generation ability of reward models to analyze each reasoning step, obtaining a score for each step from the probability of a special evaluation token. Furthermore, GenPRM (Zhao et al. 2025), ThinkPRM (Khalifa et al. 2025), and R-PRM (She et al. 2025) extend this generation-based step analysis to PRMs. There are also many benchmark studies for PRMs, such as ProcessBench (Zheng et al. 2024a), PRMBench (Song et al. 2025), and Socratic-PRMBench (Li et al. 2025a).
#### Multimodal PRMs
Following the progress on language-modality PRMs, research has also begun to shift towards multimodal tasks. M-STAR (Liu et al. 2024a) proposes and implements a multimodal PRM for multimodal problems. URSA (Luo et al. 2025) constructs a dataset by inserting errors and utilizes it to train a multimodal PRM. VisualPRM (Wang et al. 2025b) not only uses MC estimation to construct the multimodal VisualPRM400K training dataset, but also proposes a benchmark for multimodal PRMs called VisualProcessBench, which is entirely annotated by humans. Athena-PRM (Wang et al. 2025a) uses prediction consistency between strong and weak completers to enhance the quality of automatically annotated data based on MC estimation, and improves its multimodal PRM via ORM initialization and negative-data up-sampling. Moreover, PRM-BAS (Hu et al. 2025), MM-PRM (Du et al. 2025), and DreamPRM (Cao et al. 2025) also improve the capability of multimodal PRMs. Although several studies have explored multimodal PRMs, applying them effectively to multimodal tasks still presents challenges, such as insufficient interpretability of the labels assigned to each reasoning step and the inability to correct identified erroneous steps. In our work, we introduce a generative multimodal PRM, GM-PRM, to address these problems.
<details>
<summary>figures/Case.png Details</summary>

### Visual Description
## [Diagram]: Geometry Translation Problem with Solution Analysis
### Overview
The image is an educational diagram presenting a geometry problem about translating a triangle on a coordinate plane. It includes the problem statement, a visual representation, and a comparative analysis of an incorrect ("Original") and a corrected ("Refined") solution path. The diagram is structured to highlight a common error in coordinate identification and the subsequent logical correction.
### Components/Axes
The image is segmented into several distinct regions:
1. **Header Section (Top):**
* **Left Box (Gray):** Labeled "Question (Visual Part)". Contains a coordinate grid.
* **Right Box (Gray):** Labeled "Question (Textual Part)". Contains the problem text.
2. **Problem Statement (Textual Part):**
* **Text:** "As shown in the diagram, triangle ABC is first translated 4 units to the right and then 3 units upward. The position of vertex A after the translation, represented as a coordinate pair, is ( )."
* **Multiple Choice Options:** "A. (5, 4); B. (8, 5); C. (7, 7); D. No correct answer"
3. **Visual Part (Coordinate Grid):**
* **Axes:** X-axis labeled from 0 to 10. Y-axis labeled from 0 to 8.
* **Triangle ABC:** Plotted on the grid.
* **Vertex A:** Located at coordinate **(3, 4)**.
* **Vertex B:** Located at coordinate **(1, 1)**.
* **Vertex C:** Located at coordinate **(4, 2)**.
* The triangle is drawn with solid lines connecting these points.
4. **Solution Comparison Section (Middle):**
* **Left Column (Blue Header):** "Original Solution".
* **Right Column (Purple Header):** "Refined Solution".
* **Annotations:** Green checkmarks (✓) indicate correct steps. A red cross (X) indicates an incorrect step.
5. **GM-PRM Analysis Box (Bottom Left, Purple Header):**
* **Title:** "GM-PRM: Analysis & Judgement".
* Contains three analysis points in red italic text.
### Detailed Analysis / Content Details
**Original Solution (Incorrect Path):**
* **Step 1 (✓):** "To solve this problem, we need to understand the concept of translation in geometry. Translation means moving a figure without rotating or changing its size. It involves adding a fixed amount (in this case, 4 units horizontally and 3 units vertically) to each coordinate of the vertices."
* **Step 2 (X):** "Let's analyze the coordinates of vertex A before the translations: **Vertex A is at the origin, so its initial coordinates are (0, 0).**" (This text is highlighted in red, indicating the error).
**GM-PRM: Analysis & Judgement (Error Diagnosis):**
* **Step intent analysis:** "This step identifies the initial coordinates of vertex A before any translation occurs."
* **Image alignment analysis:** "The image shows vertex A located at the point (3, 4), not (0, 0)."
* **Reasoning logic analysis:** "..." (The text trails off with an ellipsis).
**Refined Solution (Corrected Path):**
* **Step 2 (✓):** "Let's analyze the coordinates of vertex A before the translations: Vertex A is initially at the point (3, 4) according to the image."
* **Step 3 (✓):** "Apply the first translation (4 units to the right):\n- To translate a point (x, y) 4 units to the right, we add 4 to the x-coordinate. Therefore, (3, 4) becomes (3 + 4, 4) = (7, 4)."
* **Ellipsis (...):** Indicates omitted intermediate steps.
* **Step 6 (✓):** "Final answer: C" (This corresponds to option C. (7, 7)).
### Key Observations
1. **Critical Error Source:** The primary mistake in the original solution was misidentifying the initial coordinates of vertex A from the graph, assuming it was at the origin (0,0) instead of its actual plotted position (3,4).
2. **Visual-Textual Alignment:** The diagram emphasizes the necessity of cross-referencing textual problem statements with visual data (the graph) to obtain correct initial conditions.
3. **Pedagogical Structure:** The layout is designed for teaching, using color-coded headers (blue vs. purple), symbolic markers (✓/X), and a dedicated analysis box to dissect the reasoning flaw.
4. **Implied Final Calculation:** Although steps 4 and 5 are omitted (indicated by "..."), the final answer "C" implies the complete calculation: Starting at (3,4), translate right 4 → (7,4), then translate up 3 → (7,7).
### Interpretation
This diagram serves as a case study in mathematical problem-solving, specifically targeting the common pitfall of not accurately extracting information from a provided visual. It demonstrates a **Peircean investigative process**:
* **Abduction:** The original solver made an incorrect abductive leap, assuming vertex A was at the origin without verifying against the graph.
* **Deduction:** The refined solution correctly deduces the translation rule (add to coordinates) but only after establishing the correct premise (A is at (3,4)).
* **Induction:** The GM-PRM analysis inductively identifies the class of error ("Image alignment analysis") to prevent similar mistakes.
The underlying message is that in technical and mathematical reasoning, **the integrity of the initial data point is paramount**. A flawless logical process (Step 1 of the original solution) built upon a faulty premise (Step 2's error) will inevitably produce an incorrect result. The "Refined Solution" corrects the foundational premise, allowing the subsequent logical steps to yield the valid answer, (7,7). The diagram effectively argues for meticulousness in the initial "reading" of a problem, whether textual or visual.
</details>
Figure 2: The illustration of a representative example before and after applying GM-PRM. In particular, GM-PRM first judges the steps of the original solution generated by the policy model. Subsequently, GM-PRM finds that the second step is incorrect and refines the second step to generate the correct version. The correct steps are input to the policy model to generate the refined solution, and finally the correct answer is obtained.
## 3 Methodology
In this section, we first describe how PRMs and generative PRMs, combined with the BoN method, improve the performance of policy models on mathematical problems in Section 3.1. Then, we introduce our process for implementing a multimodal generative PRM, including data construction and model training, in Section 3.2. Finally, we propose a novel Refined-BoN framework for PRMs to push their performance beyond the traditional BoN method in Section 3.3.
### 3.1 PRMs for Mathematical Problem
In this section, we present the implementation methods of PRM and GM-PRM, and provide formal explanations of their usage through mathematical notation.
#### Problem and Reasoning Steps Generation
Let $Q$ denote a mathematical problem. First, an LLM $\pi$ is employed to solve the mathematical problem $Q$. To facilitate reasoning, the problem is combined with a prompt $P$, which includes specific instructions guiding the generation of a step-by-step reasoning process and a final answer. This composite input is then fed into the LLM. When generating a response, $\pi$ produces a sequence of reasoning steps, denoted as $R=\{r_1,r_2,\dots,r_T\}$, where $T$ represents the total number of reasoning steps for the given mathematical problem. The above process can be expressed as follows:
$$
R=\pi(Q\parallel P), \tag{1}
$$
where $\parallel$ denotes the concatenation of the problem $Q$ and the prompt $P$, and $\pi(\cdot)$ represents the inference of the LLM.
#### PRM
A single instance in a training dataset $D$ to train a PRM comprises three components: (1) a problem statement, (2) a generated response consisting of multiple inference steps, and (3) a corresponding set of binary labels, each taking a value of either 0 or 1, indicating whether the associated reasoning step is incorrect or correct, respectively.
During training, the PRM is optimized with a cross-entropy loss, supervised to align its predictions with the ground-truth labels. Once trained, the PRM can process new reasoning steps generated by the LLM for a given mathematical problem, assigning a scalar score to each individual reasoning step that reflects the model's confidence in the correctness of that step:
$$
f_{\text{PRM}}:(Q,R)\mapsto(s_1,s_2,\dots,s_T), \tag{2}
$$
where $f_{\text{PRM}}(\cdot)$ represents the inference of the PRM, $s_i\in[0,1]$ denotes the confidence score assigned to the $i$-th reasoning step $r_i$, and $T$ denotes the number of reasoning steps.
For a generative PRM, the binary labels in the training dataset are replaced with textual analyses and judgments, each formulated as a textual choice such as "incorrect" or "correct". During inference, the generative PRM likewise generates textual critiques and judgments for each step.
#### GM-PRM
By extending generative PRMs from the textual modality to a multimodal setting, we introduce GM-PRM. In this setting, mathematical problems are represented using both textual and visual information. The input to the policy model comprises the image of the problem, its textual description and task-specific instructions, which are processed jointly to generate reasoning steps. Similarly, during both training and inference, it is essential to provide GM-PRM with inputs from both visual and textual modalities, enabling it to perform cross-modal analysis when assigning correctness labels to each reasoning step:
$$
f_{\text{GM-PRM}}:(Q,I,R)\mapsto(c_1,j_1,\dots,c_T,j_T), \tag{3}
$$
where $f_{\text{GM-PRM}}(\cdot)$ represents the inference of GM-PRM, $I$ denotes the image of the mathematical problem, $c_i$ denotes the critique of the $i$-th reasoning step $r_i$, and $j_i$ denotes the textual judgment assigned to the $i$-th reasoning step $r_i$.
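As a concrete sketch of the mapping in Eq. (3), a generative PRM's textual output can be converted back into per-step binary labels by scanning for its judgment markers. The output format and function below are illustrative assumptions of ours, not GM-PRM's exact prompt template:

```python
import re

def parse_step_judgments(critique_text: str) -> list:
    """Map a generative PRM's textual critique to binary step labels.

    Hypothetical format: the model emits one 'Judgment: Correct/Incorrect'
    marker per evaluated step (the j_1, ..., j_T of Eq. (3)).
    """
    labels = []
    for m in re.finditer(r"Judgment:\s*(Correct|Incorrect)",
                         critique_text, re.IGNORECASE):
        labels.append(1 if m.group(1).lower() == "correct" else 0)
    return labels

critique = (
    "Step 1: states the translation rule. Judgment: Correct\n"
    "Step 2: reads vertex A as (0, 0), but the image shows (3, 4). "
    "Judgment: Incorrect\n"
)
print(parse_step_judgments(critique))  # -> [1, 0]
```

In practice the per-step score can also be taken from the probability mass the model assigns to the judgment token, rather than from the decoded text alone.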
### 3.2 Data Construction
In this section, we present our methodology employed to construct the training data for GM-PRM. The process consists of three key stages: (1) the selection of appropriate types and quantities of question data from the VisualPRM400K dataset (Wang et al. 2025b); (2) the generation of textual analysis and judgment data using GPT-4o; and (3) the filtering of the generated data through MC estimation and LLM-as-a-judge techniques to ensure quality and reliability.
#### Data Selection
VisualPRM400K is a large-scale dataset containing approximately 400,000 multimodal process supervision samples. In our work, we select plane geometry- and function-related problems from VisualPRM400K to construct a specialized subset and supplement it with corresponding textual analysis for training GM-PRM. This targeted subset with textual critiques supports the effective training of GM-PRM, yielding strong performance on geometric and function-based mathematical reasoning tasks.
#### Generation of Analysis and Judgment
To obtain textual analyses and judgments, we employ GPT-4o to critique each reasoning step from 4 key aspects: step intent, image alignment, reasoning logic, and step refinement.
The first aspect, step intent, identifies the purpose of each reasoning step. This initial analysis establishes a foundation that allows GM-PRM to interpret and evaluate each reasoning step in context more effectively. Furthermore, this level of understanding facilitates subsequent error detection and correction, thereby enhancing the overall effectiveness of GM-PRM.

The second aspect is image alignment. When MLLMs are used to solve multimodal problems, they often make image-alignment errors, such as misidentifying parallel relationships or incorrectly annotating angles, which lead to flawed solutions. To address this, we employ GPT-4o to produce textual analyses and judgments of image alignment for the reasoning steps, which form part of the training dataset for GM-PRM.

The third aspect, reasoning logic, is indispensable to the step-by-step problem-solving process of MLLMs. However, logical inconsistencies and errors, such as miscalculations and incorrect inferences, significantly impact the correctness of the reasoning steps and the final answer. Therefore, it is crucial for GM-PRM to identify such logical flaws and make accurate judgments regarding the validity of each step's reasoning logic. In our work, we employ GPT-4o to generate textual analyses and judgments of the reasoning logic of each step to form the training dataset. The above process can be formulated as follows:
$$
F:(Q,I,R\parallel P)\mapsto\{SI_i,IA_i,RL_i,FJ_i\}_{i=1}^{t}, \tag{4}
$$
where $F(\cdot)$ represents the inference of GPT-4o, $SI_i$ denotes the textual analysis of step intent for the $i$-th reasoning step, $t$ denotes the index of the first incorrect step or of the last step, $1 \leq t \leq T$, $IA_i=\{IAC_i,IAJ_i\}$ denotes the analysis, containing the critique $IAC_i$ and judgment $IAJ_i$, of image alignment for the $i$-th reasoning step, $RL_i=\{RLC_i,RLJ_i\}$ denotes the analysis, containing the critique $RLC_i$ and judgment $RLJ_i$, of reasoning logic for the $i$-th reasoning step, and $FJ_i$ denotes the final judgment of the $i$-th reasoning step.
Building on the aforementioned three aspects, we further aim for GM-PRM to correct the first identified erroneous step. This information enables GM-PRM to generate corrected reasoning steps that are logically coherent, visually accurate, and semantically aligned with the original step intent. The resulting corrected steps can then be used to construct more diverse and accurate solutions and ultimately produce more reliable final answers. In our work, we employ GPT-4o to generate a corrected version of the first erroneous step in a reasoning process whenever GPT-4o detects that such a step exists:
$$
F:(Q,I,R\parallel P)\mapsto\begin{cases}RS,&\text{if an incorrect step exists},\\
\varnothing,&\text{otherwise},\end{cases} \tag{5}
$$
where $RS$ denotes the refined version of the first erroneous step in the reasoning process.
In summary, we design a structured prompt for GPT-4o to generate comprehensive analysis data across four dimensions based on the provided problems, associated images, and step-by-step solutions:
$$
F:(Q,I,R\parallel P)\mapsto D, \tag{6}
$$
where $D$ denotes the generated training dataset:
$$
D=\left\{\left(\{SI_i^k,IA_i^k,RL_i^k,FJ_i^k\}_{i=1}^{t},RS^k\right)\right\}_{k=1}^{K}, \tag{7}
$$
where $k\in\{1,2,\dots,K\}$ indexes the $k$-th sample in the dataset, and $K$ denotes the number of training instances.
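One training record of $D$ in Eq. (7) can be sketched with hypothetical Python containers; the field names are ours for illustration, since the paper does not specify a serialization format:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class StepAnalysis:
    step_intent: str                  # SI_i
    image_alignment: Tuple[str, str]  # IA_i = (critique IAC_i, judgment IAJ_i)
    reasoning_logic: Tuple[str, str]  # RL_i = (critique RLC_i, judgment RLJ_i)
    final_judgment: str               # FJ_i: "correct" or "incorrect"

@dataclass
class TrainingSample:
    analyses: List[StepAnalysis]      # one entry per step, up to index t
    refined_step: Optional[str]       # RS: correction of first wrong step, if any

# Toy sample mirroring the Figure 2 case.
sample = TrainingSample(
    analyses=[
        StepAnalysis(
            step_intent="Identify the initial coordinates of vertex A.",
            image_alignment=("The image shows A at (3, 4), not (0, 0).",
                             "incorrect"),
            reasoning_logic=("The premise contradicts the figure.",
                             "incorrect"),
            final_judgment="incorrect",
        )
    ],
    refined_step="Vertex A is initially at the point (3, 4) "
                 "according to the image.",
)
```

The analyses stop at the first erroneous step (index $t$), matching the truncation in Eq. (4); `refined_step` is `None` for fully correct solutions.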
#### Data Filtering
The process of constructing training data using GPT-4o can be regarded as an implementation of LLM-as-a-judge methodology. Inspired by the combination of LLM-as-a-judge and MC estimation techniques (Zhang et al. 2025b), we employ the MC estimation technique proposed by Math-Shepherd (Wang et al. 2023) to effectively filter and curate the generated data.
Monte Carlo estimation is a strategy for automated annotation that leverages LLMs or MLLMs to generate multiple subsequent solutions for each step. When applying MC estimation to evaluate a step $r_i$, we use an LLM or an MLLM as a "completer" to finalize multiple subsequent reasoning processes from this step:
$$
f_{\text{completer}}:(Q,I,r_{1:i})\mapsto\{(r_{i+1}^{j},\dots,r_{L_j}^{j},a^{j})\}_{j=1}^{m}, \tag{8}
$$
where $a^{j}$ is the final answer of the $j$-th finalized solution, $L_j$ is its total number of steps, and $m$ is the number of sampled completions.
Within MC estimation, one type of evaluation method is commonly applied: hard estimation. In hard estimation, a step $r_i$ is deemed correct if at least one subsequent solution reaches the correct final answer $a^*$ ; otherwise, it is considered incorrect:
$$
l_i^{HE}=\begin{cases}1,&\exists j,\ a^{j}=a^{*},\\
0,&\text{otherwise},\end{cases} \tag{9}
$$
In our data construction process, we employ hard estimation to label the correctness of individual reasoning steps. We then compare the labels acquired by MC estimation with the judgments generated by GPT-4o, and select as our final training dataset only the samples that receive consistent evaluations from both methods. By integrating the LLM-as-a-judge technique and MC estimation in this way, we further enhance the reliability and quality of the training data, ensuring better performance of GM-PRM.
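The hard-estimation rule of Eq. (9) and the consistency filter can be written as two small pure functions; the interfaces are our simplification (real completions would come from sampling an MLLM):

```python
def hard_estimate(final_answers, gold):
    """Hard estimation (Eq. (9)): a step is labeled correct (1) iff at
    least one of the m completions from that step reaches the gold answer."""
    return 1 if any(a == gold for a in final_answers) else 0

def keep_sample(mc_label, gpt_judgment):
    """Consistency filter: retain a sample only when the MC-estimation
    label agrees with GPT-4o's judgment of the same step."""
    return mc_label == (1 if gpt_judgment == "correct" else 0)

# Toy example: three completions sampled from one step, gold answer "C".
label = hard_estimate(["B", "C", "D"], gold="C")
print(label, keep_sample(label, "correct"))  # -> 1 True
```

Samples where the two signals disagree are discarded, trading dataset size for label reliability.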
### 3.3 Refined-BoN Process
When applying Test-time Scaling (TTS) to LLMs and MLLMs, a widely adopted method is the Best-of-N (BoN) approach. In the BoN process, a policy model generates $N$ candidate solutions, which are then evaluated by reward models or self-consistency to select the optimal one. However, the policy model generates all candidates under identical prompting conditions, so the solutions often lack diversity and may exhibit limited correctness. In our work, we propose a novel Refined-BoN framework utilizing TTS techniques to enhance the diversity and accuracy of generated solutions, thereby improving the reasoning capabilities of policy models.
#### Refined-BoN Method
As shown in Figure 2, the Refined-BoN process first employs an MLLM as the policy model to generate $N/2$ initial solutions to a multimodal problem, which are then evaluated step by step by GM-PRM. The subsequent $N/2$ solutions are generated under varying conditions, informed by this evaluation: if GM-PRM identifies an incorrect reasoning step within a solution, it stops evaluating and refines the first erroneous step by generating a corrected version. This corrected step, together with all previously validated correct steps, is fed back into the policy model to continue the solution generation process. Conversely, if GM-PRM judges all steps in a solution to be correct, the policy model generates a new solution using the same prompt. Through this regeneration mechanism, we obtain the additional $N/2$ solutions, which are then also evaluated by GM-PRM.
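The generation phase above can be sketched as follows; `policy` and `prm` are hypothetical interfaces (a step generator that continues a given prefix, and a verifier returning the index of the first wrong step plus a corrected step), used only to make the control flow concrete.

```python
def refined_bon_generate(policy, prm, problem, n=8):
    """Sketch of Refined-BoN generation under assumed interfaces:
    policy(problem, prefix) -> list of reasoning steps continuing prefix;
    prm(problem, steps) -> (index of first wrong step or None,
                            corrected step or None)."""
    # Phase 1: sample N/2 solutions under the original prompt.
    first_half = [policy(problem, prefix=[]) for _ in range(n // 2)]

    # Phase 2: N/2 more solutions, conditioned on GM-PRM feedback.
    second_half = []
    for steps in first_half:
        wrong_idx, corrected = prm(problem, steps)
        if wrong_idx is None:
            # All steps judged correct: regenerate with the same prompt.
            second_half.append(policy(problem, prefix=[]))
        else:
            # Keep the validated prefix, substitute the corrected step,
            # and let the policy model continue from there.
            prefix = steps[:wrong_idx] + [corrected]
            second_half.append(prefix + policy(problem, prefix=prefix))
    return first_half + second_half
```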
#### Solution Selection
After applying the Refined-BoN process, we obtain $N$ solutions for each problem, each accompanied by step-level correctness judgments. We divide the solutions into two categories: those in which GM-PRM identifies incorrect steps, and those in which GM-PRM judges all steps correct. We take the probability of GM-PRM generating the corresponding "Correct" or "Incorrect" token as the score of each step.
Among the $N$ generated solutions, if any solution has all of its reasoning steps judged correct, we compute the average of its step scores as the overall solution score and select, among such fully correct solutions, the one with the highest average score as the optimal solution. If GM-PRM instead determines that all $N$ solutions contain incorrect steps, we apply the same averaging to every solution and select the one with the highest overall score as the final answer.
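Both selection rules rank candidates by the mean of their step scores, differing only in the pool they rank over, so they can be expressed in one routine; the solution dictionaries below are an assumed representation, not our actual data format.

```python
def select_solution(solutions):
    """Pick the final answer from N scored solutions. Each solution is
    assumed to be a dict with `all_correct` (GM-PRM judged every step
    correct) and `step_scores` (per-step probability of the "Correct"
    token). Prefer fully correct solutions when any exist; otherwise
    rank all solutions. In both cases, rank by mean step score."""
    def mean_score(s):
        return sum(s["step_scores"]) / len(s["step_scores"])

    fully_correct = [s for s in solutions if s["all_correct"]]
    pool = fully_correct if fully_correct else solutions
    return max(pool, key=mean_score)
```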
## 4 Experiments
| Model | MathVista | MathVision | MathVerse | DynaMath | WeMath | Average |
| --- | --- | --- | --- | --- | --- | --- |
| MiniCPM-V2.6-8B | 44.3 | 16.0 | 18.9 | 22.6 | 38.6 | 28.1 |
| + GM-PRM (Ours) | 51.0 | 18.1 | 24.4 | 25.7 | 51.0 | 34.0 |
| Improvements | +6.7 | +2.1 | +5.5 | +3.1 | +12.4 | +5.9 |
| Llama-3.2-11B-Vision | 44.5 | 14.3 | 16.5 | 28.4 | 46.1 | 30.0 |
| + GM-PRM (Ours) | 49.5 | 18.2 | 18.8 | 32.7 | 53.4 | 34.5 |
| Improvements | +5.0 | +3.9 | +2.3 | +4.3 | +7.3 | +4.5 |
| Qwen2.5-VL-7B | 63.2 | 25.1 | 32.8 | 35.0 | 60.6 | 43.3 |
| + GM-PRM (Ours) | 65.0 | 28.2 | 37.4 | 39.2 | 69.0 | 47.8 |
| Improvements | +1.8 | +3.1 | +4.6 | +4.2 | +8.4 | +4.5 |
| InternVL3-8B | 50.6 | 20.3 | 25.0 | 27.0 | 50.9 | 34.8 |
| + GM-PRM (Ours) | 55.7 | 22.2 | 31.7 | 33.4 | 59.2 | 40.4 |
| Improvements | +5.1 | +1.9 | +6.7 | +6.4 | +8.3 | +5.6 |
| InternVL3-38B | 68.3 | 34.9 | 37.8 | 40.1 | 66.4 | 49.5 |
| + GM-PRM (Ours) | 69.9 | 37.0 | 39.1 | 43.1 | 72.9 | 52.4 |
| Improvements | +1.6 | +2.1 | +1.3 | +3.0 | +6.5 | +2.9 |
| InternVL3-78B | 68.0 | 34.6 | 36.0 | 38.1 | 65.7 | 48.5 |
| + GM-PRM (Ours) | 70.7 | 37.1 | 40.6 | 39.9 | 72.2 | 52.1 |
| Improvements | +2.7 | +2.5 | +4.6 | +1.8 | +6.5 | +3.6 |
Table 1: Percentage accuracy scores (%) of multiple MLLMs across five datasets. For each MLLM, the first row shows the baseline, the second shows the final result with GM-PRM, and the third shows the improvement. Only positive improvements are underlined. The best results are highlighted in bold. All values are rounded to one decimal place.
In this section, we introduce our experimental setup to assess GM-PRM under the Refined-BoN process on five multimodal mathematical benchmarks in Section 4.1. In addition, we present the results of our experiments and three conclusions analyzed from the results in Section 4.2. Finally, we show the ablation studies in Section 4.3.
### 4.1 Experimental Setup
#### Benchmarks.
We evaluate GM-PRM across five datasets: MathVista (Lu et al. 2023), MathVision (Wang et al. 2024), MathVerse (Zhang et al. 2024), DynaMath (Zou et al. 2024), and WeMath (Qiao et al. 2024). The datasets contain diverse problem types, such as plane geometry, functions, and puzzle tests. We use the Vision-Only subset of the MathVerse dataset and the Plane-Geometry subset of DynaMath.
#### Settings.
We employ GM-PRM as the critic model for Refined-BoN evaluation and set $N$ to 8 by default. We select six MLLMs as the policy models to generate step-by-step reasoning processes, with the temperature set to 0.7 and top-p to 0.9. For comparison, we use the average accuracy of the $N$ sets of answers generated by each policy model as its baseline.
#### Training Details.
To train GM-PRM, we use Qwen2.5-VL-7B-Instruct as our base model and perform supervised fine-tuning (SFT) with all parameters trainable except for the frozen Vision Transformer (ViT) encoder. We train with bfloat16 mixed precision and DeepSpeed ZeRO-3 for 2 epochs. The batch size on each training device is 2 and, through gradient accumulation, the effective batch size is 16. We train GM-PRM on two A800 GPUs using the AdamW optimizer with an initial learning rate of $1 \times 10^{-5}$; the learning rate schedule uses a linear warm-up with a warm-up ratio of 0.05, followed by linear decay.
### 4.2 Main Results
As shown in Table 1, integrating GM-PRM with the Refined-BoN process consistently improves performance across five benchmark datasets for six different MLLMs. On average, our method yields notable accuracy gains: +5.9 for MiniCPM-V2.6-8B, +4.5 for Llama-3.2-11B-Vision, +4.5 for Qwen2.5-VL-7B, +5.6 for InternVL3-8B, +2.9 for InternVL3-38B, and +3.6 for InternVL3-78B.
A closer look at dataset-level results reveals that the improvements are not uniform. The WeMath benchmark shows the most significant enhancement, with MiniCPM-V2.6-8B improving by +12.4 points, highlighting the ability of our method to strengthen mathematical reasoning on challenging problems. Similarly, MathVerse and DynaMath exhibit consistent gains of +4.5 to +6.7 points across multiple models, suggesting that our approach particularly benefits datasets requiring complex symbolic manipulation and multi-step reasoning. In contrast, MathVision improvements are more modest (+1.9 to +3.9), indicating that the visual reasoning component may already be relatively strong in baseline models.
GM-PRM combined with the Refined-BoN process demonstrates strong generalization across diverse multimodal mathematical problems, with particularly remarkable gains in plane geometry tasks. As illustrated in Figure 3, even after excluding plane geometry and function problems, policy models still achieve notable improvements across the datasets. This indicates that although GM-PRM is primarily trained on a dataset composed of plane geometry and function problems, it generalizes effectively to other types of multimodal mathematical problems. Moreover, as shown by the averaged results in Figure 3, the improvements achieved by GM-PRM with Refined-BoN on plane geometry problems consistently exceed those on the overall dataset, function problems, and other categories, underscoring the exceptional effectiveness of our method in tackling plane geometry tasks.
<details>
<summary>figures/Question_types.png Details</summary>

Grouped bar chart of accuracy improvements (%) for MiniCPM-V2.6-8B, InternVL3-8B, and InternVL3-78B across four question categories: Overall, Plane Geometry, Functions, and Others. Plane Geometry shows the largest improvement for every model (peaking at roughly 9% for MiniCPM-V2.6-8B), the Others category is generally the lowest, and Functions varies widely across models (from roughly 0.5% for InternVL3-8B to about 5.5% for the other two).
</details>
Figure 3: Improvements of the average percentage accuracy (%) of multiple MLLMs across different question types in MathVista, MathVision and MathVerse datasets.
The Refined-BoN process yields disproportionately larger gains for models with lower baseline performance. As shown in Table 1, InternVL3-38B starts with the highest initial average accuracy among all policy models (49.5%) and achieves a modest improvement of +2.9 points (+5.9%). In contrast, Qwen2.5-VL-7B, which has the highest baseline accuracy (43.3%) among models with fewer than 12 billion parameters, improves by +4.5 points (+10.4%), surpassing the relative gains of InternVL3-38B. Notably, MiniCPM-V-2.6-8B demonstrates the most significant relative improvement, achieving +5.9 points (+21.0%), despite its lower initial score. These results suggest that models with weaker baseline performance benefit more from the refinement mechanism of GM-PRM with Refined-BoN, likely because the process effectively corrects errors in reasoning steps, leaving greater room for improvement.
<details>
<summary>figures/MiniCPM-BoN.png Details</summary>

Line chart of accuracy (%) versus the number of solutions per problem (1, 4, 6, 8) for GM-PRM and Self-Consistency. Both start at roughly 38.5% with a single solution; GM-PRM rises to about 51.0% at N=8, while Self-Consistency plateaus near 46.0%, so the gap widens as N grows.
</details>
(a) MiniCPM-V-2.6-8B
<details>
<summary>figures/QwenVL-BoN.png Details</summary>

Line chart of accuracy (%) versus the number of solutions per problem (1, 4, 6, 8) for GM-PRM and Self-Consistency. Both start at roughly 60.5% with a single solution; GM-PRM reaches about 69.0% at N=8 versus about 66.5% for Self-Consistency, with the gap widening as N increases.
</details>
(b) Qwen2.5-VL-7B
<details>
<summary>figures/InternVL3-78B-BoN.png Details</summary>

Line chart of accuracy (%) versus the number of solutions per problem (1, 4, 6, 8) for GM-PRM and Self-Consistency. Both start at roughly 65.7% with a single solution; GM-PRM climbs to about 72.2% at N=8, while Self-Consistency plateaus near 68.1% after N=6.
</details>
(c) InternVL3-78B
Figure 4: The results of varying N in the Refined-BoN process on the WeMath benchmark across different policy models. As N increases, the effectiveness of GM-PRM in enhancing accuracy improves and surpasses that of Self-Consistency.
### 4.3 Hyperparameter & Ablation Study
#### Number of solution samples N in Refined BoN.
Following the Test-time Scaling technique, we vary $N$ in the Refined-BoN process to evaluate the performance of GM-PRM against the Self-Consistency baseline.
Figure 4 depicts WeMath accuracy as the number of sampled solutions per problem ($N$) increases from 1 to 8. Across all three backbones (MiniCPM-V-2.6-8B, Qwen2.5-VL-7B, and InternVL3-78B), both GM-PRM and the Self-Consistency (SC) baseline benefit from a larger sampling budget, yet GM-PRM exhibits a noticeably steeper growth curve.
Under the widely adopted Best-of-8 setting, GM-PRM delivers gains of 4.9 and 3.5 points over SC on MiniCPM-V-2.6-8B and Qwen2.5-VL-7B, respectively. Even for the 78B-parameter InternVL3, GM-PRM maintains a substantial 4.1-point margin. These results indicate that the proposed refinement strategy not only scales to larger models but also converts additional candidate solutions into accuracy more effectively than Self-Consistency, thereby underscoring the robustness and versatility of GM-PRM.
Furthermore, for MiniCPM-V-2.6-8B, GM-PRM surpasses the self-consistency baseline by 2.1, 2.2, and 4.9 points under the Best-of-4, Best-of-6, and Best-of-8 settings, respectively, indicating a steadily increasing performance gap between GM-PRM and self-consistency as N increases.
#### Methods for aggregating step scores.
For PRMs, the method used to aggregate step scores into an overall solution score plays a critical role. In this part, we compare several different aggregation strategies, including averaging step scores, selecting the maximum step score, and selecting the minimum step score. Since step-by-step solutions that contain steps judged incorrect are often not evaluated or scored for all steps, this experiment focuses exclusively on solutions where all steps are judged correct.
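Each strategy amounts to a one-line reduction over the per-step scores; the sketch below (with an assumed list-of-floats representation of step scores) makes the comparison explicit.

```python
def aggregate(step_scores, method="average"):
    """Aggregate per-step scores into a solution-level score using one
    of the three strategies compared in this section: averaging,
    taking the minimum, or taking the maximum step score."""
    if not step_scores:
        raise ValueError("step_scores must be non-empty")
    if method == "average":
        return sum(step_scores) / len(step_scores)
    if method == "min":
        return min(step_scores)
    if method == "max":
        return max(step_scores)
    raise ValueError(f"unknown aggregation method: {method}")
```

Averaging uses every step, the minimum is dominated by the single weakest step, and the maximum is dominated by the single strongest step, which explains why it correlates worst with overall solution quality.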
<details>
<summary>figures/Aggregation.png Details</summary>

Grouped bar chart of accuracy (%) for MiniCPM-V-2.6-8B, Qwen2.5-VL-7B, and InternVL3-8B under four settings: Original (baseline), Average, Min, and Max aggregation of step scores. For every model, Average and Min are the highest and nearly equal (e.g., roughly 48% versus a 43% baseline for Qwen2.5-VL-7B), Max trails them, and Original is the lowest.
</details>
Figure 5: Average percentage accuracy (%) of MLLMs via different aggregation methods across five datasets.
The results are illustrated in Figure 5. Across all policy models and datasets, we find that both averaging the step scores and selecting the minimum score significantly outperform selecting the maximum score. This suggests that the average or the minimum score reflects the overall quality of a solution more faithfully than the maximum score. Between the minimum and average aggregation methods, averaging performs slightly better. This improvement may stem from the fact that the average score takes all problem-solving steps into account, providing a more comprehensive evaluation, whereas the minimum score reflects only the single lowest-scoring step and thus offers a less holistic assessment.
#### Refined-BoN vs. BoN.
The Refined-BoN process aims to enhance the diversity of N candidate solutions by refining the steps judged incorrect and integrating the refined steps with the steps judged correct into the prompt for the policy models. In this part, we use the Pass@k metric to evaluate the diversity and accuracy of policy models in generating multiple solutions to the given problems.
The results are summarized in the Table 2. Overall, the Refined-BoN process improves Pass@8 scores compared to the standard BoN process across multiple policy models and five benchmark datasets. Specifically, it increases the average Pass@8 values of MiniCPM-V-2.6-8B, Llama-3.2-11B-Vision, and InternVL3-8B by 0.9, 1.3, and 0.9 points, respectively, across the five datasets, demonstrating the effectiveness of the Refined-BoN approach in enhancing the diversity and correctness of the generated solutions.
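As used here, Pass@k is the fraction of problems for which at least one of the k sampled solutions is correct; a minimal sketch, assuming a mapping from problems to per-solution correctness flags:

```python
def pass_at_k(results):
    """Empirical Pass@k as a percentage: a problem counts as solved if
    any of its k sampled solutions is correct. `results` is an assumed
    mapping from problem id to a list of per-solution booleans."""
    if not results:
        raise ValueError("results must be non-empty")
    solved = sum(1 for flags in results.values() if any(flags))
    return 100.0 * solved / len(results)
```

Because a problem counts as solved as soon as one candidate is correct, Pass@8 directly measures whether Refined-BoN enlarges the set of problems for which the candidate pool contains a correct solution.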
| Model | BoN | Refined-BoN | Δ |
| --- | --- | --- | --- |
| MiniCPM-V-2.6-8B | 62.5 | 63.4 | +0.9 |
| Llama-3.2-11B-Vision | 62.7 | 64.0 | +1.3 |
| InternVL3-8B | 65.3 | 66.2 | +0.9 |
Table 2: Average percentage Pass@8 scores of BoN and Refined-BoN across five datasets for different models.
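The Pass@k metric reported in Table 2 can be computed as sketched below (our illustration; the helper name and toy data are ours). With k samples per problem, a problem counts as solved if at least one sample reaches the correct final answer.

```python
# Minimal sketch of the Pass@k metric: the percentage of problems for
# which at least one of the k sampled solutions is correct.

def pass_at_k(per_problem_correctness):
    """per_problem_correctness: list of lists of booleans, one inner
    list of length k per problem (True = candidate answered correctly)."""
    solved = sum(any(flags) for flags in per_problem_correctness)
    return 100.0 * solved / len(per_problem_correctness)

results = [
    [False, False, True],   # solved by the third sample
    [False, False, False],  # unsolved
    [True, True, False],    # solved
]
print(pass_at_k(results))  # two of three problems solved, about 66.7
```

Because Refined-BoN replaces the first flawed step before resampling, it can convert an all-`False` row into one containing a `True`, which is exactly how it lifts Pass@8.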
## 5 Conclusion
In this work, we introduced GM-PRM, a novel paradigm that transforms the reward model from a passive judge into an active reasoning collaborator for multimodal mathematics. By providing fine-grained, interpretable analysis and, more critically, generating corrections for erroneous steps, GM-PRM moves beyond simple binary verification. This unique corrective capability powers our Refined Best-of-N (Refined-BoN) framework, which actively improves flawed reasoning trajectories at test time. Our experiments demonstrate that this approach achieves state-of-the-art results on multiple benchmarks, significantly boosting policy model performance with remarkable data efficiency. The consistent gains across diverse models and problem types underscore the robustness and generalizability of our method. This shift from passive error detection to generative, collaborative correction represents a crucial advance in multimodal reasoning.
## References
- Ahn et al. (2024) Ahn, J.; Verma, R.; Lou, R.; Liu, D.; Zhang, R.; and Yin, W. 2024. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157.
- Bai et al. (2023) Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; Hui, B.; Ji, L.; Li, M.; Lin, J.; Lin, R.; Liu, D.; Liu, G.; Lu, C.; Lu, K.; Ma, J.; Men, R.; Ren, X.; Ren, X.; et al. 2023. Qwen Technical Report. arXiv:2309.16609.
- Bai et al. (2025) Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; Zhong, H.; Zhu, Y.; Yang, M.; Li, Z.; Wan, J.; Wang, P.; Ding, W.; Fu, Z.; Xu, Y.; Ye, J.; Zhang, X.; Xie, T.; et al. 2025. Qwen2.5-VL Technical Report. arXiv:2502.13923.
- Caffagni et al. (2024) Caffagni, D.; Cocchi, F.; Barsellotti, L.; Moratelli, N.; Sarto, S.; Baraldi, L.; Cornia, M.; and Cucchiara, R. 2024. The revolution of multimodal large language models: a survey. arXiv preprint arXiv:2402.12451.
- Cao et al. (2025) Cao, Q.; Wang, R.; Zhang, R.; Somayajula, S. A.; and Xie, P. 2025. DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning. arXiv preprint arXiv:2505.20241.
- Chen et al. (2024) Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; Li, B.; Luo, P.; Lu, T.; Qiao, Y.; and Dai, J. 2024. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv:2312.14238.
- Du et al. (2025) Du, L.; Meng, F.; Liu, Z.; Zhou, Z.; Luo, P.; Zhang, Q.; and Shao, W. 2025. MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision. arXiv preprint arXiv:2505.13427.
- Gao et al. (2024) Gao, J.; Xu, S.; Ye, W.; Liu, W.; He, C.; Fu, W.; Mei, Z.; Wang, G.; and Wu, Y. 2024. On designing effective rl reward at training time for llm reasoning. arXiv preprint arXiv:2410.15115.
- Grattafiori et al. (2024) Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; Yang, A.; Fan, A.; Goyal, A.; Hartshorn, A.; Yang, A.; Mitra, A.; Sravankumar, A.; Korenev, A.; et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783.
- Hu et al. (2025) Hu, P.; Zhang, Z.; Chang, Q.; Liu, S.; Ma, J.; Du, J.; Zhang, J.; Liu, Q.; Gao, J.; Ma, F.; et al. 2025. PRM-BAS: Enhancing Multimodal Reasoning through PRM-guided Beam Annealing Search. arXiv preprint arXiv:2504.10222.
- Huo et al. (2024) Huo, J.; Yan, Y.; Hu, B.; Yue, Y.; and Hu, X. 2024. Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model. arXiv preprint arXiv:2406.11193.
- Khalifa et al. (2025) Khalifa, M.; Agarwal, R.; Logeswaran, L.; Kim, J.; Peng, H.; Lee, M.; Lee, H.; and Wang, L. 2025. Process Reward Models That Think. arXiv:2504.16828.
- Lambert et al. (2024) Lambert, N.; Pyatkin, V.; Morrison, J.; Miranda, L.; Lin, B. Y.; Chandu, K.; Dziri, N.; Kumar, S.; Zick, T.; Choi, Y.; et al. 2024. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787.
- Li et al. (2024) Li, Q.; Chen, Z.; Wang, W.; Wang, W.; Ye, S.; Jin, Z.; Chen, G.; He, Y.; Gao, Z.; Cui, E.; Yu, J.; Tian, H.; Zhou, J.; Xu, C.; Wang, B.; Wei, X.; Li, W.; Zhang, W.; Zhang, B.; Cai, P.; Wen, L.; et al. 2024. OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text. arXiv:2406.08418.
- Li and Li (2024) Li, W.; and Li, Y. 2024. Process reward model with q-value rankings. arXiv preprint arXiv:2410.11287.
- Li et al. (2025a) Li, X.; Yu, H.; Zhang, X.; Huang, Z.; He, S.; Liu, K.; Zhao, J.; Huang, F.; and Li, Y. 2025a. Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns. arXiv preprint arXiv:2505.23474.
- Li et al. (2025b) Li, Z.; Wen, X.; Lou, J.; Ji, Y.; Lu, Y.; Han, X.; Zhang, D.; and Sun, L. 2025b. The devil is in the details: Tackling unimodal spurious correlations for generalizable multimodal reward models. arXiv preprint arXiv:2503.03122.
- Lightman et al. (2023) Lightman, H.; Kosaraju, V.; Burda, Y.; Edwards, H.; Baker, B.; Lee, T.; Leike, J.; Schulman, J.; Sutskever, I.; and Cobbe, K. 2023. Let's verify step by step. In The Twelfth International Conference on Learning Representations.
- Liu et al. (2024a) Liu, W.; Li, J.; Zhang, X.; Zhou, F.; Cheng, Y.; and He, J. 2024a. Diving into Self-Evolving Training for Multimodal Reasoning. arXiv:2412.17451.
- Liu et al. (2024b) Liu, Y.; Cao, Y.; Gao, Z.; Wang, W.; Chen, Z.; Wang, W.; Tian, H.; Lu, L.; Zhu, X.; Lu, T.; Qiao, Y.; and Dai, J. 2024b. MMInstruct: a high-quality multi-modal instruction tuning dataset with extensive diversity. Science China Information Sciences, 67(12).
- Liu et al. (2023) Liu, Z.; He, Y.; Wang, W.; Wang, W.; Wang, Y.; Chen, S.; Zhang, Q.; Lai, Z.; Yang, Y.; Li, Q.; Yu, J.; Li, K.; Chen, Z.; Yang, X.; Zhu, X.; Wang, Y.; Wang, L.; Luo, P.; Dai, J.; and Qiao, Y. 2023. InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language. arXiv:2305.05662.
- Lu et al. (2023) Lu, P.; Bansal, H.; Xia, T.; Liu, J.; Li, C.; Hajishirzi, H.; Cheng, H.; Chang, K.-W.; Galley, M.; and Gao, J. 2023. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.
- Luo et al. (2024) Luo, L.; Liu, Y.; Liu, R.; Phatale, S.; Guo, M.; Lara, H.; Li, Y.; Shu, L.; Zhu, Y.; Meng, L.; Sun, J.; and Rastogi, A. 2024. Improve Mathematical Reasoning in Language Models by Automated Process Supervision. arXiv:2406.06592.
- Luo et al. (2025) Luo, R.; Zheng, Z.; Wang, Y.; Ni, X.; Lin, Z.; Jiang, S.; Yu, Y.; Shi, C.; Chu, R.; Zeng, J.; and Yang, Y. 2025. URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics. arXiv:2501.04686.
- Miao et al. (2025) Miao, B.; Wu, Y.; Gao, M.; Yu, Q.; Bu, W.; Zhang, W.; Li, Y.; Tang, S.; Chua, T.-S.; and Li, J. 2025. Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark. arXiv preprint arXiv:2503.18665.
- Qiao et al. (2024) Qiao, R.; Tan, Q.; Dong, G.; Wu, M.; Sun, C.; Song, X.; GongQue, Z.; Lei, S.; Wei, Z.; Zhang, M.; et al. 2024. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284.
- Setlur et al. (2024) Setlur, A.; Nagpal, C.; Fisch, A.; Geng, X.; Eisenstein, J.; Agarwal, R.; Agarwal, A.; Berant, J.; and Kumar, A. 2024. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146.
- She et al. (2025) She, S.; Liu, J.; Liu, Y.; Chen, J.; Huang, X.; and Huang, S. 2025. R-PRM: Reasoning-Driven Process Reward Modeling. arXiv:2503.21295.
- Shi et al. (2024) Shi, W.; Hu, Z.; Bin, Y.; Liu, J.; Yang, Y.; Ng, S.-K.; Bing, L.; and Lee, R. K.-W. 2024. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294.
- Song et al. (2025) Song, M.; Su, Z.; Qu, X.; Zhou, J.; and Cheng, Y. 2025. PRMBench: A fine-grained and challenging benchmark for process-level reward models. arXiv preprint arXiv:2501.03124.
- Tian et al. (2024) Tian, C.; Zhu, X.; Xiong, Y.; Wang, W.; Chen, Z.; Wang, W.; Chen, Y.; Lu, L.; Lu, T.; Zhou, J.; Li, H.; Qiao, Y.; and Dai, J. 2024. MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer. arXiv:2401.10208.
- Touvron et al. (2023a) Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
- Touvron et al. (2023b) Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; Bikel, D.; Blecher, L.; Ferrer, C. C.; Chen, M.; Cucurull, G.; Esiobu, D.; Fernandes, J.; Fu, J.; et al. 2023b. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
- Wang et al. (2024) Wang, K.; Pan, J.; Shi, W.; Lu, Z.; Ren, H.; Zhou, A.; Zhan, M.; and Li, H. 2024. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37: 95095–95169.
- Wang et al. (2023) Wang, P.; Li, L.; Shao, Z.; Xu, R.; Dai, D.; Li, Y.; Chen, D.; Wu, Y.; and Sui, Z. 2023. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935.
- Wang et al. (2025a) Wang, S.; Liu, Z.; Wei, J.; Yin, X.; Li, D.; and Barsoum, E. 2025a. Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models. arXiv:2506.09532.
- Wang et al. (2025b) Wang, W.; Gao, Z.; Chen, L.; Chen, Z.; Zhu, J.; Zhao, X.; Liu, Y.; Cao, Y.; Ye, S.; Zhu, X.; et al. 2025b. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291.
- Yan and Lee (2024) Yan, Y.; and Lee, J. 2024. Georeasoner: Reasoning on geospatially grounded context for natural language understanding. In Proceedings of the 33rd ACM international conference on information and knowledge management, 4163–4167.
- Yan et al. (2024a) Yan, Y.; Su, J.; He, J.; Fu, F.; Zheng, X.; Lyu, Y.; Wang, K.; Wang, S.; Wen, Q.; and Hu, X. 2024a. A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges. arXiv preprint arXiv:2412.11936.
- Yan et al. (2024b) Yan, Y.; Wang, S.; Huo, J.; Li, H.; Li, B.; Su, J.; Gao, X.; Zhang, Y.-F.; Xu, T.; Chu, Z.; et al. 2024b. Errorradar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection. arXiv preprint arXiv:2410.04509.
- Yan et al. (2025a) Yan, Y.; Wang, S.; Huo, J.; Ye, J.; Chu, Z.; Hu, X.; Yu, P. S.; Gomes, C.; Selman, B.; and Wen, Q. 2025a. Position: Multimodal large language models can significantly advance scientific reasoning. arXiv preprint arXiv:2502.02871.
- Yan et al. (2025b) Yan, Y.; Wang, S.; Huo, J.; Yu, P. S.; Hu, X.; and Wen, Q. 2025b. Mathagent: Leveraging a mixture-of-math-agent framework for real-world multimodal mathematical error detection. arXiv preprint arXiv:2503.18132.
- Yan et al. (2024c) Yan, Y.; Wen, H.; Zhong, S.; Chen, W.; Chen, H.; Wen, Q.; Zimmermann, R.; and Liang, Y. 2024c. Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. In Proceedings of the ACM Web Conference 2024, 4006–4017.
- Yao et al. (2024) Yao, Y.; Yu, T.; Zhang, A.; Wang, C.; Cui, J.; Zhu, H.; Cai, T.; Li, H.; Zhao, W.; He, Z.; Chen, Q.; Zhou, H.; Zou, Z.; Zhang, H.; Hu, S.; Zheng, Z.; Zhou, J.; Cai, J.; Han, X.; Zeng, G.; Li, D.; Liu, Z.; and Sun, M. 2024. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv:2408.01800.
- Yuan et al. (2024) Yuan, L.; Li, W.; Chen, H.; Cui, G.; Ding, N.; Zhang, K.; Zhou, B.; Liu, Z.; and Peng, H. 2024. Free process rewards without process labels. arXiv preprint arXiv:2412.01981.
- Zeng et al. (2025) Zeng, T.; Zhang, S.; Wu, S.; Classen, C.; Chae, D.; Ewer, E.; Lee, M.; Kim, H.; Kang, W.; Kunde, J.; et al. 2025. Versaprm: Multi-domain process reward model via synthetic reasoning data. arXiv preprint arXiv:2502.06737.
- Zhang et al. (2025a) Zhang, K.; Zhang, J.; Li, H.; Zhu, X.; Hua, E.; Lv, X.; Ding, N.; Qi, B.; and Zhou, B. 2025a. Openprm: Building open-domain process-based reward models with preference trees. In The Thirteenth International Conference on Learning Representations.
- Zhang et al. (2024) Zhang, R.; Jiang, D.; Zhang, Y.; Lin, H.; Guo, Z.; Qiu, P.; Zhou, A.; Lu, P.; Chang, K.-W.; Qiao, Y.; et al. 2024. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, 169–186. Springer.
- Zhang et al. (2025b) Zhang, Z.; Zheng, C.; Wu, Y.; Zhang, B.; Lin, R.; Yu, B.; Liu, D.; Zhou, J.; and Lin, J. 2025b. The Lessons of Developing Process Reward Models in Mathematical Reasoning. arXiv:2501.07301.
- Zhao et al. (2025) Zhao, J.; Liu, R.; Zhang, K.; Zhou, Z.; Gao, J.; Li, D.; Lyu, J.; Qian, Z.; Qi, B.; Li, X.; and Zhou, B. 2025. GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. arXiv:2504.00891.
- Zheng et al. (2024a) Zheng, C.; Zhang, Z.; Zhang, B.; Lin, R.; Lu, K.; Yu, B.; Liu, D.; Zhou, J.; and Lin, J. 2024a. Processbench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559.
- Zheng et al. (2024b) Zheng, K.; Chen, J.; Yan, Y.; Zou, X.; and Hu, X. 2024b. Reefknot: A comprehensive benchmark for relation hallucination evaluation, analysis and mitigation in multimodal large language models. arXiv preprint arXiv:2408.09429.
- Zhong et al. (2025) Zhong, J.; Shen, W.; Li, Y.; Gao, S.; Lu, H.; Chen, Y.; Zhang, Y.; Zhou, W.; Gu, J.; and Zou, L. 2025. A comprehensive survey of reward models: Taxonomy, applications, challenges, and future. arXiv preprint arXiv:2504.12328.
- Zhu et al. (2023) Zhu, W.; Hessel, J.; Awadalla, A.; Gadre, S. Y.; Dodge, J.; Fang, A.; Yu, Y.; Schmidt, L.; Wang, W. Y.; and Choi, Y. 2023. Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text. arXiv:2304.06939.
- Zhuang et al. (2025) Zhuang, W.; Huang, X.; Zhang, X.; and Zeng, J. 2025. Math-puma: Progressive upward multimodal alignment to enhance mathematical reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 26183–26191.
- Zou et al. (2024) Zou, C.; Guo, X.; Yang, R.; Zhang, J.; Hu, B.; and Zhang, H. 2024. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836.
## Appendix A Appendix
### A.1 More Related Work
#### Multimodal Large Language Models (MLLMs)
Rapid progress in artificial intelligence has driven the development of Multimodal Large Language Models (MLLMs). MLLMs extend the capabilities of language-centric models by integrating multiple sensory inputs, primarily visual and auditory, with text. Unlike traditional Large Language Models (LLMs), which process solely textual data, MLLMs are designed to perceive and reason across modalities such as vision and language, thereby achieving the fusion and interaction of multimodal information. The development of MLLMs has been driven by extensive efforts spanning both model structure and data curation. In terms of model structure, multiple studies (Bai et al. 2025; Liu et al. 2023; Yao et al. 2024) achieve notable performance by using connectors to align the visual embeddings of Vision Foundation Models (VFMs) (Chen et al. 2024) with the latent space of LLMs (Bai et al. 2023; Touvron et al. 2023a, b). Alternatively, another line of research (Grattafiori et al. 2024; Tian et al. 2024) enhances pre-trained LLMs by adding supplementary layers to integrate visual features, which reduces the number of visual tokens but incurs additional training costs. Regarding dataset curation, recent research has achieved substantial advancements. Specifically, MultimodalC4 (Zhu et al. 2023) extends the text-only C4 corpus with images, constructing a corpus that supports pre-training for MLLMs. Furthermore, OmniCorpus (Li et al. 2024) delivers a large-scale yet noisy multimodal dataset suitable for pre-training, and MMInstruct (Liu et al. 2024b) presents an open-source collection of high-quality data designed for instruction tuning. The majority of research efforts have concentrated on the training of MLLMs, leaving significant room for exploration in Test-Time Scaling (TTS) techniques.
In our work, we investigate the potential of enhancing the performance of MLLMs by incorporating Process Reward Model (PRM) into the TTS framework.
### A.2 Benchmark
We provide more details about the Refined-BoN test benchmarks in Table 3:
| Benchmark | Subset | Size |
| --- | --- | --- |
| DynaMath | Plane Geometry | 770 |
| MathVerse | Vision-Only | 788 |
| MathVista | Testmini | 1000 |
| WeMath | Testmini | 1740 |
| MathVision | Full | 3040 |
Table 3: More details about the Refined-BoN test benchmarks.
### A.3 Dataset
To ensure a balanced distribution of process labels, we carefully construct the training dataset. The final dataset used to train GM-PRM contains 19,614 samples in total, comprising 9,061 solutions that contain incorrect steps (as jointly identified by GPT-4o and Monte Carlo (MC) estimation) and 10,553 solutions in which all steps are judged to be correct.
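The joint-identification filter described above can be sketched as follows. This is a hedged illustration under our own assumptions (function names and the tuple convention are ours): a solution is kept as a negative example only when the two labelers agree on the index of the first erroneous step, and as a positive example when both judge every step correct.

```python
# Sketch of agreement-based label filtering between two step-level
# annotators (e.g., GPT-4o judgements and MC-estimation judgements).

def first_error_agrees(gpt4o_labels, mc_labels):
    """Both inputs: lists of step labels, True = step judged correct.
    Returns ('positive', None), ('negative', first_error_index),
    or (None, None) when the annotators disagree (sample discarded)."""
    first = lambda labels: next((i for i, ok in enumerate(labels) if not ok), None)
    g, m = first(gpt4o_labels), first(mc_labels)
    if g is None and m is None:
        return ("positive", None)   # fully correct solution
    if g is not None and g == m:
        return ("negative", g)      # agreed first erroneous step
    return (None, None)             # disagreement: discard

print(first_error_agrees([True, True], [True, True]))    # ('positive', None)
print(first_error_agrees([True, False], [True, False]))  # ('negative', 1)
print(first_error_agrees([True, False], [False, False])) # (None, None)
```

Discarding disagreements trades dataset size for label reliability, which is consistent with the method's reported data efficiency.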
### A.4 Prompt
In this section, we present the prompts used to construct the training dataset and to generate the reasoning processes and final answers. The prompt used to guide the policy models in generating reasoning processes and final answers for multiple-choice problems is shown in Figure 6.
<details>
<summary>figures/Prompt_multi_choice.png Details</summary>

### Visual Description
## Screenshot: Prompt Template for Multi-Choice Problem Reasoning
### Overview
The image displays a digital text block, likely a screenshot or a rendered document, containing a structured prompt template. The template is designed to instruct an AI or a person on how to generate step-by-step reasoning for solving multimodal mathematical problems that include an image. The text is presented in a clean, sans-serif font on a light gray background, with specific sections highlighted in color.
### Content Details
The text is entirely in English. Below is a precise transcription of all visible text, preserving formatting and placeholders.
**Title (Bold):**
`Prompt for generating reasoning of multi-choice problems:`
**Horizontal Line:**
A solid horizontal line separates the title from the main body.
**Main Instructional Text:**
`You are an expert in solving multimodal mathematical problems. I will provide a mathematical problem along with its corresponding image. According to the problem and the image, please first conduct step-by-step reasoning, and after your reasoning, please provide the correct option letter (e.g., A, B, C, D, E) using the format: "Final answer: ..."`
**Problem Section:**
- The word `Problem:` is displayed in **red** text.
- Below it, the placeholder `<Question>` is displayed in **blue** text, indicating where the actual problem statement would be inserted.
### Key Observations
1. **Template Structure:** The prompt is a reusable template. It defines a clear two-part task: 1) Provide step-by-step reasoning, and 2) Conclude with a final answer in a specific format.
2. **Multimodal Context:** The prompt explicitly mentions the problem will be "multimodal," involving both text ("the problem") and an image ("its corresponding image").
3. **Answer Format Specification:** It mandates a precise output format for the final answer: `"Final answer: [Option Letter]"`, with examples given (A, B, C, D, E).
4. **Visual Cues:** Color is used functionally. Red highlights the "Problem:" label, drawing attention to the start of the problem statement. Blue is used for the placeholder text `<Question>`, distinguishing it from the instructional text.
### Interpretation
This image does not contain data, charts, or diagrams to analyze. Instead, it is a **meta-document**, a set of instructions for generating a specific type of output.
* **Purpose:** The template's purpose is to standardize and guide the process of solving visual math problems. It ensures the solver (likely an AI model) first explains its logic transparently before committing to a final answer, which is a common practice in educational and testing contexts to demonstrate understanding.
* **Underlying Logic:** The structure reflects a pedagogical approach: reasoning is valued as much as the correct answer. The requirement to reference both the text and the image underscores the importance of integrating information from multiple sources (textual and visual) to solve a problem.
* **Implied Workflow:** The template suggests a workflow where a user would replace `<Question>` with a specific math problem and provide an accompanying image. The AI would then process both inputs, generate a reasoning chain, and output the final answer in the prescribed format. This is a common pattern for evaluating AI capabilities on complex, multimodal reasoning tasks.
</details>
Figure 6: Prompt for policy models to generate reasoning and final answers of multi-choice problem.
The prompt used to guide the policy models in generating reasoning processes and final answers for free-form problems is shown in Figure 7.
<details>
<summary>figures/Prompt_free_form.png Details</summary>

### Visual Description
## Text-Based Prompt Template: Free-Form Problem Reasoning
### Overview
The image displays a text-based prompt template designed to instruct an AI model on how to approach and solve free-form mathematical problems. The template is presented within a simple bordered box and contains structured instructions, a placeholder for a problem, and formatting cues (color, dashed lines) to organize the content.
### Components/Axes
The image is a single, self-contained text block with the following structural elements:
1. **Header/Title**: Located at the top, in bold text: "Prompt for generating reasoning of free-form problems:"
2. **Separator**: A dashed line (`-------------------`) directly below the title.
3. **Instructional Paragraph**: The main body of text, which defines the AI's role and the required output format.
4. **Problem Section**: Located at the bottom, initiated by a red-colored label "Problem:" followed by a blue-colored placeholder tag `<Question>`.
### Content Details
The complete textual content of the image is transcribed below. All text is in English.
**Title:**
`Prompt for generating reasoning of free-form problems:`
**Separator:**
`-------------------`
**Instructional Text:**
`You are an expert in solving multimodal mathematical problems. I will provide a mathematical problem along with its corresponding image. According to the problem and the image, please first conduct step-by-step reasoning, and after your reasoning, please provide your final answer using the format: "Final answer: ..."`
**Problem Section:**
`Problem:` (displayed in red font)
`<Question>` (displayed in blue font, serving as a placeholder)
### Key Observations
* **Formatting Cues**: The template uses visual formatting to denote different sections. The title is bold, a dashed line creates a visual break, and color is used to highlight the "Problem:" label (red) and the question placeholder (blue).
* **Structured Workflow**: The instructions explicitly define a two-step process for the AI: 1) Provide step-by-step reasoning, and 2) Deliver a final answer in a specific, quoted format.
* **Multimodal Context**: The prompt specifies that the AI will receive both a textual problem and a corresponding image, indicating it is designed for a multimodal AI system.
* **Placeholder Usage**: The `<Question>` tag is a clear placeholder, indicating where the specific mathematical problem text would be inserted when using this template.
### Interpretation
This image is not a data chart but a **meta-instructional document**. Its purpose is to standardize how an AI model should process and respond to complex, open-ended mathematical problems that may involve visual data.
* **What it demonstrates**: It reveals a methodology for eliciting structured, transparent reasoning from an AI. By mandating a "step-by-step" process before the final answer, the template aims to make the AI's problem-solving logic auditable and to reduce the likelihood of a purely guessed or opaque final result.
* **How elements relate**: The title sets the context. The dashed line separates the title from the core instructions. The instructional paragraph defines the AI's persona ("expert"), the input modalities (text + image), and the required output structure. The final "Problem:" section is the designated slot for the actual task, creating a clear separation between the *instructions for how to work* and the *work itself*.
* **Notable Design Choice**: The use of a specific, quoted format for the final answer (`"Final answer: ..."`) is a key design feature. It creates a predictable, machine-readable endpoint for the response, which is crucial for automated evaluation or downstream processing of the AI's output. This template is a tool for controlling and standardizing AI behavior in a research or application setting.
</details>
Figure 7: Prompt for policy models to generate reasoning and final answers of free-form problem.
The prompt we use to employ GPT-4o to generate the training dataset is shown in Figure 8.
<details>
<summary>figures/Prompt.png Details</summary>

### Visual Description
## Technical Document: Prompt Template for GPT-4o Training Dataset Construction
### Overview
The image displays a structured text document outlining a prompt template designed for an AI model (specifically GPT-4o) to construct a training dataset. The document is formatted as a technical specification, defining a task for evaluating and correcting multi-step solutions to multimodal mathematical problems. The text is entirely in English.
### Content Details
The document is organized into the following sections, transcribed verbatim:
**Header:**
`Prompt for GPT-4o to construct training dataset:`
Followed by a dashed line separator.
**Introduction:**
`You are an expert in solving multimodal mathematical problems. You will be given:`
`1. A multimodal mathematical problem and its corresponding image.`
`2. A multiple-step solution (each step on a new line).`
**Task Section (Labeled in Red):**
`**Task**:` (The label "**Task**:" is in red text)
`The tasks you need to do are:`
`1. Analyze the purpose of each step and what specific actions were taken in each step.`
`2. Analyze each step's correctness in terms of image alignment and reasoning logic.`
`- Image alignment: Whether the information and reasoning used in the step are consistent with the content of the provided image.`
`- Reasoning logic: Whether the reasoning is logically sound, calculations are correct, and information used matches that from previous steps and question.`
`When outputting judgements, you must choose one output from "Correct" or "Incorrect".`
`3. For the first incorrect step, correct it based on your analysis of its error, and output the corrected step at the end of your output.`
**Question Section (Labeled in Red):**
`**Question**:` (The label "**Question**:" is in red text)
`The multimodal mathematical problem is as follows:`
`<Question>` (This is a blue placeholder tag)
**Solution Steps Section (Labeled in Red):**
`**Solution Steps**:` (The label "**Solution Steps**:" is in red text)
`The multiple-step solution is as follows:`
`<Solution Steps>` (This is a blue placeholder tag)
**Output Format Section (Labeled in Red):**
`**Output Format**:` (The label "**Output Format**:" is in red text)
`You must output your content in the following format:`
`### Step 1 ###`
`Step intent analysis:[Describe what the step aims to do and the specific actions]`
`Image alignment analysis:[Analyze the consistency of image alignment]`
`Judgement of image alignment:[Correct/Incorrect]`
`Reasoning logic analysis:[Analyze the rationality of logic, correctness of calculations and consistency with prior step]`
`Judgement of reasoning logic:[Correct/Incorrect]`
`Final judgement of the current step:[Correct/Incorrect]`
`### Step 2 ###`
`...`
`Corrected step of the first incorrect step in solution:`
`Step n:[assume that the first incorrect step is step n, and fill in the corrected step n in the square bracket]`
### Key Observations
1. **Document Type:** This is a meta-prompt or a system prompt template, not a chart, diagram, or data visualization. It contains instructions and a structured format for another AI to follow.
2. **Formatting Cues:** Key section labels (`**Task**:`, `**Question**:`, `**Solution Steps**:`, `**Output Format**:`) are highlighted in red. Placeholders for dynamic content (`<Question>`, `<Solution Steps>`) are highlighted in blue.
3. **Structured Output:** The required output format is highly structured, mandating a step-by-step analysis with specific fields for intent, image alignment, reasoning logic, and final judgment for each step in a solution.
4. **Error Correction Protocol:** The template includes a specific instruction to identify and correct the *first* incorrect step found in the provided solution.
### Interpretation
This document defines a rigorous evaluation protocol for multimodal mathematical reasoning. Its purpose is to generate high-quality training data by having an AI model act as a "grader" or "tutor." The process involves:
* **Deconstruction:** Breaking down a solution into discrete steps.
* **Multi-faceted Evaluation:** Assessing each step on two independent axes: fidelity to the provided image (visual grounding) and internal logical/mathematical soundness.
* **Structured Judgement:** Enforcing a binary (Correct/Incorrect) output for clarity and consistency in the generated dataset.
* **Pedagogical Correction:** Requiring the generation of a corrected step, which adds a constructive, teaching-oriented element to the dataset.
The template is designed to produce data that can train models not just to solve problems, but to *explain* and *critique* solution processes, emphasizing the importance of aligning reasoning with visual evidence. The strict output format ensures the generated data is machine-readable and consistent, suitable for fine-tuning or evaluation benchmarks.
</details>
Figure 8: Prompt for GPT-4o to construct training dataset.
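Because the template in Figure 8 mandates a fixed output format, the PRM's analysis can be consumed programmatically. The parser below is our own sketch (regexes and function names are ours, written against the field names shown in the template), recovering the per-step judgements and the corrected step.

```python
import re

# Sketch of a parser for the structured output format mandated by the
# GPT-4o prompt template (field names follow the template verbatim).

STEP_RE = re.compile(
    r"### Step (\d+) ###\s*"
    r".*?Final judgement of the current step:\s*\[?(Correct|Incorrect)\]?",
    re.S,
)
CORRECTION_RE = re.compile(
    r"Corrected step of the first incorrect step in solution:\s*"
    r"Step (\d+):\s*\[?(.+?)\]?\s*$",
    re.S,
)

def parse_gm_prm_output(text):
    """Return ({step_number: is_correct}, (step_number, corrected_text) or None)."""
    steps = {int(n): verdict == "Correct" for n, verdict in STEP_RE.findall(text)}
    m = CORRECTION_RE.search(text)
    correction = (int(m.group(1)), m.group(2).strip()) if m else None
    return steps, correction

sample = """### Step 1 ###
Step intent analysis:[compute areas]
Final judgement of the current step:[Correct]
### Step 2 ###
Step intent analysis:[subtract]
Final judgement of the current step:[Incorrect]
Corrected step of the first incorrect step in solution:
Step 2:[96 - 60 = 36]"""
steps, correction = parse_gm_prm_output(sample)
print(steps)       # {1: True, 2: False}
print(correction)  # (2, '96 - 60 = 36')
```

In the Refined-BoN loop, the returned correction would replace the first incorrect step before the partial solution is handed back to the policy model.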
### A.5 More Cases
We present two more cases of GM-PRM and the Refined-BoN process, illustrated in Figures 9 and 10.
<details>
<summary>figures/More_Cases_1.png Details</summary>

### Visual Description
## Diagram: Mathematical Problem-Solving with Error Analysis
### Overview
The image is a structured educational or technical document that presents a geometry problem, an incorrect solution attempt, a corrected solution, and a meta-analysis of the error. It is divided into distinct sections: a visual problem representation, a textual problem statement, an "Original Solution" with a mistake, a "Refined Solution" with the correction, and an analytical commentary labeled "GM-PRM: Analysis & Judgement."
### Components/Axes
The document is organized into the following spatial regions:
1. **Header Region (Top):**
* **Left Header:** "Question (Visual Part)" in a dark grey box.
* **Right Header:** "Question (Textual Part)" in a dark grey box.
2. **Problem Statement Region (Below Headers):**
* **Visual Part (Left):** A diagram of a rectangle.
* **Labels:** Height = `6` (left side). Width segments = `1 000` and `10` (top).
* **Internal Areas:** Two orange sub-rectangles labeled `6 000` and `60`. A white square on the right (unlabeled).
* **Caption:** "Total area: 6 096" (below the diagram).
* **Textual Part (Right):** The question "How much area is remaining?" in bold text.
3. **Solution Region (Middle):**
* **Original Solution (Left, Blue Header Box):** A step-by-step solution marked with green checkmarks (✓) for correct steps and a red cross (✗) for an incorrect step.
* **Refined Solution (Right, Purple Header Box):** A corrected step-by-step solution, with all steps marked with green checkmarks (✓).
4. **Analysis Region (Bottom Left):**
* **GM-PRM: Analysis & Judgement (Purple Header Box):** A commentary box analyzing the error in the original solution.
### Detailed Analysis / Content Details
**Problem Data:**
* Total Area: `6096` square units.
* Allocated Area 1: `6000` square units.
* Allocated Area 2: `60` square units.
* Goal: Find the remaining area (represented by the white square in the diagram).
**Original Solution Steps:**
* Step 1: States the method: subtract allocated areas from the total area. (✓)
* Step 2: Identifies the total area as `6096`. (✓)
* Step 3: Identifies allocated areas as `6000` and `60`. (✓)
* Step 4: Performs the subtraction:
* First subtraction: `6096 - 6000 = 96`. (✓)
* Second subtraction: `96 - 60 = 6`. (✗) **This is marked as incorrect.**
**Refined Solution Steps:**
* Steps 1-3 are identical to the original.
* Step 4: Performs the corrected subtraction:
* First subtraction: `6096 - 6000 = 96`. (✓)
* Second subtraction: `96 - 60 = 36`. (✓) **This is the corrected calculation.**
* Step 5: Final answer: "The remaining area is 36 square units." (✓)
**GM-PRM Analysis & Judgement:**
* **Step intent analysis:** Confirms the step's purpose is to subtract allocated areas from the total.
* **Image alignment analysis:** States the diagram supports the subtraction process.
* **Reasoning logic analysis:** Identifies the core error: "The calculations are performed correctly: (6096 - 6000 = 96) and (96 - 60 = 36), but the final result should be 36 instead of 6." This indicates the original solution's arithmetic in the second subtraction was faulty (`96 - 60` was incorrectly computed as `6` instead of `36`).
### Key Observations
| Observation | Description |
| :--- | :--- |
| **Error Localization** | The error is precisely isolated to a single arithmetic mistake in the second subtraction step of the original solution. |
| **Visual-Textual Alignment** | The diagram visually represents the problem (total area minus two parts), which aligns perfectly with the textual problem statement and the solution method. |
| **Pedagogical Structure** | The document uses a clear compare-and-contrast format (Original vs. Refined) with symbolic markers (✓, ✗) and a dedicated analysis box to highlight and explain the mistake. |
| **Numerical Consistency** | The total area (`6096`) is the sum of the two allocated areas (`6000 + 60 = 6060`) and the remaining area (`36`), confirming the mathematical relationship: `6060 + 36 = 6096`. |
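The numerical consistency noted in the table can be checked directly. A minimal Python sketch (variable names are illustrative, not taken from the paper):

```python
# Sanity check of the arithmetic in the Figure 9 case.
total_area = 6096        # total area from the diagram caption
allocated = [6000, 60]   # the two orange sub-rectangles

# The refined solution: subtract each allocated area in turn.
first = total_area - allocated[0]  # 6096 - 6000 = 96
remaining = first - allocated[1]   # 96 - 60 = 36 (the original solution wrote 6 here)

assert remaining == 36
assert sum(allocated) + remaining == total_area  # 6060 + 36 = 6096
print(remaining)  # 36
```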
### Interpretation
This image serves as a technical case study in error detection and correction within a mathematical problem-solving workflow. It demonstrates a process of **computational verification**.
* **What it demonstrates:** The core lesson is that a correct procedural approach (subtracting parts from a whole) can still yield a wrong answer due to a simple arithmetic error. The "GM-PRM" analysis acts as a validation layer, performing a logic check that separates the correctness of the *method* from the correctness of the *execution*.
* **Relationship between elements:** The visual diagram grounds the abstract numbers in a spatial model. The two solution paths show the consequence of the error. The analysis box provides the meta-cognitive layer, explaining *why* the original was wrong and *how* the refined version corrects it. This structure is typical of educational tools, automated tutoring systems, or technical documentation for debugging mathematical software.
* **Notable implication:** The presence of a formal "Analysis & Judgement" component suggests this might be output from an AI or automated system designed to evaluate and refine problem-solving steps, highlighting the importance of automated checks in computational pipelines. The error is not conceptual but operational, emphasizing the need for precision at every step of a calculation.
</details>
Figure 9: The illustration of a representative example before and after applying GM-PRM. In particular, GM-PRM first judges the steps of the original solution generated by the policy model. It then finds that the fourth step is incorrect with respect to reasoning logic and refines that step to generate the correct version. The corrected steps are fed back into the policy model to generate the refined solution, which finally yields the correct answer.
<details>
<summary>figures/More_Cases_2.png Details</summary>

### Visual Description
## Composite Diagram: Mathematical Problem with Graphical and Textual Components
### Overview
The image is a composite educational diagram presenting a mathematical problem involving two quadratic functions representing mosquito and bat populations relative to rainfall. It includes a graph, the problem statement, an original (flawed) solution, a refined (correct) solution, and an analytical commentary on the error. The purpose is to demonstrate a common algebraic mistake and its correction.
### Components/Axes
The image is segmented into four primary regions:
1. **Top-Left (Question - Visual Part):** A Cartesian coordinate graph.
* **X-axis:** Labeled "x", representing rainfall in centimeters. Scale: 0 to 6, with major gridlines at integers.
* **Y-axis:** Labeled "y", representing population count (implied). Scale: 0 to 6, with major gridlines at integers.
* **Curves:**
* A blue parabola opening downward, labeled with the equation `y = 5x - x²` and the letter `f`.
* A pink/magenta parabola opening downward, labeled with the equation `y = 3x - 0.5x²` and the letter `g`.
* **Key Points:** The curves intersect at the origin (0,0) and at a second point approximately at (4, 4).
2. **Top-Right (Question - Textual Part):** A block of text stating the problem.
* **Text:** "In both equations x represents rainfall (in centimeters). When there is 0cm of rainfall, the number of mosquitos is the same as the number of bats. What is another rainfall amount where the number of mosquitos is the same as the number of bats? Round your answer to the nearest half centimeter."
3. **Bottom-Left (Original Solution & Analysis):** A step-by-step solution marked with errors.
* **Header:** "Original Solution" in a blue box.
* **Steps 1-3:** Correctly identify the equations and set them equal: `5x - x² = 3x - 0.5x²`.
* **Step 4 (Error):** Contains a red "X" and shows a flawed algebraic simplification: `[5x - =x - x² + 0.5x² = 0] [5x - =x + 0.5x² - x² = 0] [5x - =x - 0.5x² = 0] [5x - =x - 0.5x² = 0] [5x - =x - 0.5x² = 0]`. The notation is inconsistent and incorrect.
* **GM-PRM Box:** A purple box titled "GM-PRM: Analysis & Judgement" providing commentary:
* *Step intent analysis:* "This step simplifies the equation set in Step 3 to prepare it for solving."
* *Image alignment analysis:* "The simplification process should reflect the intersection points seen in the image."
* *Reasoning logic analysis:* "There is a mistake in the simplification process. The rearrangement and simplification steps contain errors in algebraic manipulation."
4. **Bottom-Right (Refined Solution):** A corrected step-by-step solution.
* **Header:** "Refined Solution" in a purple box.
* **Step 4 (Corrected):** Shows proper simplification: `[5x - x² = 3x - 0.5x²]` -> `[5x - x² - 3x + 0.5x² = 0]` -> `[2x - 0.5x² = 0]` -> `[0.5x² - 2x = 0]` -> `[x(0.5x - 2) = 0]`. A green checkmark is present.
* **Step 5:** Solves for x: `[x(0.5x - 2) = 0]` gives solutions `x = 0` and `0.5x - 2 = 0` -> `0.5x = 2` -> `x = 4`. A green checkmark is present.
* **Step 8:** States the final answer: `( 4 ) cm.` A green checkmark is present.
### Detailed Analysis
* **Graph Data Points:** The two functions are:
* Mosquitoes (f): `y = 5x - x²`
* Bats (g): `y = 3x - 0.5x²`
* **Intersection Points:** The graph visually confirms two intersection points:
1. At `x = 0` cm rainfall, `y = 0` for both.
2. At `x = 4` cm rainfall, substituting into either equation gives `y = 4` (e.g., `5(4) - (4)² = 20 - 16 = 4`).
* **Algebraic Error in Original Solution:** The error occurs in the rearrangement step. The original solution incorrectly manipulates terms, leading to nonsensical expressions like `5x - =x`. The core mistake is a failure to correctly combine like terms after moving all terms to one side of the equation.
* **Correct Algebraic Path (Refined Solution):**
1. Set equations equal: `5x - x² = 3x - 0.5x²`
2. Move all terms to left: `5x - x² - 3x + 0.5x² = 0`
3. Combine like terms: `(5x - 3x) + (-x² + 0.5x²) = 0` -> `2x - 0.5x² = 0`
4. Factor: `x(2 - 0.5x) = 0` or `0.5x(4 - x) = 0`
5. Solutions: `x = 0` or `x = 4`.
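These algebraic steps can be verified numerically. A short Python sketch (the function names `f` and `g` follow the curve labels in the figure):

```python
# Mosquito and bat population models from the Figure 10 graph.
def f(x):
    return 5 * x - x ** 2        # mosquitoes: y = 5x - x^2

def g(x):
    return 3 * x - 0.5 * x ** 2  # bats: y = 3x - 0.5x^2

# Setting f(x) = g(x) and simplifying gives 0.5*x*(4 - x) = 0,
# so the intersection points are x = 0 and x = 4.
roots = [0.0, 4.0]
for x in roots:
    assert abs(f(x) - g(x)) < 1e-9  # both curves meet at each root

print([(x, f(x)) for x in roots])  # [(0.0, 0.0), (4.0, 4.0)]
```

Both roots agree with the graph: the curves intersect at the origin and at (4, 4).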
### Key Observations
1. **Visual-Numerical Consistency:** The graphical intersection at `x=4` is perfectly confirmed by the algebraic solution.
2. **Pedagogical Structure:** The image is designed to teach by contrasting a common error with the correct method, supported by visual evidence from the graph.
3. **Error Localization:** The mistake is isolated to a single algebraic manipulation step (Step 4 in the original), while the problem setup (Steps 1-3) and final solving logic (Step 5 onward) are correct in intent.
4. **Annotation Use:** Color-coding (blue for original, purple for refined/analysis) and symbols (red X, green checkmarks) are used effectively to guide the viewer's attention to correct and incorrect elements.
### Interpretation
This diagram serves as a case study in mathematical problem-solving, emphasizing the importance of careful algebraic manipulation. The data demonstrates that the populations of mosquitos and bats, as modeled by the given quadratics, are equal at two specific rainfall levels: 0 cm and 4 cm. The "another rainfall amount" sought by the problem is **4 cm**.
The relationship between the components is instructional: the graph provides visual intuition and a means to verify the algebraic result. The GM-PRM analysis explicitly links the algebraic error to a failure to "reflect the intersection points seen in the image," highlighting the value of using graphical representation as a sanity check for symbolic work. The refined solution provides the correct pathway, showing that factoring is the efficient method to find the roots after proper simplification. The outlier here is not a data point, but the flawed algebraic step itself, which is the central focus of the educational content.
</details>
Figure 10: The illustration of a representative example before and after applying GM-PRM. In particular, GM-PRM first judges the steps of the original solution generated by the policy model. It then finds that the fourth step is incorrect with respect to reasoning logic and refines that step to generate the correct version. The corrected steps are fed back into the policy model to generate the refined solution, which finally yields the correct answer.