# VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
## Abstract
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families under Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three families of MLLMs at four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model outperforms Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct VisualPRM400K, a multimodal process supervision dataset built with an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire further research and contribute to the development of MLLMs. Our model, data, and benchmark are released on this page.
🖂 Corresponding Author: wangwenhai@pjlab.org.cn
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Bar Chart: Pass@1 Performance Comparison of Policy Models
### Overview
This bar chart compares the overall performance of different policy models under three settings: "Pass@1" (green bars), "BoN with InternVL2.5-8B" (salmon/reddish bars), and "BoN with VisualPRM-8B (ours)" (blue bars). Overall performance is plotted on the y-axis against the policy models on the x-axis.
### Components/Axes
* **X-axis:** Policy Model. Categories include: MiniCPM-V2.6, QwenVL2.5-7B, InternVL2.5-8B, InternVL2.5-26B, InternVL2.5-38B, InternVL2.5-78B.
* **Y-axis:** Overall Performance. Scale ranges from 25 to 55.
* **Legend:** Located in the top-left corner.
* Green: Pass@1
* Salmon/Reddish: BoN with InternVL2.5-8B
* Blue: BoN with VisualPRM-8B (ours)
### Detailed Analysis
The chart consists of six groups of three bars, one group per policy model. The height of each bar indicates the overall performance under the corresponding setting.
* **MiniCPM-V2.6:**
* Pass@1: Approximately 29.5
* BoN with InternVL2.5-8B: Approximately 28.6
* BoN with VisualPRM-8B: Approximately 37.5
* **QwenVL2.5-7B:**
* Pass@1: Approximately 41.4
* BoN with InternVL2.5-8B: Approximately 41.6
* BoN with VisualPRM-8B: Approximately 45.1
* **InternVL2.5-8B:**
* Pass@1: Approximately 32.8
* BoN with InternVL2.5-8B: Approximately 33.2
* BoN with VisualPRM-8B: Approximately 41.2
* **InternVL2.5-26B:**
* Pass@1: Approximately 36.9
* BoN with InternVL2.5-8B: Approximately 39.1
* BoN with VisualPRM-8B: Approximately 45.8
* **InternVL2.5-38B:**
* Pass@1: Approximately 44.4
* BoN with InternVL2.5-8B: Approximately 44.9
* BoN with VisualPRM-8B: Approximately 50.7
* **InternVL2.5-78B:**
* Pass@1: Approximately 46.0
* BoN with InternVL2.5-8B: Approximately 46.4
* BoN with VisualPRM-8B: Approximately 51.9
**Trends:**
* **Pass@1:** Generally increases with larger policy models, peaking at 46.0 for InternVL2.5-78B.
* **BoN with InternVL2.5-8B:** Tracks Pass@1 closely, yielding only marginal changes over the policy model alone.
* **BoN with VisualPRM-8B:** Demonstrates a consistent upward trend, achieving the highest performance values across most models, especially at InternVL2.5-38B and InternVL2.5-78B.
### Key Observations
* "BoN with VisualPRM-8B (ours)" consistently outperforms the other two approaches, particularly with larger policy models.
* The performance of "BoN with InternVL2.5-8B" stays close to "Pass@1" and remains consistently below "BoN with VisualPRM-8B".
* Substantial gains from "BoN with VisualPRM-8B" are observed across all policy models, including the highly capable InternVL2.5-78B.
* Using InternVL2.5-8B as the critic model yields little to no improvement over "Pass@1".
### Interpretation
The data suggests that the "BoN with VisualPRM-8B" approach significantly enhances the performance of policy models, especially as model size increases, indicating that a dedicated process reward model is an effective critic for Best-of-N selection. In contrast, "BoN with InternVL2.5-8B" yields only marginal gains, consistent with the observation that general-purpose MLLMs struggle to serve as critic models. The chart demonstrates the effectiveness of the proposed "BoN with VisualPRM-8B" method, positioning it as clearly superior to both the "Pass@1" baseline and "BoN with InternVL2.5-8B".
</details>
Figure 1: The overall Best-of-8 evaluation results across seven multimodal reasoning benchmarks with different critic models. Our VisualPRM greatly enhances the overall performance, while InternVL2.5-8B struggles to be an effective critic model.
With the remarkable success of Large Language Models (LLMs) [75, 76, 21, 5, 71, 10, 9, 1, 4] in natural language processing, Multimodal Large Language Models (MLLMs) [80, 36, 41, 42, 6, 84, 83, 14, 16, 15, 58, 61, 89] have also achieved significant advancements across various vision-language tasks. Despite their strong performance in perception and recognition, a large gap remains in reasoning capabilities between open-source and proprietary models. A series of studies have explored methods to enhance reasoning abilities, focusing on the perspectives of data collection and construction [57, 74, 38, 43], offline preference optimization [59, 82, 32], and online reinforcement learning [66, 24, 2, 26]. Additionally, another line of research [68, 20, 94, 79] investigates utilizing Test-Time Scaling (TTS) to enhance the reasoning abilities of LLMs. This approach requires the policy model to generate multiple response candidates and select the best one, based on the quality estimation of a critic model, thereby improving the response quality at the cost of higher inference time. However, TTS for MLLMs remains largely unexplored.
This work investigates the application of TTS for MLLMs, focusing on the Best-of-N (BoN) evaluation strategies. The challenges of adapting TTS for MLLMs involve: (1) Lack of effective critic models. In BoN evaluation, a critic model is required to estimate the quality of each response candidate. However, as shown in Figure 1, existing open-source MLLMs struggle to serve as critic models, leading to marginal improvements compared to models without TTS. This limitation stems from the lack of sufficient critic data in their training corpus. (2) Lack of evaluation benchmarks for multimodal critic models. The effectiveness of TTS heavily depends on the performance of the critic model. However, directly evaluating critics under BoN settings poses two key issues. First, BoN evaluation is computationally expensive: although the focus is on the critic model, the policy model must generate $N$ reasoning processes, and the majority of the computational cost arises from the policy model. Second, BoN performance also depends on the policy model, making it difficult to compare critic models when they are paired with different policy models.
To solve these challenges, we first introduce VisualPRM400K, a dataset comprising approximately 400K multimodal process supervision data. Each sample includes an image, a question, a step-by-step solution, and correctness annotations for each step. Specifically, we collect question prompts from MMPR v1.1 [82] and then generate process correctness using an automatic data pipeline [79]. This pipeline samples multiple continuations starting from a certain step and computes the expected accuracy of that step as the average accuracy of its continuations.
To facilitate the evaluation of multimodal critic models, we introduce VisualProcessBench, a benchmark for evaluating PRMs and MLLMs in detecting erroneous steps in multimodal reasoning tasks. This benchmark includes 2,866 samples with 26,950 human-annotated step-wise correctness labels. Each sample includes a multimodal reasoning question, a step-by-step solution, and correctness labels for each step. To ensure annotation accuracy, we employ human experts with at least a university degree to manually assess the correctness of each step. Unlike prior benchmarks [96, 39], which require identifying only the first erroneous step, VisualProcessBench challenges models to detect all errors within a given solution. This adjustment aligns with recent advancements in model reflection abilities, helping to reduce false negatives in evaluations. Evaluation results reveal that existing open-source MLLMs struggle to accurately assess step-wise correctness, highlighting the need for improved multimodal critic models.
Building upon the dataset and benchmark, we develop VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, to serve as the critic model in BoN evaluation. Each training sample is formulated as a multi-turn chat. The first turn includes the image, the question, and the first solution step, while each subsequent turn presents a new step. The model is trained to predict the correctness of the given step at each turn. Experimental results demonstrate that VisualPRM enhances MLLM reasoning across different model families and scales. Specifically, VisualPRM improves the overall reasoning performance of MiniCPM-V2.6, QwenVL2.5-7B, InternVL2.5-8B, and InternVL2.5-78B by 8.0, 3.7, 8.4, and 5.9 points, respectively, across seven multimodal reasoning benchmarks [90, 50, 78, 93, 99, 60, 87]. Additionally, we compare PRMs with Outcome Reward Models and Self-Consistency in BoN evaluation, finding that PRMs consistently outperform both approaches.
In summary, our main contributions are as follows:
(1) We introduce VisualPRM400K, a dataset comprising approximately 400K multimodal process supervision data. Building upon this dataset, we develop VisualPRM, an advanced multimodal PRM to serve as the critic model in the BoN evaluation.
(2) We construct VisualProcessBench, a benchmark designed to measure the abilities of PRMs and MLLMs to identify erroneous steps in multimodal reasoning tasks. This benchmark comprises 2,866 samples with a total of 26,950 human-annotated step-wise correctness labels.
(3) Through extensive experiments, we demonstrate that PRMs can serve as effective critic models for test-time scaling of MLLMs. Specifically, VisualPRM enhances the overall multimodal reasoning performance of MiniCPM-V2.6, QwenVL2.5-7B, InternVL2.5-8B, and InternVL2.5-78B by 8.0, 3.7, 8.4, and 5.9 points, respectively, across seven multimodal reasoning benchmarks. Furthermore, our results show that PRMs consistently outperform both ORMs and SC in BoN evaluation. Additionally, experiments on VisualProcessBench reveal that existing open-source MLLMs struggle to accurately assess the correctness of each step.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Educational Material: Geometry Problem Sets
### Overview
The image presents two geometry problem sets, likely from an educational resource. Each set includes a diagram, a question, and a step-by-step solution. The first problem involves finding an angle in a quadrilateral, while the second involves finding an angle formed by intersecting lines and a transversal. The solutions include reasoning steps and calculations.
### Components/Axes
The image is divided into two distinct sections, each containing:
* **Diagram:** A geometric figure illustrating the problem.
* **Question:** A textual question related to the diagram.
* **Solution:** A numbered list of steps detailing the solution process, including reasoning and calculations.
* **Step Annotations:** Each step is annotated with a confidence score (e.g., `m=0.75`, `m=0.625`, `m=0.0`).
### Detailed Analysis or Content Details
**Problem 1: Finding Angle S**
* **Diagram:** A quadrilateral QRST with angles labeled as follows:
* Angle Q: (2x + 5)°
* Angle R: (2x + 7)°
* Angle T: x°
* Angle S: ?
* **Question:** Find m angle S.
* **Solution:**
* Step 1 (m=0.75): "To find (m angle S) in the given quadrilateral, we'll use the property that the sum of the interior angles of any quadrilateral is (360°)."
* Step 2 (m=0.625): "Write the equation for the sum of the angles: ((2x + 5)° + x° + (2x + 7)° + x° = 360°)"
* Step 3 (m=0.625): "Simplify the equation: (2x + 5 + x + 2x + 7 + x = 360)"
* Step 4 (m=0.5): "(6x + 12 = 360)"
* Step 5 (m=0.5): "Solve for (x): (6x = 360 - 12)"
* Step 6 (m=0.5): "(6x = 348)"
* Step 7 (m=0.5): "(x = 58)"
* Step 8 (m=0.0): "Substitute (x) back into (m angle S) which is given as (x): (m angle S = x = 58°)"
* Final answer: 58
**Problem 2: Finding Angle 2**
* **Diagram:** Two parallel lines AB and CD intersected by a transversal EG. Angle 1 is labeled, and Angle 2 is the target angle.
* **Question:** As shown in the figure, AB parallel CD, EG bisects angle BEF, then angle 2 is equal to
* Choices: A: 50° B: 60° C: 65° D: 90°
* **Solution:**
* Step 1 (Neutral): "Given Information: (AB parallel CD) (AB is parallel to CD), (EG bisects (angle BEF)), (angle 1 = 50°)."
* Step 2 (Correct): "Understanding the Geometry: Since (AB parallel CD), the transversal (EG) creates alternate interior angles and corresponding angles."
* Step 3 (Correct): "(EG) bisects (angle BEF), meaning ((angle BEG = angle GEF))."
* Step 4 (Correct): "Relationship Between Angles: (angle BEF = angle BEG + angle GEF)."
* Step 5 (Correct): "Since (EG) bisects (angle BEF), ((angle BEG = angle GEF) = 50°)."
* Step 6 (Correct): "Finding (angle 2): (angle 2) and (angle GEF) are congruent because (AB parallel CD) and (EG) is a transversal."
* Step 7 (Incorrect): "Therefore, (angle 2 = angle GEF = 50°)."
* Step 8 (Incorrect): "Final answer: The correct option is A."
### Key Observations
* The confidence scores (m values) associated with the solution steps in the first problem decrease as the solution progresses, suggesting increasing uncertainty in later steps.
* The second problem's solution contains an error in the final step. While the reasoning is sound up to step 6, the conclusion that angle 2 is 50° is incorrect. Angle 2 is actually 130 degrees.
* The solutions are presented in a very detailed, step-by-step manner, suitable for educational purposes.
### Interpretation
The image showcases a learning approach that emphasizes breaking down complex problems into smaller, manageable steps. The inclusion of confidence scores suggests an attempt to evaluate the reliability of each step, potentially for automated tutoring systems or self-assessment. The error in the second problem highlights the importance of careful verification and critical thinking, even when following a seemingly logical process. The problems demonstrate fundamental concepts in geometry, including the properties of quadrilaterals, parallel lines, transversals, and angle bisectors. The use of degree symbols (°\circ) consistently indicates that the angles are measured in degrees. The notation `m=...` likely represents a measure of confidence or correctness assigned to each step by an automated system. The problems are designed to test a student's understanding of geometric principles and their ability to apply them to solve problems.
</details>
Figure 2: Data examples in VisualPRM400K and VisualProcessBench. For VisualPRM400K, we generate the data using an automatic data pipeline. The key idea is to estimate the expected accuracy $mc_{i}$ of the given step $s_{\leq i}$ based on Monte Carlo sampling and consider the step correct if $mc_{i}>0$ . During the training process of VisualPRM, the data is formulated as multi-turn conversations and the model is required to predict the correctness of each step conditioned on the image, question, and previous steps. For VisualProcessBench, we collect questions from existing multimodal reasoning benchmarks [90, 78, 93, 99, 60] and generate the solutions using leading MLLMs [58, 4, 15, 82, 72]. Based on these questions and solutions, we employ a team of human experts with at least a university degree to manually annotate the correctness of each step in the solutions.
## 2 Related Work
Multimodal Large Language Models. A wide range of efforts has been made to advance the development of MLLMs, including improvements in model architecture, data construction, and training algorithms. From an architectural perspective, several studies [41, 42, 89, 15, 16, 84, 83, 45, 85, 7, 89] employ connectors to align visual embeddings from Vision Foundation Models (VFMs) [14, 91] with the latent space of LLMs [5, 75, 76, 71], achieving promising performance. Another series of works [3, 21, 73, 81] extends pre-trained LLMs with additional layers to fuse visual features, reducing the number of required visual tokens while introducing extra training cost. In terms of data construction, recent studies have made significant progress [63, 98, 34, 38, 43, 82, 95]. For example, OmniCorpus [38] offers a noisy but large-scale multimodal corpus for pre-training, while MMInstruct [43] provides an open-source, high-quality instruction-tuning dataset. Additionally, MMPR [82] constructs a preference dataset focusing on multimodal reasoning abilities. Regarding training algorithms, the InternVL2.5 series [15, 82] proposes square loss and Mix Preference Optimization to enhance MLLM capabilities. Despite these advancements, existing works primarily focus on the training process of MLLMs, leaving Test-Time Scaling (TTS) for MLLMs largely underexplored. In this work, we investigate TTS applications for MLLMs, specifically focusing on the Best-of-N evaluation to improve multimodal reasoning performance.
Process Reward Models. Reward models play a crucial role in Reinforcement Learning (RL) [64, 66, 2, 26] and TTS [68, 79, 20, 51]. Outcome Reward Models (ORMs) [55, 92, 77] directly assign an overall score to the given response. In contrast, Process Reward Models (PRMs) first estimate the quality of each step in the given response and then aggregate them into a final score. PRM800K [39] is the first open-source process supervision dataset, entirely annotated by human experts. To reduce annotation costs, MathShepherd [79] and OmegaPRM [51] introduce a Monte Carlo sampling-based data pipeline to automatically estimate the quality of each step. Despite these advancements in natural language processing, multimodal PRMs remain largely underexplored. In this work, we introduce VisualPRM400K, the first multimodal process supervision dataset, and develop VisualPRM, a multimodal PRM trained on this dataset.
Benchmarks for Reward Models. The evaluation of reward models (RMs) is a crucial research topic. A series of benchmarks [33, 37, 44] have been proposed to assess the effectiveness of RMs, typically formulated as a binary preference judgment task. Building on this, subsequent work [97] extends the evaluation settings and includes both pairwise and Best-of-N evaluations, providing a more comprehensive evaluation of RM performance. With the rapid advancement of PRMs, a series of benchmarks [96, 69] have been introduced to evaluate their step-wise judgment capabilities. Despite these developments, there remains a lack of a multimodal process benchmark. To bridge this gap and support the development of multimodal PRMs, we introduce VisualProcessBench, a benchmark designed to evaluate the ability of PRMs and MLLMs to detect erroneous steps in multimodal reasoning tasks.
## 3 Method
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Problem Solving Steps & Accuracy Assessment
### Overview
The image depicts a diagram illustrating a problem-solving process, specifically for a geometry problem involving finding an angle in a triangle. It shows a sequence of steps, along with associated metrics like 'mc' (likely representing a measure of correctness) and accuracy assessments ('Good', 'Bad', 'Tie'). The diagram also highlights two different approaches to problem-solving: Value-based PRM and Advantage-based PRM.
### Components/Axes
The diagram is structured horizontally, representing a sequence of steps. It can be divided into three main sections:
1. **Question & Solution (Top-Right):** Contains the problem statement and final answer.
2. **Steps (Top-Left to Center):** Shows the individual steps taken to solve the problem.
3. **Accuracy Assessment (Bottom):** Evaluates the correctness of each step and the overall solution.
The diagram includes the following labels:
* **Question:** "Find m∠S."
* **Step-0:** "To find (m∠S)..."
* **Step-1:** "Write the equation for..."
* **Step-4:** "Substitute (x) back..."
* **Step-5:** "Final answer: 58"
* **Expected Accuracy:** Label for the bottom-right section.
* **Value-based PRM:** Label for the bottom-left section.
* **Advantage-based PRM:** Label for the bottom-right section.
* **mc = 0.75**
* **mc = 0.5**
* **mc = 0.0**
* **mc = 0.0**
* **Correct (+)**
* **Correct (+)**
* **Incorrect (-)**
* **Incorrect (-)**
* **Good (+)**
* **Bad (-)**
* **Bad (-)**
* **Tie (=)**
* Quadrilateral QRST with labeled angles: ∠Q = (2x + 5)°, ∠R = (2x + 7)°, ∠T = x°, ∠S = ?
### Detailed Analysis or Content Details
The diagram shows a problem-solving process with five steps.
* **Step 0:** The initial step involves setting up the problem to find the measure of angle S. The associated 'mc' value is 0.75, and the assessment is "Correct (+)" and "Good (+)".
* **Step 1:** The next step involves writing the equation. The 'mc' value is 0.5, and the assessment is "Correct (+)" and "Bad (-)".
* **Step 4:** A step is skipped, and the diagram jumps to Step 4, which involves substituting the value of 'x' back into the equation. The 'mc' value is 0.0, and the assessment is "Incorrect (-)" and "Bad (-)".
* **Step 5:** The final step provides the answer: 58. The 'mc' value is 0.0, and the assessment is "Incorrect (-)" and "Tie (=)".
* The quadrilateral has angles labeled as follows: ∠Q = (2x + 5)°, ∠R = (2x + 7)°, and ∠T = x°. The angle ∠S is the target to be found.
The diagram also presents two problem-solving approaches: Value-based PRM and Advantage-based PRM.
### Key Observations
* The 'mc' values decrease as the steps progress, indicating a decline in correctness.
* The accuracy assessments show a shift from "Good" to "Bad" and "Tie", suggesting the solution deteriorates as the steps are executed.
* The jump from Step 1 to Step 4 indicates missing steps in the diagram.
* The final answer (58) is associated with an 'mc' value of 0.0 and an assessment of "Incorrect (-)" and "Tie (=)", suggesting the answer is likely wrong or incomplete.
### Interpretation
The diagram illustrates a flawed problem-solving process. While the initial steps show some promise (mc = 0.75, Correct), the subsequent steps lead to incorrect results. The decreasing 'mc' values and the shift in accuracy assessments highlight the errors introduced during the process. The missing steps (between Step 1 and Step 4) further contribute to the uncertainty and potential inaccuracies.
The comparison between Value-based PRM and Advantage-based PRM suggests that these are two different strategies for evaluating or approaching the problem-solving process. The diagram doesn't provide enough information to determine which approach is superior, but it implies that the current process, as depicted, is not yielding accurate results. The final "Tie" assessment suggests that the solution is not definitively wrong, but also not entirely correct, possibly due to incomplete information or a flawed methodology. The diagram serves as a cautionary example of how a seemingly logical sequence of steps can lead to an incorrect solution if errors are introduced along the way.
</details>
Figure 3: Different modeling methods for PRMs. PRMs are developed to estimate the quality of each step in a given solution. For value-based PRMs, the quality of a certain step is determined by its expected accuracy $mc_{i}$ , where a step is considered correct if $mc_{i}>0$ . For advantage-based PRMs, the quality of a certain step is determined by the improvement of $mc_{i}$ over $mc_{i-1}$ , where a step is considered good if $mc_{i}-mc_{i-1}>0$ . During the training stage, the output space of PRMs is discretized into specific tokens, while during the inference stage, we compute the step score as the weighted sum of the generation probability for these discretized tokens.
During Best-of-N (BoN) evaluation, a critic model is required to estimate the quality of each response candidate. In this work, we formulate the critic model as a Process Reward Model (PRM). To develop a multimodal PRM, we first construct VisualPRM400K, a dataset comprising about 400K multimodal process supervision data, as detailed in Section 3.1. We then describe our PRM’s modeling approach in Section 3.2. Furthermore, to support the evaluation of critic models, we establish VisualProcessBench to measure the abilities of critic models to detect erroneous steps in multimodal reasoning, as introduced in Section 3.3.
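The BoN procedure described above can be sketched as follows. Here `generate` and `score` are hypothetical callables standing in for the policy model and the PRM critic, which the paper leaves abstract at this point:

```python
def best_of_n(image, question, generate, score, n=8):
    """Best-of-N selection: sample n candidate responses from the policy
    model and return the one the critic (e.g., a PRM) scores highest.

    `generate(image, question)` and `score(image, question, candidate)`
    are stand-ins for the policy-model and reward-model interfaces.
    """
    candidates = [generate(image, question) for _ in range(n)]
    scores = [score(image, question, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```

Note that the policy model dominates the computational cost: it must produce all $N$ candidates, while the critic only scores them.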
### 3.1 VisualPRM400K
Definition. As shown in Figure 2, each data sample in our VisualPRM400K consists of an image $I\in\mathcal{I}$ , a question $q\in\mathcal{Q}$ , a step-by-step solution $s=\{s_{0},s_{1},\cdots,s_{n}\}\in\mathcal{S}$ , and the expected accuracy annotation $mc=\{mc_{0},mc_{1},\cdots,mc_{n}\},mc_{i}\in\mathbb{R}_{\geq 0}$ for each step, where $n$ is the number of steps of a certain solution and $mc_{i}$ denotes the expected accuracy of step $s_{i}$ . The image sets $\mathcal{I}$ and question sets $\mathcal{Q}$ are collected from MMPR v1.1 [82], while the step-by-step solutions $\mathcal{S}$ are sampled using InternVL2.5 series models [15, 82].
Process Supervision Generation. Given an image $I$ , a question $q$ , and a solution $s=\{s_{0},s_{1},\cdots,s_{n}\}$ , we annotate the correctness of each step $s_{i}$ using an automatic data pipeline. The key idea is to estimate the expected accuracy of given steps $s_{\leq i}$ based on Monte Carlo sampling. Specifically, the model is required to complete the solution as follows:
$$
\tilde{s}_{>i}\sim M(\tilde{s}_{>i}\mid I,q,s_{\leq i}), \tag{1}
$$
where $\tilde{s}_{>i}$ is the completion of $s_{\leq i}$. The expected accuracy of $s_{i}$ is then defined as:
$$
mc_{i}=\frac{\text{num(correct completions)}}{\text{num(sampled completions)}}. \tag{2}
$$
Notably, to reduce data construction costs, we cap the number of steps at 12 and evenly merge adjacent steps when a solution exceeds this threshold.
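The Monte Carlo annotation of Equations (1) and (2) can be sketched as below. The callables `complete` and `is_correct` are hypothetical stand-ins for sampling a continuation from the policy model and checking its final answer, which the pipeline leaves abstract:

```python
def annotate_steps(image, question, steps, complete, is_correct, k=16):
    """Estimate the expected accuracy mc_i of each step prefix s_<=i
    (Eqs. 1-2): sample k continuations conditioned on the prefix and
    count how many reach the correct final answer.

    `complete(image, question, prefix)` samples one continuation;
    `is_correct(completion)` checks its final answer.
    """
    mc = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        completions = [complete(image, question, prefix) for _ in range(k)]
        mc.append(sum(1 for c in completions if is_correct(c)) / k)
    return mc  # a step is later labeled correct if mc[i] > 0
```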
Statistics. During the construction process, we sample $4$ solutions for each image-question pair and split each of them into at most $12$ steps. For each step, we sample $16$ continuations and compute $mc_{i}$ according to these continuations. The resulting dataset comprises approximately 400K samples and 2 million steps with process supervision. Each response averages 126.9 words and 5.6 steps, while each step averages 22.6 words. Among these steps, about 10% are incorrect. Despite the imbalanced distribution of correct and incorrect steps, our PRM demonstrates promising performance, as shown in Section 4.
### 3.2 VisualPRM
Overview. During the training process, we formulate the process supervision problem as a multi-turn chat task so that we can effectively leverage the generation ability of MLLMs. The image $I$ , question $q$ , and the first step $s_{0}$ of the solution to this question are included in the first turn and a new step is presented in each subsequent turn. The model is required to predict the quality of the given step in each turn as follows:
$$
y_{i}\sim M(y_{i}\mid I,q,s_{\leq i}), \tag{3}
$$
where $y_{i}$ denotes the quality of $i$ -th step.
For value-based PRMs, the quality of a certain step is determined by its expected accuracy $mc_{i}$, which is analogous to the definition of the value function in reinforcement learning. Following MathShepherd [79, 20], we require the model to predict the correctness $c_{i}\in\{+,-\}$ of the given step rather than the exact value of $mc_{i}$; the $i$-th step is considered correct if $mc_{i}>0$. We also experiment with a stricter threshold to reduce false-positive steps, but find that it degrades PRM performance, as shown in Section 7. Notably, unlike previous works [79, 39, 20], which supervise only up to the first incorrect step, we always supervise all steps.
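The $mc_{i}>0$ rule amounts to a one-line discretization of the expected accuracies into training labels; the following is a minimal sketch, not the training code itself:

```python
def value_labels(mc):
    """Discretize expected accuracies into value-based training labels:
    step i is '+' (correct) if mc_i > 0, otherwise '-' (incorrect)."""
    return ["+" if m > 0 else "-" for m in mc]
```

For example, the step annotations from Figure 2 (0.75, 0.5, ..., 0.0) would yield `+` for every step except the last.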
For advantage-based PRMs, the quality of a certain step is determined by the improvement of $mc_{i}$ over $mc_{i-1}$ , which is analogous to the definition of the advantage function in reinforcement learning. Similar to value-based PRMs, the quality space is discretized into predefined values $\{+,=,-\}$ , meaning that the $i$ -th step $s_{i}$ results in a superior, comparable, or inferior situation.
During the inference stage, we first compute the score of each step and then merge the step scores to obtain the response score. Specifically, the score for each step is defined as the weighted sum of the generation probabilities for the discretized scores. For value-based PRMs, the weights for $\{+,-\}$ are $\{1,0\}$. For advantage-based PRMs, the weights for $\{+,=,-\}$ are $\{1,0,-1\}$. Unless otherwise specified, we average the step scores to obtain the response score.
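A minimal sketch of this inference-time scoring, assuming the per-step generation probabilities of the discretized tokens have already been extracted from the model:

```python
def step_score(token_probs, weights):
    """Score one step as the weighted sum of the generation probabilities
    of the discretized correctness tokens."""
    return sum(weights[t] * p for t, p in token_probs.items())

def response_score(per_step_probs, weights={"+": 1.0, "-": 0.0}):
    """Average the step scores to obtain the response score. The default
    weights correspond to a value-based PRM; an advantage-based PRM would
    use {"+": 1.0, "=": 0.0, "-": -1.0}."""
    scores = [step_score(p, weights) for p in per_step_probs]
    return sum(scores) / len(scores)
```

For instance, a two-step solution with `+`-probabilities 0.9 and 0.5 under a value-based PRM receives a response score of 0.7.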
### 3.3 VisualProcessBench
| Total Samples | 2866 |
| --- | --- |
| - MMMU [90] | 267 |
| - MathVision [78] | 712 |
| - MathVerse [93] | 1026 |
| - DynaMath [99] | 570 |
| - WeMath [60] | 291 |
| Source Solutions | 2866 |
| - GPT-4o [58] | 870 |
| - Claude-3.5-Sonnet [4] | 865 |
| - QvQ-72B-Preview [72] | 825 |
| - InternVL2.5-78B [15] | 306 |
| Total Steps | 26950 |
| - Correct Steps | 16585 |
| - Incorrect Steps | 7691 |
| - Neutral Steps | 2674 |
| Query Word Length Quartiles | (22, 24, 50) |
| Response Word Length Quartiles | (137, 193, 552) |
| Step Word Length Quartiles | (13, 31, 67) |
| Average Number of Steps per Solution | 9.4 |
Table 1: Statistics of VisualProcessBench.
Definition. Each sample in our benchmark consists of a multimodal reasoning question, a step-by-step solution, and correctness annotations for each step. Considering that recent models begin to demonstrate reflection abilities to rectify their own reasoning process, the evaluation setting used in previous works [96, 39], which only requires the model to find the first erroneous step, may lead to a false negative estimation. Therefore, our benchmark requires the model to identify all erroneous steps in the given solution instead of only the first erroneous step.
Data Source. Our benchmark focuses on multimodal reasoning tasks, collecting images and questions from existing representative multimodal reasoning benchmarks, including MMMU [90], MathVision [78], MathVerse [93], DynaMath [99], and WeMath [60]. Given these questions, we generate step-by-step solutions using leading MLLMs, including GPT-4o [58], Claude-3.5-Sonnet [4], Gemini-2.0-Flash [70], QvQ-72B-Preview [72], and InternVL2.5-78B [15]. The solutions are sampled from different MLLMs to ensure their diversity.
Step Correctness Annotation. We employ a team of human experts with at least a university degree to manually annotate the correctness of each step in the solutions. Specifically, 13 people worked for 3 days, resulting in a workload of 39 person-days. The cost per person-day is approximately 37 dollars. During the annotation process, annotators are provided with the image, question, ground truth answer, and each step of the solution. Their task is to assign each step in the solution a label of positive, negative, or neutral, as illustrated in Figure 2. A positive label indicates that the step is correct, while a negative label signifies an incorrect step. The neutral label is assigned to steps that do not involve any reasoning process or provide no additional information. To ensure the annotation quality, annotators are permitted to skip questions they do not understand. During the annotation process, our dataset is divided into 10 splits, each containing approximately 300 samples. For each split, the authors of this paper manually review about 10% of the samples. Splits with erroneous annotations are sent back for re-annotation. See Section 10 for more data examples.
Statistics. As shown in Table 1, our benchmark comprises 2866 samples. To enhance the diversity of our evaluation samples, we gather questions and solutions from a wide range of benchmarks and models while carefully regulating their distribution. The statistics of step distribution are presented in Section 8.
Metrics. In this work, we use macro F1 scores to compare model performance, aiming to mitigate the impact of the imbalanced distribution between correct and incorrect steps. Specifically, we first compute the F1 scores separately for correct and incorrect steps and then take their average to obtain the overall score.
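The metric above can be sketched in code as follows. This is a minimal illustration, not the exact evaluation script; `preds` and `labels` are hypothetical lists of boolean step-level judgments (True = correct step).

```python
def f1(preds, labels, positive):
    # F1 for one class: treat `positive` (True or False) as the target label.
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(preds, labels):
    # Average the F1 of correct (True) and incorrect (False) steps,
    # so that the rarer class contributes equally to the overall score.
    return 0.5 * (f1(preds, labels, True) + f1(preds, labels, False))
```

Because the two class-wise F1 scores are averaged with equal weight, a model that labels nearly every step as correct cannot score well simply by exploiting the class imbalance.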
## 4 Experiments
In this section, we first employ VisualPRM to evaluate various MLLMs using BoN evaluation strategies in Section 4.1, demonstrating that PRMs can significantly enhance the reasoning abilities of MLLMs. Next, we evaluate our VisualPRM and other leading MLLMs on VisualProcessBench in Section 4.2. Finally, the ablation studies are presented in Section 4.3 and Section 7.
| Model | MMMU | MathVista | MathVision | MathVerse-VO | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | |
| GPT-4o [58] | 70.7 | 60.0 | 31.2 | 40.6 | 34.5 | 45.8 | 52.8 | 47.9 |
| Gemini-2.0-Flash [61] | 69.9 | 70.4 | 43.6 | 47.8 | 42.1 | 47.4 | 52.3 | 53.4 |
| Claude-3.5-Sonnet [4] | 66.4 | 65.3 | 35.6 | 46.3 | 35.7 | 44.0 | 60.4 | 50.5 |
| Open-source Models | | | | | | | | |
| MiniCPM-V2.6-8B [89] | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| +VisualPRM | 56.8 | 65.7 | 24.7 | 35.8 | 11.2 | 31.0 | 37.4 | 37.5 |
| | +7.0 | +4.9 | +1.3 | +16.9 | +1.4 | +14.6 | +9.8 | +8.0 |
| Qwen2.5-VL-7B [7] | 55.0 | 67.8 | 25.4 | 41.1 | 21.0 | 35.2 | 44.1 | 41.4 |
| +VisualPRM | 58.6 | 70.3 | 31.3 | 44.3 | 23.0 | 39.8 | 48.3 | 45.1 |
| | +3.6 | +2.5 | +5.9 | +3.2 | +2.0 | +4.6 | +4.2 | +3.7 |
| InternVL2.5-8B [15] | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| +VisualPRM | 60.2 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.2 |
| | +4.0 | +4.0 | +8.7 | +13.0 | +8.6 | +13.0 | +7.8 | +8.4 |
| InternVL2.5-26B [15] | 60.7 | 68.2 | 23.4 | 24.0 | 11.4 | 30.9 | 39.6 | 36.9 |
| +VisualPRM | 63.9 | 73.1 | 29.6 | 39.1 | 23.2 | 40.8 | 51.0 | 45.8 |
| | +3.2 | +4.9 | +6.2 | +15.1 | +11.8 | +9.9 | +11.4 | +8.9 |
| InternVL2.5-38B [15] | 63.9 | 71.9 | 32.2 | 36.9 | 20.0 | 38.3 | 47.9 | 44.4 |
| +VisualPRM | 69.0 | 73.9 | 35.2 | 46.7 | 30.5 | 46.2 | 53.7 | 50.7 |
| | +5.1 | +2.0 | +3.0 | +9.8 | +10.5 | +7.9 | +5.8 | +6.3 |
| InternVL2.5-78B [15] | 70.0 | 72.3 | 32.2 | 39.2 | 19.2 | 39.8 | 49.0 | 46.0 |
| +VisualPRM | 70.7 | 75.1 | 35.9 | 47.1 | 31.3 | 49.1 | 53.9 | 51.9 |
| | +0.7 | +2.8 | +3.7 | +7.9 | +12.1 | +9.3 | +4.9 | +5.9 |
Table 2: Results on seven multimodal reasoning benchmarks. MMMU [90] is a multidisciplinary reasoning benchmark. MathVista [50], MathVision [78], MathVerse [93], DynaMath [99], and WeMath [60] are mathematics benchmarks. For MathVerse, we report the performance on Vision-Only (VO) split. LogicVista [87] is a logical reasoning benchmark. Part of the results are collected from the OpenCompass leaderboard [19]. The overall score is the average score of the above benchmarks. By using VisualPRM as the critic model, existing open-source MLLMs achieve significant improvements in reasoning ability under the Best-of-8 evaluation strategy.
### 4.1 Results with Best-of-N evaluation
Benchmarks. We evaluate the reasoning abilities of MLLMs across seven benchmarks, including MMMU [90], MathVista [50], MathVision [78], MathVerse [93], DynaMath [99], WeMath [60], and LogicVista [87]. The evaluation samples include subject-based, mathematical, and logical reasoning problems. We report the worst-case accuracy for DynaMath and the overall accuracy for the remaining benchmarks. For MathVerse, we report the performance on the Vision-Only split.
Settings. Unless otherwise specified, we use VisualPRM as the critic model for BoN evaluation and set $N$ to $8$ by default. The policy model is required to generate $N$ distinct step-by-step Chain-of-Thought (CoT) reasoning processes with a temperature of 0.7. The response with the highest PRM score is then selected as the final answer for correctness evaluation.
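The BoN procedure can be sketched as below. Here `policy` and `prm` are hypothetical callables standing in for the actual models: `policy(question, temperature)` returns one step-by-step response, and `prm(question, response)` returns a scalar score.

```python
def best_of_n(question, policy, prm, n=8, temperature=0.7):
    """Sample N CoT responses from the policy and keep the one the PRM scores highest.

    A minimal sketch of Best-of-N selection; `policy` and `prm` are
    placeholders for the policy MLLM and the reward model.
    """
    # Draw N independent reasoning processes at the given temperature.
    candidates = [policy(question, temperature) for _ in range(n)]
    # Return the candidate with the highest critic score.
    return max(candidates, key=lambda r: prm(question, r))
```

Only the selection logic is shown; in practice the PRM score of a response is itself an aggregate of per-step scores, as discussed in Section 4.3.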
Results. As shown in Table 2, VisualPRM greatly enhances the reasoning abilities of MLLMs across different model scales and families. Specifically, for models with fewer than 10 billion parameters, the overall performance of InternVL2.5-8B, MiniCPM-V2.6-8B, and Qwen2.5-VL-7B improves by 8.4, 8.0, and 3.7 points, respectively, demonstrating the effectiveness of test-time scaling (TTS) across different model families. For larger models, InternVL2.5-26B, InternVL2.5-38B, and InternVL2.5-78B also achieve substantial gains of 8.9, 6.3, and 5.9 points over their counterparts without TTS, further validating the scalability and effectiveness of TTS across different model sizes.
### 4.2 Results on VisualProcessBench
Settings. For the evaluation of PRMs, a step is considered correct if the probability of outputting “ $+$ ” exceeds that of outputting “ $-$ ” by a certain threshold. For the evaluation of MLLMs, the model is prompted to analyze each step and classify it as either correct or incorrect. When computing the F1 score, we exclude steps labeled as neutral by human annotators in Section 3.3.
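The PRM decision rule above reduces to a simple comparison; the concrete threshold value used below is an assumption for illustration, as this section does not fix it.

```python
def judge_step(p_plus, p_minus, threshold=0.0):
    """Judge one step from the PRM's token probabilities.

    A step counts as correct when the probability of generating "+"
    exceeds that of "-" by `threshold` (a hypothetical hyperparameter).
    """
    return (p_plus - p_minus) > threshold
```

With `threshold=0.0` this is a plain argmax between the two tokens; a positive threshold trades recall on incorrect steps for precision on correct ones.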
Results. As shown in Table 3, most existing MLLMs struggle to accurately assess the correctness of each step. Specifically, the overall F1 score for random guessing is 50.0, while most open-source MLLMs achieve scores close to this baseline, highlighting their limitations as critic models. We manually check the judgments of these open-source MLLMs and observe that these models tend to provide positive analysis and label most steps as correct. For example, InternVL2.5-8B achieves an F1 score of 76.8 for positive steps, while its F1 score for negative steps is only 19.2, indicating that InternVL2.5-8B rarely identifies steps as incorrect. Furthermore, compared to proprietary models, our VisualPRM demonstrates competitive performance, achieving an overall F1 score of 62.0—outperforming GPT-4o and GPT-4o-Mini, and performing on par with Gemini-2.0-Flash. Notably, our model, with only 8 billion parameters, is more efficient than these proprietary counterparts.
| Model | MMMU | MathVision | MathVerse-VO | DynaMath | WeMath | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Random Guessing | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| Proprietary Models | | | | | | |
| GPT-4o-Mini [58] | 53.6 | 58.9 | 57.1 | 56.7 | 58.5 | 57.9 |
| GPT-4o [58] | 56.3 | 60.2 | 59.7 | 59.0 | 63.3 | 60.3 |
| Gemini-2.0-Flash [61] | 58.5 | 60.1 | 62.8 | 66.7 | 58.7 | 62.3 |
| Open-source Models | | | | | | |
| MiniCPM-V2.6-8B [89] | 44.9 | 50.9 | 58.9 | 46.7 | 57.4 | 50.4 |
| LLaVA-OV-7B [35] | 45.7 | 43.0 | 42.2 | 44.7 | 52.5 | 44.4 |
| LLaVA-OV-72B [35] | 46.1 | 48.4 | 53.0 | 57.0 | 57.3 | 52.3 |
| Qwen2.5-VL-7B [7] | 53.1 | 51.8 | 47.8 | 51.3 | 54.2 | 51.0 |
| Qwen2.5-VL-72B [7] | 59.2 | 59.0 | 59.7 | 62.9 | 62.3 | 60.5 |
| InternVL2.5-8B [15] | 47.1 | 45.5 | 47.8 | 50.3 | 50.8 | 48.0 |
| InternVL2.5-26B [15] | 48.8 | 47.4 | 49.2 | 50.4 | 51.4 | 49.2 |
| InternVL2.5-38B [15] | 51.5 | 48.4 | 50.9 | 51.8 | 52.5 | 50.8 |
| InternVL2.5-78B [15] | 52.0 | 51.7 | 53.7 | 50.8 | 52.5 | 52.6 |
| VisualPRM (ours) | 58.5 | 62.1 | 61.0 | 62.7 | 61.8 | 62.0 |
Table 3: Results on VisualProcessBench. We report the macro F1 of the correct and incorrect steps. The overall score is the micro average of the score from different data sources. Our VisualPRM exhibits state-of-the-art performance among open-source models.
### 4.3 Ablation Studies
Effects of BoN. Here, we increase the number of response candidates sampled from InternVL2.5-8B and select the final response using Self-Consistency (SC) [86], Outcome Reward Model (ORM), and PRM. The training data for ORM are nearly identical to those used for PRM, except that all steps are concatenated into a single step and step-wise correctness annotations are converted into a single correctness label for the outcome. As shown in Figure 4, increasing the number of response candidates $N$ improves the reasoning performance of InternVL2.5-8B and MiniCPM-V2.6-8B when using SC, ORM, or PRM, with PRM yielding the most significant improvements. Specifically, when using InternVL2.5-8B as the policy model, PRM outperforms SC and ORM by 2.4 and 1.5 points, respectively, under the Best-of-8 evaluation setting. Moreover, this performance gap widens as $N$ increases, reaching 3.1 and 4.3 points when $N$ is set to $128$ . Notably, when using ORM as the critic model, although performance improves during Best-of-8 evaluation, further increasing $N$ does not lead to consistent gains for InternVL2.5-8B. For example, the Best-of-128 performance is inferior to the Best-of-64 performance. These results highlight the effectiveness of PRM in TTS.
Effects of PRM modeling methods. Here, we compare the value-based PRM and the advantage-based PRM introduced in Section 3.2, along with different methods for aggregating step scores into a final score: averaging, selecting the maximum value, or selecting the minimum value. The results are presented in Table 4. We find that value-based PRMs outperform advantage-based PRMs in both BoN evaluation settings and VisualProcessBench. We attribute this to the inherent noise in our training data, which is generated through an automatic data pipeline, making it challenging to accurately determine whether a given step contributes to higher or lower expected accuracy. We also compare two training strategies: supervising all steps (i.e., w/o early stop) versus supervising only up to the first incorrect step (i.e., w. early stop) during training. Experimental results show that the former yields better performance. Regarding score aggregation, we find that selecting the maximum value performs worse than averaging or taking the minimum value. Analyzing the generated scores reveals that most responses contain a high-scored step, close to $1$ , at the beginning of the solution; this likely arises because most erroneous steps appear in the middle of the solution. Our statistics of VisualProcessBench presented in Section 8 further support this conclusion. Furthermore, averaging performs better than selecting the minimum value, likely because the latter relies on a single step’s score, while averaging accounts for multiple steps and can be viewed as an ensemble, which benefits step-quality estimation.
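The three aggregation methods compared above can be sketched as a single helper; `step_scores` is a hypothetical list of per-step PRM scores for one response.

```python
def aggregate(step_scores, method="average"):
    """Combine per-step PRM scores into one response-level score.

    `method` selects between the three strategies compared in Table 4:
    averaging, taking the minimum, or taking the maximum step score.
    """
    if method == "average":
        return sum(step_scores) / len(step_scores)
    if method == "min":
        return min(step_scores)
    if method == "max":
        return max(step_scores)
    raise ValueError(f"unknown aggregation method: {method}")
```

Note that aggregation only matters for BoN ranking: on VisualProcessBench, each step is judged individually, so the three methods coincide (as reflected by the identical benchmark columns in Table 4).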
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: InternVL2.5-8B Overall Performance vs. Best of N
### Overview
This line chart depicts the relationship between the "Best of N" parameter and the "InternVL2.5-8B Overall Performance" for three different models: Self-Consistency, VisualORM-8B, and VisualPRM-8B. The chart shows how performance changes as the number of samples ("N") considered increases.
### Components/Axes
* **X-axis:** "Best of N" - Values are 1, 8, 16, 32, 64, and 128.
* **Y-axis:** "InternVL2.5-8B Overall Performance" - Values range from approximately 32 to 44.
* **Legend:** Located in the bottom-right corner.
* Green Square: "Self-Consistency"
* Red Triangle: "VisualORM-8B"
* Blue Diamond: "VisualPRM-8B"
* **Gridlines:** Present to aid in reading values.
### Detailed Analysis
* **VisualPRM-8B (Blue Diamond):** This line shows a strong upward trend.
* At N=1, performance is approximately 32.5.
* At N=8, performance jumps to approximately 41.5.
* At N=16, performance is approximately 42.2.
* At N=32, performance is approximately 42.7.
* At N=64, performance is approximately 43.5.
* At N=128, performance reaches approximately 44.0.
* **VisualORM-8B (Red Triangle):** This line shows an initial steep increase, followed by a leveling off.
* At N=1, performance is approximately 32.2.
* At N=8, performance increases to approximately 39.5.
* At N=16, performance is approximately 39.8.
* At N=32, performance is approximately 40.2.
* At N=64, performance is approximately 40.5.
* At N=128, performance is approximately 40.7.
* **Self-Consistency (Green Square):** This line shows a moderate increase, with a plateau after N=32.
* At N=1, performance is approximately 34.0.
* At N=8, performance increases to approximately 39.0.
* At N=16, performance is approximately 39.5.
* At N=32, performance is approximately 41.0.
* At N=64, performance is approximately 41.2.
* At N=128, performance is approximately 41.2.
### Key Observations
* VisualPRM-8B consistently outperforms both VisualORM-8B and Self-Consistency across all values of N.
* The performance gains for VisualPRM-8B are most significant between N=1 and N=8.
* VisualORM-8B shows a rapid initial improvement, but its performance plateaus relatively quickly.
* Self-Consistency exhibits the slowest rate of improvement and reaches a plateau at a lower performance level than the other two models.
* All three models show diminishing returns as N increases beyond 32.
### Interpretation
The data suggests that increasing the number of samples considered ("Best of N") generally improves the overall performance of these models. However, the extent of this improvement varies significantly depending on the model architecture. VisualPRM-8B appears to benefit the most from increasing N, indicating that it is more sensitive to sample diversity or benefits from a more robust selection process. VisualORM-8B shows a strong initial gain, but its performance quickly stabilizes, suggesting that it may reach its optimal performance with a smaller number of samples. Self-Consistency demonstrates the least improvement, potentially indicating limitations in its underlying mechanism for leveraging multiple samples. The diminishing returns observed at higher values of N suggest that there is a point at which the cost of considering additional samples outweighs the performance benefits. This information is valuable for optimizing the inference process of these models, balancing performance gains with computational efficiency.
</details>
(a)
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Chart: MiniCPM-V-2.6 Overall Performance vs. Best of N
### Overview
This line chart illustrates the relationship between the "Best of N" parameter and the "MiniCPM-V-2.6 Overall Performance" for three different models: Self-Consistency, VisualORM-8B, and VisualPRM-8B. The chart displays how performance changes as the number of samples ("Best of N") increases.
### Components/Axes
* **X-axis:** "Best of N" - Values are 1, 8, 16, 32, 64, and 128.
* **Y-axis:** "MiniCPM-V-2.6 Overall Performance" - Scale ranges from approximately 29 to 41.
* **Legend:** Located in the top-right corner.
* Self-Consistency (Green squares)
* VisualORM-8B (Red triangles)
* VisualPRM-8B (Blue diamonds)
* **Gridlines:** Present to aid in reading values.
### Detailed Analysis
* **Self-Consistency (Green):** The line slopes upward, indicating increasing performance with increasing "Best of N".
* At N=1, performance is approximately 32.
* At N=8, performance is approximately 33.5.
* At N=16, performance is approximately 34.5.
* At N=32, performance is approximately 35.
* At N=64, performance is approximately 35.5.
* At N=128, performance is approximately 36.
* **VisualORM-8B (Red):** The line shows a steep initial increase, then plateaus.
* At N=1, performance is approximately 30.
* At N=8, performance is approximately 36.
* At N=16, performance is approximately 37.
* At N=32, performance is approximately 37.5.
* At N=64, performance is approximately 38.
* At N=128, performance is approximately 38.5.
* **VisualPRM-8B (Blue):** The line demonstrates a rapid increase initially, followed by a leveling off, achieving the highest performance.
* At N=1, performance is approximately 29.5.
* At N=8, performance is approximately 37.5.
* At N=16, performance is approximately 39.
* At N=32, performance is approximately 39.5.
* At N=64, performance is approximately 40.
* At N=128, performance is approximately 40.5.
### Key Observations
* VisualPRM-8B consistently outperforms both Self-Consistency and VisualORM-8B across all values of "Best of N".
* VisualORM-8B shows the most significant performance gain between N=1 and N=8.
* The performance gains for all models diminish as "Best of N" increases beyond 32.
* Self-Consistency exhibits the slowest rate of performance improvement.
### Interpretation
The data suggests that increasing the number of samples ("Best of N") generally improves the overall performance of these models on the MiniCPM-V-2.6 benchmark. However, the marginal benefit of increasing "Best of N" decreases as the value grows larger. VisualPRM-8B appears to be the most effective model, benefiting substantially from increased sampling, while Self-Consistency shows the least sensitivity to this parameter. This could indicate that VisualPRM-8B is more capable of leveraging information from multiple samples to refine its predictions. The plateauing effect observed at higher values of "Best of N" suggests a point of diminishing returns, where further increasing the number of samples does not yield significant performance improvements. This information is valuable for optimizing the models' configurations and balancing performance gains with computational costs.
</details>
(b)
Figure 4: Overall Best-of-N results across seven multimodal reasoning benchmarks with different policy and critic models. VisualPRM consistently enhances the reasoning performance of both InternVL2.5-8B and MiniCPM-V2.6 as $N$ increases, and outperforms the improvements introduced by Self-Consistency and ORM, highlighting the effectiveness of PRM in Test-Time Scaling.
| Critic Model | BoN | VL-ProcessBench |
| --- | --- | --- |
| Pass@1 | 32.8 | - |
| Random Guessing | 33.0 | 50.0 |
| InternVL2.5-8B | 33.2 | 48.0 |
| InternVL2.5-78B | 34.2 | 52.6 |
| Advantage-based PRM | | |
| +Min | 36.8 | 55.0 |
| +Max | 36.9 | 55.0 |
| +Average | 37.4 | 55.0 |
| Value (w. early stop) | | |
| +Min | 40.3 | 61.6 |
| +Max | 37.0 | 61.6 |
| +Average | 40.6 | 61.6 |
| Value (w/o early stop) | | |
| +Min | 40.4 | 62.0 |
| +Max | 35.9 | 62.0 |
| +Average | 41.1 | 62.0 |
Table 4: Comparison of different critic models and score aggregation methods. Note that score aggregation methods do not affect performance on VisualProcessBench, which focuses on step correctness judgment. We find that supervising all steps (w/o early stop) during training performs slightly better than supervising only up to the first incorrect step (w. early stop).
MLLM-as-a-Judge. Existing MLLMs can be prompted to serve as critic models. However, as shown in Table 4, the InternVL2.5 series struggles to improve BoN performance, yielding only marginal gains. Upon analyzing the generated scores, we find that these models assign similar scores to most solutions. Consistent with our observations in Section 4.2, the InternVL2.5 series tends to generate positive judgments for most steps, which hinders its ability to distinguish and select the truly superior response. Beyond their limited effectiveness as critic models, MLLMs also incur high inference latency: they generate judgments for each step in an autoregressive manner, which is time-consuming. In contrast, our VisualPRM computes scores for all steps in a single forward pass by using a “+” as a placeholder for model responses and interpreting its generation probability as the step score.
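The single-forward-pass scoring described above can be sketched as follows. The `{token_id: logit}` representation and the renormalization over only the “+” and “-” tokens are simplifying assumptions for illustration; the real model operates on full-vocabulary logits from one forward pass over the solution with a placeholder after each step.

```python
import math

def step_scores_single_pass(logits_at_placeholders, plus_id, minus_id):
    """Turn the logits at each "+" placeholder position into per-step scores.

    `logits_at_placeholders` is a hypothetical list with one
    {token_id: logit} dict per step, all produced by a single forward
    pass; the score is the probability of "+" renormalized against "-".
    """
    scores = []
    for logits in logits_at_placeholders:
        p = math.exp(logits[plus_id])   # unnormalized weight of "+"
        m = math.exp(logits[minus_id])  # unnormalized weight of "-"
        scores.append(p / (p + m))      # two-way softmax over {"+", "-"}
    return scores
```

Because every placeholder position is scored in the same pass, the cost is one forward computation per solution, rather than one autoregressive generation per step as with an MLLM judge.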
Results on text-only performance. To assess the effectiveness of VisualPRM on text-only inputs, we evaluate the Qwen2.5 series [88] and InternVL2.5 series [15] on three text reasoning benchmarks under BoN evaluation settings: GSM8K [18], MATH-500 [25], and GPQA-Diamond [62]. We report accuracy as the evaluation metric for these benchmarks. As shown in Table 5, our model enhances the text reasoning abilities of both the Qwen2.5 series and the InternVL2.5 series. Specifically, Qwen2.5-7B achieves improvements of 6.1 and 5.0 points on MATH-500 and GPQA-Diamond, respectively. Similarly, Qwen2.5-72B demonstrates gains of 2.1 and 6.6 points on these benchmarks. For the InternVL2.5 series, InternVL2.5-8B, InternVL2.5-38B, and InternVL2.5-78B achieve improvements of 9.4 and 5.0, 4.6 and 8.1, and 7.4 and 3.5 points, respectively, on MATH-500 and GPQA-Diamond. These results demonstrate the effectiveness of our VisualPRM in text-only scenarios.
| Model | GSM8K | MATH-500 | GPQA |
| --- | --- | --- | --- |
| Large Language Models | | | |
| Qwen2.5-7B [88] | 91.6 | 75.5 | 36.4 |
| +VisualPRM | 94.5 | 81.6 | 41.4 |
| | +2.9 | +6.1 | +5.0 |
| Qwen2.5-32B [88] | 95.9 | 83.1 | 49.5 |
| +VisualPRM | 96.1 | 85.4 | 53.5 |
| | +0.2 | +2.3 | +4.0 |
| Qwen2.5-72B [88] | 95.8 | 83.1 | 49.0 |
| +VisualPRM | 96.5 | 85.2 | 55.6 |
| | +0.7 | +2.1 | +6.6 |
| Multimodal Large Language Models | | | |
| InternVL2.5-8B [15] | 81.9 | 56.8 | 29.3 |
| +VisualPRM | 82.9 | 66.2 | 34.3 |
| | +1.1 | +9.4 | +5.0 |
| InternVL2.5-38B [15] | 94.6 | 75.4 | 44.4 |
| +VisualPRM | 95.6 | 80.0 | 52.5 |
| | +1.0 | +4.6 | +8.1 |
| InternVL2.5-78B [15] | 93.6 | 70.4 | 47.5 |
| +VisualPRM | 94.5 | 77.8 | 51.0 |
| | +0.9 | +7.4 | +3.5 |
Table 5: Results on text reasoning benchmarks. GSM8K and MATH-500 are mathematics benchmarks, while GPQA is a multidisciplinary reasoning benchmark. For GPQA, we report the performance on the Diamond split. Our VisualPRM enhances the text reasoning abilities of both LLMs and MLLMs under the Best-of-8 evaluation settings.
## 5 Conclusion
In this work, we construct VisualPRM400K, a dataset comprising about 400K multimodal process supervision data. Building upon this dataset, we develop VisualPRM, an advanced multimodal Process Reward Model (PRM) capable of estimating the value score of each step during the reasoning process. With the Best-of-N (BoN) evaluation strategies, our model improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation, highlighting the effectiveness of PRMs in Test-Time Scaling. To further facilitate the development of multimodal critic models, we construct VisualProcessBench, a benchmark designed to measure the abilities of PRMs and MLLMs to detect incorrect steps in multimodal reasoning tasks. Evaluation results show that existing open-source MLLMs struggle to effectively judge the correctness of each step. We hope that our work can inspire more future research and contribute to the development of MLLMs.
## References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ahmadian et al. [2024] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.
- Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NIPS, 35:23716–23736, 2022.
- Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com, 2024.
- Bai et al. [2023a] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
- Bai et al. [2023b] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.
- Bai et al. [2025] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- Biten et al. [2019] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NIPS, 2020.
- Cai et al. [2024] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
- Cao and Xiao [2022] Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In COLING, pages 1511–1520, 2022.
- Chang et al. [2022] Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022.
- Chen et al. [2024a] Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. arXiv preprint arXiv:2405.16473, 2024a.
- Chen et al. [2023] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
- Chen et al. [2024b] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024b.
- Chen et al. [2024c] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024c.
- Clark and Gardner [2018] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In ACL, pages 845–855, 2018.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Contributors [2023] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- Dong et al. [2024] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Gao et al. [2023] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023.
- Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Hu [2025] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025.
- Huang et al. [2019] Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019.
- Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019.
- Kafle et al. [2018] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, pages 5648–5656, 2018.
- Kazemi et al. [2023] Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023.
- Kembhavi et al. [2016] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016.
- Lai et al. [2024] Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024.
- Lambert et al. [2024] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024.
- Laurençon et al. [2024] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. NIPS, 36, 2024.
- Li et al. [2024a] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a.
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.
- Li et al. [2024b] Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vlrewardbench: A challenging benchmark for vision-language generative reward models. arXiv preprint arXiv:2411.17451, 2024b.
- Li et al. [2024c] Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024c.
- Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- Lindström and Abraham [2022] Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022.
- Liu et al. [2023a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NIPS, 36, 2023a.
- Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024a.
- Liu et al. [2024b] Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, et al. Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity. arXiv preprint arXiv:2407.15838, 2024b.
- Liu et al. [2024c] Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Rm-bench: Benchmarking reward models of language models with subtlety and style. arXiv preprint arXiv:2410.16184, 2024c.
- Liu et al. [2023b] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 2023b.
- Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Lu et al. [2021a] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021a.
- Lu et al. [2021b] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021b.
- Lu et al. [2022] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NIPS, 35:2507–2521, 2022.
- Lu et al. [2023] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- Luo et al. [2024] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2, 2024.
- Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019.
- Masry et al. [2022] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pages 2263–2279, 2022.
- Mathew et al. [2022] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022.
- McAleese et al. [2024] Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike. Llm critics help catch llm bugs. arXiv preprint arXiv:2407.00215, 2024.
- Mishra et al. [2019] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019.
- Muennighoff et al. [2025] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- OpenAI [2024] OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/, 2024.
- Pang et al. [2024] Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733, 2024.
- Qiao et al. [2024] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024.
- Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Rein et al. [2024] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
- Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NIPS, 35:25278–25294, 2022.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Seo et al. [2015] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In EMNLP, pages 1466–1476, 2015.
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, pages 8317–8326, 2019.
- Snell et al. [2024] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
- Song et al. [2025] Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. Prmbench: A fine-grained and challenging benchmark for process-level reward models. arXiv preprint arXiv:2501.03124, 2025.
- Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Team [2023] InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
- Team [2024] Qwen Team. Qvq: To see the world with wisdom, 2024.
- Tian et al. [2024] Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, et al. Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208, 2024.
- Toshniwal et al. [2025] Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Gitman. Openmathinstruct-1: A 1.8 million math instruction tuning dataset. NIPS, 37:34737–34774, 2025.
- Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Wang et al. [2024a] Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, 2024a.
- Wang et al. [2024b] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804, 2024b.
- Wang et al. [2023a] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2023a.
- Wang et al. [2023b] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. NIPS, 36, 2023b.
- Wang et al. [2023c] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023c.
- Wang et al. [2024c] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024c.
- Wang et al. [2024d] Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. arXiv preprint arXiv:2402.19474, 2024d.
- Wang et al. [2024e] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. In ICLR, 2024e.
- Wang et al. [2024f] Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, et al. Needle in a multimodal haystack. arXiv preprint arXiv:2406.07230, 2024f.
- Wang et al. [2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Xiao et al. [2024] Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024.
- Yang et al. [2024] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- Yao et al. [2024] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
- Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024.
- Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
- Zhang et al. [2024a] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240, 2024a.
- Zhang et al. [2024b] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024b.
- Zhang et al. [2025] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025.
- Zhao et al. [2025] Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, et al. Omnialign-v: Towards enhanced alignment of mllms with human preference. arXiv preprint arXiv:2502.18411, 2025.
- Zheng et al. [2024] Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559, 2024.
- Zhou et al. [2024] Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, et al. Rmb: Comprehensively benchmarking reward models in llm alignment. arXiv preprint arXiv:2410.09893, 2024.
- Zhu et al. [2024] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. NIPS, 36, 2024.
- Zou et al. [2024] Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024.
# Supplementary Material
## 6 Training Hyper-parameters
During the training of VisualPRM, we enable the data-packing strategy [15]. We employ the AdamW optimizer [46] with $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $0.05$. The learning rate is initialized to $1e\text{-}5$ and follows a linear warmup over the first 5% of training steps, after which it is decayed with a cosine schedule to a minimum learning rate of 0. We train for a single epoch.
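The resulting learning-rate schedule can be sketched as follows. This is an illustrative helper under the hyper-parameters stated above, not an excerpt of our training code:

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup over the first `warmup_frac` of steps, then cosine decay to `min_lr`."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr at the end of warmup.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule is continuous at the warmup boundary: the ramp reaches `peak_lr` exactly when the cosine term starts at its maximum.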
## 7 More Ablation Studies
### 7.1 Effects of Training Hyper-parameters
When training our value-based Process Reward Model (PRM) on VisualPRM400K, we define a step as correct if its expected accuracy exceeds 0. In this section, we analyze the impact of varying this expected-accuracy threshold. As shown in Table 8, increasing the threshold degrades both Best-of-8 evaluation performance and VisualProcessBench scores. These results are consistent with the observations for Qwen2.5-Math-PRM [94]. We therefore suggest setting the threshold to $0$ during training.
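This labeling rule can be sketched as follows: if a step's expected accuracy is estimated as the fraction of continuations sampled from that step that reach the correct final answer, the threshold turns those estimates into binary step labels. This is illustrative code, not the exact data pipeline:

```python
def label_steps(rollout_outcomes, threshold=0.0):
    """Label each step of a solution as correct ('+') or incorrect ('-').

    rollout_outcomes: one list per step of 0/1 outcomes, where each outcome
    records whether a sampled continuation from that step reached the correct
    final answer. The step's expected accuracy is the mean outcome; the step
    is labeled correct when that mean strictly exceeds the threshold.
    """
    labels = []
    for outcomes in rollout_outcomes:
        expected_acc = sum(outcomes) / len(outcomes)
        labels.append('+' if expected_acc > threshold else '-')
    return labels
```

With the default threshold of 0, a step counts as correct as soon as any continuation succeeds; raising the threshold discards more steps as incorrect.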
### 7.2 Effects of Generation Hyper-parameters
In this section, we analyze the impact of generation temperature on the Best-of-8 evaluation. As shown in Table 8, as the temperature increases from 0.3 to 1.3, the overall performance of InternVL2.5-8B first improves and then declines. We attribute this phenomenon to the trade-off between response diversity and accuracy. When the temperature is low (e.g., set to 0.3), the generated responses lack diversity, limiting the model’s performance upper bound. Conversely, when the temperature is high (e.g., set to 1.3), the responses become more random, reducing the accuracy of individual responses and lowering the model’s overall performance ceiling.
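The diversity–accuracy trade-off follows directly from how temperature reshapes the sampling distribution: tokens are drawn from softmax(logits / T), so a low temperature concentrates probability mass on the top candidates (less diverse responses), while a high temperature flattens the distribution (more random responses). A minimal sketch, with made-up logits for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

For logits `[2.0, 1.0, 0.5]`, the top candidate's probability is much larger at temperature 0.3 than at 1.3, which is exactly the loss of diversity described above.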
### 7.3 Effects of Best-of-N evaluation
In this section, we present Best-of-N evaluation results as $N$ increases, as shown in Table 6 and Table 7. Our results indicate that as $N$ grows, VisualPRM consistently enhances the reasoning abilities of InternVL2.5-8B [15] and MiniCPM-V2.6 [89]. Specifically, as $N$ increases from $8$ to $128$, the overall performance of InternVL2.5-8B improves from 41.2 to 44.0 and that of MiniCPM-V2.6 from 37.5 to 40.4, demonstrating the scalability of test-time scaling for MLLMs.
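As a sketch of how a PRM drives Best-of-N selection: each of the $N$ candidate responses is scored step by step, the per-step scores are aggregated into a response-level score, and the highest-scoring candidate is returned. The code below averages per-step scores; the exact aggregation function used in our evaluation may differ:

```python
def best_of_n(candidates, prm_step_scores):
    """Select one of N candidate responses using per-step PRM scores.

    candidates: list of N responses.
    prm_step_scores: for each candidate, a list of step scores
    (e.g. the PRM's estimated probability that the step is correct).
    Candidates are ranked by the mean of their step scores.
    """
    def response_score(step_scores):
        return sum(step_scores) / len(step_scores)

    best_idx = max(range(len(candidates)),
                   key=lambda i: response_score(prm_step_scores[i]))
    return candidates[best_idx]
```

Under mean aggregation, a response with uniformly solid steps beats one with a single strong step followed by weak ones.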
| Method | N | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-Consistency | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 58.0 | 65.9 | 23.4 | 30.5 | 18.4 | 32.7 | 43.0 | 38.8 |
| | 16 | 58.6 | 65.8 | 26.3 | 32.1 | 19.4 | 33.0 | 43.4 | 39.8 |
| | 32 | 60.4 | 66.7 | 28.0 | 32.6 | 20.8 | 34.1 | 44.7 | 41.0 |
| | 64 | 59.7 | 66.7 | 26.6 | 33.2 | 20.6 | 35.8 | 43.4 | 40.9 |
| | 128 | 60.6 | 67.4 | 25.7 | 32.0 | 22.6 | 34.7 | 43.2 | 40.9 |
| VisualORM | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 60.2 | 67.0 | 25.3 | 32.5 | 16.4 | 35.0 | 41.8 | 39.7 |
| | 16 | 58.3 | 67.7 | 27.0 | 33.6 | 16.6 | 33.1 | 39.1 | 39.3 |
| | 32 | 58.6 | 67.9 | 26.3 | 33.6 | 17.4 | 34.4 | 42.1 | 40.0 |
| | 64 | 59.4 | 66.8 | 28.6 | 33.9 | 17.8 | 34.1 | 42.3 | 40.4 |
| | 128 | 59.4 | 66.6 | 28.3 | 33.5 | 16.8 | 32.3 | 40.9 | 39.7 |
| VisualPRM | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 60.2 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.2 |
| | 16 | 60.2 | 69.9 | 27.3 | 36.4 | 19.0 | 38.8 | 42.5 | 42.0 |
| | 32 | 60.3 | 70.4 | 29.6 | 37.8 | 17.2 | 40.3 | 43.4 | 42.7 |
| | 64 | 61.4 | 69.6 | 30.6 | 38.2 | 18.8 | 40.2 | 45.4 | 43.5 |
| | 128 | 61.7 | 70.8 | 30.3 | 39.3 | 19.4 | 40.9 | 45.4 | 44.0 |
Table 6: Overall Best-of-N results of InternVL2.5-8B across seven multimodal reasoning benchmarks with different critic models.
| Method | N | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-Consistency | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 51.8 | 58.9 | 21.7 | 31.5 | 10.0 | 22.6 | 35.6 | 33.2 |
| | 16 | 51.7 | 60.2 | 21.7 | 31.5 | 11.6 | 25.7 | 35.3 | 34.0 |
| | 32 | 52.2 | 60.1 | 24.3 | 33.1 | 11.4 | 24.3 | 36.0 | 34.5 |
| | 64 | 51.7 | 61.0 | 23.4 | 34.8 | 12.8 | 25.8 | 35.3 | 35.0 |
| | 128 | 53.2 | 61.7 | 25.7 | 33.5 | 13.0 | 25.6 | 35.6 | 35.5 |
| VisualORM | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 55.7 | 66.0 | 22.0 | 33.5 | 10.2 | 24.1 | 38.9 | 35.8 |
| | 16 | 56.4 | 65.3 | 24.0 | 32.1 | 10.4 | 27.3 | 36.5 | 36.0 |
| | 32 | 58.8 | 64.8 | 19.7 | 35.7 | 12.0 | 29.4 | 38.5 | 37.0 |
| | 64 | 58.2 | 67.3 | 22.7 | 35.5 | 11.0 | 30.1 | 37.6 | 37.5 |
| | 128 | 58.2 | 66.5 | 25.3 | 35.4 | 11.6 | 30.0 | 40.7 | 38.2 |
| VisualPRM | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 56.8 | 65.7 | 24.7 | 35.8 | 11.2 | 31.0 | 37.4 | 37.5 |
| | 16 | 58.8 | 68.6 | 24.0 | 37.3 | 12.4 | 32.7 | 39.8 | 39.1 |
| | 32 | 57.8 | 68.4 | 26.6 | 38.5 | 13.4 | 35.3 | 39.1 | 39.9 |
| | 64 | 58.6 | 69.4 | 25.3 | 39.7 | 12.2 | 38.2 | 36.9 | 40.0 |
| | 128 | 59.3 | 69.4 | 25.3 | 39.1 | 14.4 | 37.0 | 38.3 | 40.4 |
Table 7: Overall Best-of-N results of MiniCPM-V2.6 across seven multimodal reasoning benchmarks with different critic models.
| Setting | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall | VisualProcessBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Threshold=0.00 | 59.3 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.1 | 62.0 |
| Threshold=0.0625 | 59.7 | 66.8 | 24.7 | 36.7 | 18.4 | 35.0 | 41.8 | 40.4 | 61.0 |
| Threshold=0.125 | 58.0 | 67.9 | 27.6 | 35.4 | 17.4 | 35.3 | 41.6 | 40.5 | 60.7 |
| Threshold=0.25 | 58.6 | 67.6 | 25.7 | 33.6 | 16.8 | 36.0 | 41.4 | 40.0 | 60.2 |
| Temperature=0.3 | 59.7 | 69.4 | 26.0 | 32.6 | 17.6 | 35.5 | 42.7 | 40.5 | - |
| Temperature=0.7 | 59.3 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.1 | - |
| Temperature=1.0 | 61.7 | 67.2 | 27.3 | 35.8 | 16.6 | 34.2 | 43.2 | 40.9 | - |
| Temperature=1.3 | 57.9 | 66.1 | 25.0 | 32.1 | 16.8 | 31.9 | 40.5 | 38.6 | - |
Table 8: Ablation studies on the effects of the expected accuracy threshold and the generation temperature.
## 8 More Statistics for VisualProcessBench
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Chart: Step Index vs. Steps (%) for Math Models
### Overview
This image presents a line chart comparing the percentage of steps taken by five different math models (MathVision, MathVerse, MMMU, DynaMath, and WeMath) across a step index ranging from approximately 0 to 30. The y-axis represents the percentage of steps, while the x-axis represents the step index. The chart uses different colored lines to represent each model, with markers indicating data points.
### Components/Axes
* **X-axis Title:** Step Index
* **Y-axis Title:** Steps (%)
* **X-axis Scale:** Linear, ranging from 0 to 30, with tick marks at integer values.
* **Y-axis Scale:** Linear, ranging from 0 to 5, with tick marks at integer values.
* **Legend:** Located in the top-right corner, listing the following models and their corresponding colors:
* MathVision (Gray)
* MathVerse (Red)
* MMMU (Blue)
* DynaMath (Green)
* WeMath (Purple)
* **Gridlines:** Vertical gridlines are present throughout the chart, aiding in reading values.
### Detailed Analysis
Let's analyze each line individually, noting trends and approximate data points.
* **MathVision (Gray):** The line starts at approximately 1.5% at Step Index 0, increases to a peak of around 3.2% at Step Index 5, then gradually decreases to approximately 0.8% by Step Index 25, remaining relatively flat until Step Index 30.
* **MathVerse (Red):** This line exhibits a sharp increase from approximately 1.5% at Step Index 0 to a peak of around 5% at Step Index 10. It then rapidly declines to approximately 0.5% by Step Index 15 and remains relatively flat around 0.3% for the remainder of the chart.
* **MMMU (Blue):** The line begins at approximately 0.8% at Step Index 0, increases to a peak of around 2.5% at Step Index 7, then decreases to approximately 0.5% by Step Index 15. It remains relatively flat around 0.3% for the rest of the chart.
* **DynaMath (Green):** This line starts at approximately 1.2% at Step Index 0, increases to a peak of around 3% at Step Index 12, then declines to approximately 0.5% by Step Index 20, remaining relatively flat around 0.3% until Step Index 30.
* **WeMath (Purple):** The line begins at approximately 0.7% at Step Index 0, increases to a peak of around 1.2% at Step Index 5, then decreases to approximately 0.3% by Step Index 10. It remains relatively flat around 0.2% for the remainder of the chart.
### Key Observations
* MathVerse demonstrates the most significant initial increase in steps, peaking at 5% before rapidly declining.
* MathVision shows a more gradual increase and decrease in steps compared to MathVerse.
* MMMU, DynaMath, and WeMath exhibit similar patterns of initial increase followed by a plateau around 0.3-0.5%.
* All models converge to a very low percentage of steps (around 0.2-0.3%) after Step Index 20.
* The peak step percentages occur at different step indices for each model, indicating varying timings for maximum activity.
### Interpretation
The chart likely represents the progress or activity of different mathematical models as they iterate through a problem-solving process. The "Step Index" could refer to the iteration number, and "Steps (%)" could represent the percentage of computational effort or progress made at each step.
The initial rapid increase in MathVerse suggests it quickly focuses its computational effort, but then quickly stabilizes. The more gradual increase in MathVision suggests a more sustained, but less intense, effort. The convergence of all models to low step percentages after Step Index 20 indicates that the problem may have been solved or that the models have reached a point of diminishing returns.
The differences in peak step percentages and timings suggest that each model employs a different strategy for solving the problem. MathVerse is aggressive, MMMU is steady, and WeMath is conservative. The chart provides a comparative analysis of the efficiency and behavior of these models during the problem-solving process. The data suggests that MathVerse is the most active initially, but all models eventually reach a similar level of inactivity.
</details>
(a)
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Incorrect Steps vs. Step Index for Math Problem Solvers
### Overview
This line chart depicts the percentage of incorrect steps taken by several math problem-solving models (MathVision, MathVerse, MMU, DynaMath, and WeMath) as a function of the step index in the problem-solving process. The x-axis represents the step index, ranging from 0 to approximately 30. The y-axis represents the percentage of incorrect steps, ranging from 0% to 100%. There are shaded regions indicating periods of high uncertainty or variability in the data.
### Components/Axes
* **X-axis:** "Step Index" - Ranges from 0 to 30, with tick marks at integer values.
* **Y-axis:** "Incorrect Steps (%)" - Ranges from 0 to 100, with tick marks at 0, 20, 40, 60, 80, and 100.
* **Legend:** Located in the top-right corner. Contains the following entries:
* MathVision (Dark Gray)
* MathVerse (Red)
* MMU (Green)
* DynaMath (Light Green)
* WeMath (Purple)
* **Shaded Regions:** Several vertical shaded regions in light gray indicate periods of high variance or uncertainty. These regions span approximately from step index 23 to 28.
### Detailed Analysis
Here's a breakdown of each data series, noting trends and approximate data points.
* **MathVision (Dark Gray):** The line starts at approximately 0% incorrect steps at step index 0, rises steadily to a peak of around 48% at step index 12, then declines to approximately 25% at step index 30. There's a noticeable drop in incorrect steps around step index 16.
* **MathVerse (Red):** Starts at approximately 8% at step index 0, increases to a peak of around 40% at step index 10, then decreases to approximately 20% at step index 30. It exhibits a relatively smooth curve.
* **MMU (Green):** Begins at approximately 2% at step index 0, increases rapidly to a peak of around 65% at step index 21, and then drops sharply to approximately 20% at step index 30. This line shows the most dramatic increase and decrease.
* **DynaMath (Light Green):** Starts at approximately 0% at step index 0, increases to around 20% at step index 10, then rises sharply to approximately 65% at step index 21, and then drops to approximately 10% at step index 30.
* **WeMath (Purple):** Starts at approximately 4% at step index 0, increases to a peak of around 35% at step index 8, then decreases to approximately 15% at step index 30. It shows a relatively stable performance after step index 15.
### Key Observations
* MMU and DynaMath exhibit the highest percentage of incorrect steps, particularly between step indices 15 and 25.
* MathVision consistently shows a moderate level of incorrect steps throughout the process.
* WeMath generally has the lowest percentage of incorrect steps, especially after step index 15.
* The shaded regions suggest increased variability in performance around step index 25, potentially indicating a more challenging part of the problem-solving process.
* All models show an initial increase in incorrect steps, followed by a decrease, suggesting a learning or adaptation phase.
### Interpretation
The chart demonstrates the performance of different math problem-solving models across a series of steps. The varying trajectories suggest that each model has different strengths and weaknesses at different stages of the problem-solving process. The initial increase in incorrect steps for all models could represent the initial exploration and hypothesis-generation phase, where errors are more common. The subsequent decrease suggests that the models learn from their mistakes and improve their accuracy as they progress.
The significant performance drop of MMU and DynaMath after step index 21 could indicate that these models struggle with a specific type of step or concept encountered later in the problem. The relatively stable performance of WeMath after step index 15 suggests that it is more robust and less susceptible to these challenges.
The shaded regions highlight periods where the models' performance is less predictable, potentially due to the complexity of the problem or the inherent uncertainty in the problem-solving process. Further investigation into the specific steps within these regions could reveal valuable insights into the models' limitations and areas for improvement. The data suggests that no single model consistently outperforms the others across all steps, indicating that a combination of approaches might be optimal for solving complex math problems.
</details>
(b)
Figure 5: Step Distribution of VisualProcessBench. The X-axis represents the step index. (a) The Y-axis indicates the proportion of steps at each index relative to the total number of steps, reflecting the distribution of step positions in solutions. (b) The Y-axis represents the error rate of steps at each index, showing the likelihood of errors occurring at different step positions.
The step-distribution statistics for VisualProcessBench are presented in Figure 5. We observe that most solutions consist of fewer than 15 steps; among these, most contain about 7 or 13 steps. Regarding step correctness, the error rate is lower in the first three steps and then increases as the step index grows. We attribute this to the fact that problems requiring more reasoning steps tend to be more challenging, leading to a gradual rise in step error rates. Notably, starting from step 15 the error rate drops sharply, because the number of steps in this range is relatively small, resulting in significant statistical fluctuations.
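The two quantities plotted in Figure 5 can be computed directly from step-level labels. A sketch, assuming each solution is represented as a list of 0/1 step-correctness labels:

```python
from collections import Counter

def step_stats(solutions):
    """Compute the two per-index statistics shown in Figure 5.

    solutions: list of solutions, each a list of step labels (1 = correct,
    0 = erroneous). Returns (position_dist, error_rate):
      position_dist[i] — fraction of all steps that occur at index i;
      error_rate[i]    — fraction of steps at index i that are erroneous.
    """
    total_counts = Counter()
    error_counts = Counter()
    total_steps = 0
    for labels in solutions:
        for idx, correct in enumerate(labels):
            total_counts[idx] += 1
            error_counts[idx] += 1 - correct
            total_steps += 1
    position_dist = {i: c / total_steps for i, c in total_counts.items()}
    error_rate = {i: error_counts[i] / total_counts[i] for i in total_counts}
    return position_dist, error_rate
```

Note that `error_rate` at large indices is estimated from few steps, which is exactly the statistical-fluctuation caveat discussed above.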
## 9 More Data Examples in VisualPRM400K
In this section, we provide more data examples of VisualPRM400K in Figure 6 from different domains, including general visual question answering (VQA) [23, 28, 52, 48], science [31, 13, 49], chart [53, 29, 12], mathematics [40, 65, 11, 47, 30, 22], OCR [54, 67, 8, 27, 56], and document [17].
## 10 More Data Examples in VisualProcessBench
In this section, we provide more data examples in VisualProcessBench from different data sources. Specifically, we randomly choose three examples from our benchmark and visualize them in Figure 7. Additionally, in Figure 8, we provide an example where the model initially generates an incorrect answer and then autonomously corrects it.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Receipt & Solution Breakdown: Geoventure Invoice
### Overview
The image presents a receipt from "Geoventure Taste Of The World Sdn Bhd" along with a step-by-step solution to calculate the total amount of the receipt. The receipt details an order for "S Tiger Bucket" and includes service charge and GST. The solution breaks down the calculation process, identifying items, adding costs, and accounting for additional charges.
### Components/Axes
The image is divided into two main sections: the receipt itself (left side) and the solution steps (right side).
**Receipt Details:**
* **Company:** Geoventure Taste Of The World Sdn Bhd
* **Address:** 99655-A, Lot 38 & 39, Boulevard, Jalan Kempas Lama, 81200 Johor Bahru, Johor.
* **Tel:** 03-8079 0255
* **GST Reg. No:** 0006457104
* **Reservation No:** 010 - 775055
* **Simple Lifted Tax Invoice**
* **Invoice:** 581355
* **Cashier:** Niki
* **POS No:** 1
* **Date:** 5/3/2018 22:52:53
* **Table No:** C38
**Receipt Table Columns:**
* ITEM
* QTY
* UP (RM) 0.00/0.50/1.00
* AMT (RM)
**Solution Steps:**
* Numbered steps from #Step-1 to #Step-6.
* Each step has a confidence score (e.g., (unc=0.5625)).
### Detailed Analysis or Content Details
**Receipt Table Data:**
| ITEM | QTY | UP (RM) | AMT (RM) |
| ---------------- | --- | ------- | -------- |
| S Tiger Bucket | 1.0 | 55.00 | 55.00 |
| Service Chg (Incl GST) | 1.0 | 0.00 | 5.83 |
| Total Sales (Incl GST) | 1.0 | 0.00 | 64.15 |
| Rounding Adjustment | 1.0 | 0.00 | 0.02 |
**Receipt Totals:**
* **TOTAL:** RM 64.15
* **CASH:** RM 100.00
* **CHANGE:** RM 35.85
**Solution Steps Breakdown:**
1. **(unc=0.5625)** Identify the items and their prices on the receipt: "S Tiger Bucket" costs RM 55.00.
2. **(unc=0.8125)** Add up the total cost of the items: Total cost of the items: RM 55.00.
3. **(unc=0.4375)** Look for any additional charges (if any): There is a service charge (S.Chrg) including GST of RM 5.83.
4. **(unc=0.5625)** Include the service charge in the total cost: Total with service charge: RM 55.00 + RM 5.83 = RM 60.83.
5. **(unc=0.8125)** Total Sales (including GST): RM 64.15 is listed under "Total Sales (Incl GST 6%)". This includes the GST (which is calculated based on the service charge).
6. **(unc=0.5)** Calculate the amount paid by the customer: Cash paid: RM 100.00.
7. **(unc=0.625)** Subtract the total sales from the amount paid: RM 100.00 - RM 64.15 = RM 35.85.
8. **(unc=0.5625)** Final answer: Therefore, the customer's change is RM 35.85.
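The arithmetic in the steps above can be checked with a short sketch (values taken from the receipt; the variable names are illustrative):

```python
# Recomputing the receipt totals from the solution steps.
items = {"S Tiger Bucket": 55.00}            # item prices from the receipt
service_charge_incl_gst = 5.83               # Service Chg (Incl GST)

subtotal = sum(items.values())                        # RM 55.00
with_service = subtotal + service_charge_incl_gst     # RM 60.83
total_sales = 64.15                                   # Total Sales (Incl GST), as printed
cash_paid = 100.00
change = round(cash_paid - total_sales, 2)            # RM 35.85

print(round(with_service, 2), change)
```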
### Key Observations
* The receipt includes a service charge and GST.
* The solution accurately breaks down the calculation of the total amount and change.
* The confidence scores associated with each step of the solution vary, suggesting differing levels of certainty across the reasoning steps.
* The receipt is dated March 5, 2018.
### Interpretation
The image demonstrates a typical point-of-sale transaction and a clear explanation of how the total amount and change are calculated. The inclusion of the solution steps suggests this is part of a learning exercise or a demonstration of transparency in pricing. The receipt provides a record of the purchase, including item details, costs, and taxes. The solution confirms that the change given to the customer (RM 35.85) is correct based on the total amount (RM 64.15) and the cash received (RM 100.00). The varying confidence scores indicate differing certainty in individual reasoning steps, but the overall logic and calculation remain consistent.
</details>
(a)
<details>
<summary>x9.png Details</summary>

### Visual Description
## Document Analysis: Question Answering Example
### Overview
The image presents a screenshot of a document with a question-and-answer format. The document appears to be a scanned letter, accompanied by a question about the date mentioned within it, and a step-by-step solution. The solution is presented with associated confidence scores (mc=...).
### Components/Axes
The image is divided into three main sections:
1. **Document:** A scanned image of a handwritten letter.
2. **Question:** A text box containing the question "What is the date mentioned in the document?".
3. **Solution:** A series of steps explaining how to determine the date, with associated confidence scores.
The document itself contains the following visible elements:
* **Sender:** Not clearly legible, but appears to start with "Mr. Fleck".
* **Recipient:** Dudley's sister (address partially visible).
* **Date:** "11/20" is explicitly mentioned.
* **Location:** Ward Island, Corpus Christi, Texas.
* **Content:** Mentions writing to him today (11/20) and that they may be in for book learning.
* **Signature:** "Charles".
* **Source:** `https://www.industrydocuments.ucsf.edu/docs/ffhi0226`
The solution steps are numbered 1-4, each with a confidence score (mc=...).
### Detailed Analysis or Content Details
Here's a transcription of the solution steps and their confidence scores:
* **Step 1 (mc=0.6625):** "To determine the date mentioned in the document, let's follow these steps: 1. Look for any reference to a date in the text. 2. Identify the date by its numerical form or any text that explicitly mentions a date."
* **Step 2 (mc=0.125):** "Steps broken down:"
* **Step 3 (mc=0.5625):** "1. In the document, the date "11/20" is mentioned in the text. 2. This is likely an abbreviation of November 20, indicating a date."
* **Step 4 (mc=0.0):** "Final answer: 11/20"
The document's date is explicitly stated as "11/20". The solution correctly identifies this date and interprets it as November 20th.
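As a hedged sketch, the per-step scores above can be turned into binary step labels; purely for illustration, assume a step counts as correct when its mc score is positive (this threshold is an assumption, not a documented criterion):

```python
# Illustrative thresholding of per-step mc scores into correct/incorrect labels.
# The mc > 0 rule is an assumption for this sketch.
steps = [("Step 1", 0.6625), ("Step 2", 0.125), ("Step 3", 0.5625), ("Step 4", 0.0)]
labels = {name: ("correct" if mc > 0 else "incorrect") for name, mc in steps}
print(labels)
```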
### Key Observations
* The confidence scores associated with each step of the solution are variable. Step 1 has the highest confidence (0.6625), while Step 4 has zero confidence. This suggests the model is more certain about the initial steps of the process than the final answer.
* The document is a scanned image, resulting in some text being difficult to read.
* The solution provides a logical breakdown of how to identify the date within the document.
### Interpretation
The image demonstrates a question-answering system applied to a document. The system successfully extracts the date "11/20" from the scanned letter and provides a step-by-step explanation of its reasoning. The confidence scores suggest that the system's certainty varies depending on the complexity of the task. The low confidence score for the final answer might indicate that the system relies heavily on pattern matching and struggles with inferential reasoning. The source URL suggests this document is part of a larger collection related to industry research, potentially tobacco industry documents, given the UCSF domain. The document itself appears to be a personal letter, and the question is designed to test the system's ability to extract specific information from unstructured text.
</details>
(b)
<details>
<summary>x10.png Details</summary>

### Visual Description
## Problem Solution: Sphere Subtraction
### Overview
The image presents a mathematics problem involving the subtraction of red shiny spheres from a set of spheres. It includes a visual representation of the spheres and a step-by-step solution.
### Components/Axes
The image is divided into three main sections: a header with the title "Mathematics", a visual section with spheres, and a solution section with numbered steps. The solution section contains text explaining the process of solving the problem.
### Detailed Analysis or Content Details
The problem asks: "Subtract all red shiny spheres. How many spheres are left?".
The solution is broken down into the following steps:
* **Step-1 (unc=0.6875)**: "To determine how many spheres are left after subtracting the red shiny sphere, let's break down the objects in the image step by step:"
1. "There are two spheres visible in the image."
2. "One of these spheres is red and shiny."
3. "The other sphere is blue and shiny."
* **Step-2 (unc=0.0625)**: "Since the question asks for the spheres after removing the red one, we need to subtract the red sphere from the total count of spheres."
* **Step-3 (unc=0.0)**: "Final answer: 1"
The image shows two spheres: one red and shiny, and one blue and shiny.
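The counting logic in this example amounts to a filter over the described objects; a minimal sketch, with the object list reconstructed from the description:

```python
# Subtract all red shiny spheres from the scene and count what remains.
spheres = [
    {"color": "red", "finish": "shiny"},
    {"color": "blue", "finish": "shiny"},
]
remaining = [s for s in spheres
             if not (s["color"] == "red" and s["finish"] == "shiny")]
print(len(remaining))
```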
### Key Observations
The solution correctly identifies the two spheres and subtracts the red one, resulting in a final answer of 1. The "unc=" values (0.6875, 0.0625, 0.0) appear to be confidence scores or some other metric associated with each step, decreasing as the solution progresses.
### Interpretation
The image demonstrates a simple subtraction problem presented in a visual context. The step-by-step solution breaks down the problem into manageable parts, making it easier to understand. The inclusion of the visual representation helps to clarify the problem and solution. The decreasing "unc=" values suggest a decreasing level of uncertainty as the solution is completed, potentially indicating a confidence level in the correctness of each step. The problem is designed to test basic counting and subtraction skills.
</details>
(c)
<details>
<summary>x11.png Details</summary>

### Visual Description
## Diagram: Foot Type Identification
### Overview
The image presents a science-related diagram illustrating different foot types (Normal, Flat, Rigid, Big) and a question asking to identify the foot type corresponding to Diagram D. The diagram shows four footprint images labeled A, B, C, and D, alongside a step-by-step solution process.
### Components/Axes
The image contains the following components:
* **Footprint Diagrams:** Four footprint images labeled A, B, C, and D.
* **Question:** "Diagram D corresponds with what foot type? Normal Flat Rigid Big. Please answer the question based on the options mentioned before."
* **Solution:** A step-by-step solution process with associated confidence scores (mc=).
* Step 0 (mc=0.75): Introduction to the problem.
* Step 1 (mc=0.9375): Description of Normal Foot.
* Step 2 (mc=1.0): Description of Flat Foot.
* Step 3 (mc=1.0): Description of Rigid Foot.
* Step 4 (mc=0.9375): Description of Big Foot.
* Step 5 (mc=0.0): Analysis of Diagram D and conclusion.
* Step 6 (mc=0.0): Final answer.
### Detailed Analysis or Content Details
The footprints are arranged horizontally across the image.
* **Footprint A:** Shows a footprint with a pronounced arch.
* **Footprint B:** Shows a footprint with a very slight arch.
* **Footprint C:** Shows a footprint with a moderate arch.
* **Footprint D:** Shows a footprint where the heel and the front of the foot are touching the ground, with a distinctly visible arch.
The solution process provides the following descriptions:
* **Normal Foot:** Generally shows evenly distributed pressure across the sole, with well-defined arches.
* **Flat Foot:** Typically shows the entire foot pressing against the ground, with little to no arch (or almost no arch) visible.
* **Rigid Foot:** Harder to identify but generally tends to show little flexibility or bouncing when walking.
* **Big Foot:** Seems to refer to the general size of the foot rather than its type based on pressure distribution.
The analysis in Step 5 concludes that Diagram D represents a normal foot, where there is a clear arch leaving an impression on the ground. The final answer is "Normal".
### Key Observations
The solution process assigns confidence scores (mc) to each step, indicating the certainty of the analysis. The highest confidence scores are given to the descriptions of Flat and Rigid feet (mc=1.0). The final answer is determined by visual examination of Diagram D.
### Interpretation
The diagram demonstrates a method for identifying foot types based on footprint analysis. The step-by-step solution provides a logical reasoning process, starting with defining each foot type and then applying those definitions to the given footprint (Diagram D). The confidence scores suggest that the descriptions of Flat and Rigid feet are more definitive than those of Normal and Big feet. The diagram highlights the importance of observing the arch of the foot to determine its type. The inclusion of "Big Foot" as an option seems somewhat irrelevant to the actual analysis of foot *type* based on pressure distribution. The diagram is a pedagogical tool for understanding podiatric concepts.
</details>
(d)
<details>
<summary>x12.png Details</summary>

### Visual Description
## Document: Visual Question Answering (VQA) Example
### Overview
The image presents a document illustrating a Visual Question Answering (VQA) problem. It includes a stylized image of a flying object (appearing as a dashed arrow), a question about the image's symmetry, multiple-choice answers, and a step-by-step solution.
### Components/Axes
The document is structured into three main sections:
1. **Image:** A dashed arrow representing a flying object, positioned on the left side of the document.
2. **Question & Answers:** Located in the center-right, presenting the question "Does this picture have symmetry?" with answer choices A. yes and B. no.
3. **Solution:** Occupying the right side of the document, detailing the steps to determine the answer. The solution is broken down into numbered steps.
### Content Details
The text content is transcribed as follows:
**Header:** General VQA
**Question:**
Does this picture have symmetry?
A. yes
B. no
**Solution:**
To determine if the picture has symmetry, follow these steps:
## Step-1 (mc=0.625)
1. Identify the object in the picture. In this case, it is a stylized representation of a flying object.
2. Look at the object's shape. The object appears to be bilaterally symmetrical, meaning it is symmetrical around a line that runs down the middle of the object.
3. This symmetry can be demonstrated by the object appearing to be a mirror image of itself along the central vertical line.
## Step-2 (mc=0.0)
Since the object appears to meet the criteria for symmetry, the answer is:
## Step-3 (mc=0.0)
Final answer: A
The values `mc=0.625` and `mc=0.0` appear next to "Step-1" and "Step-2" respectively. These likely represent machine confidence scores or similar metrics associated with each step in the solution process.
### Key Observations
The solution explicitly states the object is bilaterally symmetrical. The final answer is "A. yes", indicating the image does possess symmetry. The `mc` values suggest a higher confidence in the first step of the solution compared to the second.
### Interpretation
This document demonstrates a VQA task where a model is presented with an image and a question, and then provides a reasoned solution. The inclusion of confidence scores (`mc`) suggests this is part of a system that evaluates its own reasoning process. The problem is designed to test the model's ability to recognize symmetry in a visual representation. The step-by-step solution provides a clear explanation of the reasoning process, making it easy to understand how the model arrived at the answer. The fact that the confidence score is higher for identifying the object and its symmetry than for concluding the answer suggests that the initial visual analysis is more reliable than the final decision-making step.
</details>
(e)
<details>
<summary>x13.png Details</summary>

### Visual Description
## Chart: Democrats Hold Advantages on Environment, Abortion, Education; Neither Party Has Edge on the Economy
### Overview
The image presents a chart comparing the percentage of Democrats and Republicans who believe their party is better able to handle various issues. Alongside the chart is a question and a step-by-step solution demonstrating how to calculate the ratio between Democratic and Republican viewpoints on foreign policy.
### Components/Axes
* **Title:** "Democrats Hold Advantages on Environment, Abortion, Education; Neither Party Has Edge on the Economy" (Top-center)
* **Subtitle:** "% who say each party is better able to handle…" (Below Title)
* **Issues (Y-axis):** Gun control, Terrorist threat, Budget deficit, Taxes, Immigration, Economy, Foreign policy, Health care, Education, National collaboration, Environment (Listed vertically on the left)
* **Party (X-axis):** Democratic Party, Republican Party, Diff (Listed horizontally)
* **Data Values:** Percentages are displayed for each issue and party. The "Diff" column shows the percentage point difference between the two parties.
* **Question:** "What is the ratio between Democratic and Republican Party in Foreign policy?" (Top-right)
* **Solution Steps:** Numbered steps (1-6) with calculations and explanations.
* **Source:** "Survey conducted July 20-26, 2015" (Bottom-left)
* **Logo:** "PEW RESEARCH CENTER" (Bottom-left)
### Detailed Analysis or Content Details
**Data Table:**
| Issue | Democratic Party (%) | Republican Party (%) | Diff (%) |
| --------------------- | -------------------- | -------------------- | -------- |
| Gun control | 50 | 42 | +8 |
| Terrorist threat | 34 | 44 | -10 |
| Budget deficit | 37 | 41 | -4 |
| Taxes | 41 | 42 | -1 |
| Immigration | 42 | 40 | +2 |
| Economy | 44 | 44 | 0 |
| Foreign policy | 41 | 38 | +3 |
| Health care | 46 | 38 | +8 |
| Education | 54 | 34 | +20 |
| National collaboration | 31 | 52 | -21 |
| Environment | 57 | 28 | +29 |
**Solution Steps (Transcribed):**
* **Step-1 (m=4.625):** To determine the ratio between the percentage of Democrats and Republicans who believe that their party is better able to handle foreign policy, we need to look at the corresponding values in the table.
* **Step-2 (m=4.3125):** 1. Identify the percentage of Democrats who think the Democratic Party is better at handling foreign policy. In the "Foreign policy" row, the percentage is 41%.
* **Step-3 (m=4.375):** 2. Identify the percentage of Republicans who think the Republican Party is better at handling foreign policy. In the "Foreign policy" row, the percentage is 38%.
* **Step-4 (m=4.6875):** 3. The ratio between the percentage of Democrats and the percentage of Republicans is calculated by dividing the percentage of Democrats by the percentage of Republicans.
* **Step-5 (m=4.75):** \(\text{Ratio} = \frac{\text{Percentage of Democrats}}{\text{Percentage of Republicans}}\)
* **Step-6 (m=4.0):** \(\text{Ratio} = \frac{41}{38} \approx 1.079\). Final answer: \(\frac{41}{38}\)
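The ratio computed in the steps above can be reproduced with a small sketch (values read from the Foreign policy row of the table):

```python
from fractions import Fraction

dem, rep = 41, 38           # Foreign policy row: Democratic %, Republican %
ratio = Fraction(dem, rep)  # exact ratio, 41/38
approx = dem / rep          # decimal approximation
print(ratio, round(approx, 3))
```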
### Key Observations
* Democrats consistently score higher than Republicans on issues like gun control, health care, education, and the environment.
* Republicans have an advantage on terrorist threats and national collaboration.
* The two parties are evenly matched on the economy and taxes.
* The difference in perception is most significant for the environment (+29%) and education (+20%).
* The calculated ratio for foreign policy is approximately 1.079, indicating a slight advantage for Democrats.
### Interpretation
The data suggests a clear partisan divide in perceptions of which party is better equipped to handle various issues. Democrats are seen as stronger on social and environmental issues, while Republicans are favored on security and collaboration. The foreign policy ratio, while close to 1, indicates a slight preference for the Democratic Party's approach. The step-by-step solution demonstrates a simple calculation to quantify this difference. The inclusion of the calculation and the explicit identification of the values used in the calculation suggests the document is intended to be instructional, demonstrating how to derive insights from the presented data. The "Diff" column provides a quick visual comparison of the partisan gap on each issue. The source information (Pew Research Center) lends credibility to the data.
</details>
(f)
Figure 6: More data examples from VisualPRM400K.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Diagram: Elements of Art
### Overview
The image presents a diagram illustrating the "Elements of Art". It consists of a header with the title "Elements of Art", followed by six visual representations of different elements, each accompanied by a label. A question and solution related to the diagram are also present on the right side of the image.
### Components/Axes
The diagram features the following elements:
* **Line:** Represented by a series of vertical lines.
* **Shape:** Represented by a square with a pattern.
* **Form:** Represented by a three-dimensional shape resembling a cube.
* **Space:** Represented by a series of red dots.
* **Texture:** Represented by a patterned, swirling design.
* **Color:** Represented by a color wheel.
The right side of the image contains a question: "Based on <image !>. When lines meet to form an enclosed area, this is formed." with multiple-choice answers: A. Space B. Shape C. Form D. Line.
Below the question is a "Solution" section, broken down into steps 1-6.
### Content Details
The "Solution" section provides definitions for each element of art:
* **Step 1:** Space - refers to the area between or around elements of art. It can be positive (occupied by objects) or negative (empty).
* **Step 2:** Shape - is an enclosed area created by lines or other elements. It can be geometric or organic.
* **Step 3:** Form - refers to the three-dimensional aspect of an object, giving it volume and depth.
* **Step 4:** Line - is a mark made by a moving point.
* **Step 5:** When lines meet to form an enclosed area, this is called a "Shape."
* **Step 6:** Final answer: B.
### Key Observations
The diagram clearly defines and visually represents the fundamental elements of art. The solution section provides a step-by-step explanation of how to arrive at the correct answer to the question. The question focuses on the relationship between lines and shapes.
### Interpretation
The diagram serves as an educational tool to introduce and explain the core concepts of visual art. It demonstrates how basic elements like lines, shapes, and forms combine to create more complex artistic expressions. The question and solution section reinforce the understanding of these concepts through a practical application. The diagram is designed to be self-explanatory, with each element clearly labeled and visually distinct. The solution section provides a logical reasoning process, making it easy to understand why "Shape" is the correct answer. The diagram is a foundational piece for art education, providing a basic vocabulary and understanding of visual principles.
</details>
(a)
<details>
<summary>x15.png Details</summary>

### Visual Description
## Educational Material: Music Theory Problem & Solution
### Overview
The image presents a music theory problem and its step-by-step solution. It includes a musical staff with notes, a multiple-choice question asking for the key of the melody, and a detailed solution explaining the reasoning.
### Components/Axes
The image is structured into three main sections:
1. **Musical Notation:** A staff with notes representing a melody.
2. **Question:** A multiple-choice question with four options: A. B major, B. E major, C. A minor, D. E minor.
3. **Solution:** A breakdown of the solution into steps, labeled as Step-0, Step-1, Step-2, and Step-3. Each step is further categorized as either "Neutral" or "Correct".
### Content Details
**Musical Notation:**
The musical staff displays a melody with a key signature of one sharp (F#). The notes appear to be primarily within the range of a fifth, centered around E.
**Question:**
"Choose the correct key of each of the melody. <Image 1>"
Options:
A. B major
B. E major
C. A minor
D. E minor
**Solution:**
* **Step-0 (Neutral):** "To determine the key of the melody, we analyze the key signature, accidentals, and the tonal center. Here's the breakdown:"
* **Step-1 (Correct):** "**Key Signature**": The melody has a key signature with one sharp (F#), which corresponds to either "**G major**" or "**E minor**".
* **Step-2 (Correct):** "**Accidentals**": The melody contains a D# accidental, which is the leading tone in the key of E minor. This suggests the melody is in "**E minor**".
* **Step-3 (Correct):** "**Tonal Center**": The melody revolves around E as the tonic, further confirming "**E minor**" as the key.
* **Correct Answer:** "**D. E minor**"
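The reasoning above can be sketched as a small lookup, assuming a simplified mapping from sharp counts to candidate keys and from minor keys to their raised leading tones (both tables are illustrative fragments, not a complete music-theory model):

```python
# Simplified key identification: the key signature narrows the candidates;
# a raised leading tone then selects the relative minor.
KEYS_BY_SHARPS = {0: ("C major", "A minor"),
                  1: ("G major", "E minor"),
                  2: ("D major", "B minor")}
LEADING_TONE = {"A minor": "G#", "E minor": "D#", "B minor": "A#"}

def identify_key(num_sharps, accidentals):
    major, minor = KEYS_BY_SHARPS[num_sharps]
    return minor if LEADING_TONE[minor] in accidentals else major

print(identify_key(1, {"D#"}))  # melody with one sharp and a D# accidental
```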
### Key Observations
The solution methodically breaks down the process of key identification. It starts with the key signature, then considers accidentals, and finally confirms the tonal center. The presence of the D# accidental is crucial in determining the key as E minor rather than G major.
### Interpretation
This material is designed to teach students how to determine the key of a melody. It emphasizes a systematic approach, starting with the key signature and then refining the possibilities based on accidentals and tonal center. The step labels record step-level judgments: each substantive step is judged "Correct", while the introductory step, which makes no checkable claim of its own, carries a non-committal label. The step-by-step explanation provides a clear and logical pathway to understanding the underlying principles of music theory. The image demonstrates a pedagogical approach to music theory, focusing on analytical skills and problem-solving. The inclusion of the musical notation allows for direct application of the concepts being taught.
</details>
(b)
<details>
<summary>x16.png Details</summary>

### Visual Description
## Chart: Phillips Curve and Money Supply Growth
### Overview
The image presents a Phillips Curve diagram illustrating the relationship between inflation and unemployment rates, with an analysis of how increased money supply growth affects the economy in the short and long run. The chart includes two curves, labeled "Curve 1" and "Curve 2", and several points (A, B, C, D, E, P) marked on the curves. Alongside the chart is a step-by-step solution to a question regarding the economic movement when the money supply growth rate increases.
### Components/Axes
* **X-axis:** Unemployment Rate (%) - Scale ranges from 0 to 10, with increments of 1.
* **Y-axis:** Inflation Rate (%) - Scale ranges from 0 to 10, with increments of 1.
* **Curves:**
* Curve 1: A downward-sloping curve, colored green.
* Curve 2: A downward-sloping curve, colored blue.
* **Points:**
* A: Approximately (6, 4)
* B: Approximately (6, 3)
* C: Approximately (6, 1)
* D: Approximately (8, 7)
* E: Approximately (2, 2)
* P: Approximately (6, 1)
* **Solution Steps:** Numbered from #Step-0 to #Step-7, each carrying a step-level judgment ("Correct", "Incorrect", or "Neutral").
### Detailed Analysis or Content Details
The chart shows the initial state of the economy at point B (Unemployment Rate ≈ 6%, Inflation Rate ≈ 3%). The solution explains that an increase in the money supply typically leads to higher inflation in the short run, moving the economy along the Phillips Curve to a point with higher inflation but not changing the natural rate of unemployment. In the long run, the economy returns to its natural rate of unemployment but with a higher inflation rate; the solution nevertheless identifies this long-run point as C, even though C (≈ (6, 1)) lies at a lower inflation rate than B, which is why that step is marked incorrect.
Here's a breakdown of the solution steps:
* **#Step-0 (Correct):** Introduces the Phillips Curve and its inverse relationship between inflation and unemployment.
* **#Step-1 (Correct):** Identifies the initial point B (6%, 3%).
* **#Step-2 (Correct):** Explains the impact of increased money supply growth on inflation.
* **#Step-3 (Correct):** Describes the Phillips Curve movement in the long run.
* **#Step-4 (Incorrect):** States that the new point does not have the same unemployment rate.
* **#Step-5 (Neutral):** A framing step judged neither correct nor incorrect.
* **#Step-6 (Incorrect):** Incorrectly states that the economy moves from B to C.
* **#Step-7 (Incorrect):** Gives the final answer as A.
The text accompanying the chart provides the following information:
* The Phillips Curve represents the inverse relationship between inflation and unemployment.
* The natural rate of unemployment is also known as the Non-Accelerating Inflation Rate of Unemployment (NAIRU).
* Increased money supply growth leads to higher inflation in the short run.
* In the long run, the economy returns to its natural rate of unemployment, but with a higher inflation rate.
### Key Observations
* The solution steps contain errors, specifically in #Step-4, #Step-6, and #Step-7.
* The chart visually demonstrates the short-run and long-run effects of increased money supply growth on the Phillips Curve.
* Point C lies at the same unemployment rate as the initial point B but at a lower inflation rate, which contradicts the long-run prediction of higher inflation; accordingly, the step concluding that the economy moves from B to C is marked incorrect.
* The curves are not precisely defined mathematically, and the point locations are approximate.
### Interpretation
The data suggests that while increased money supply growth can temporarily reduce unemployment (moving along the Phillips Curve), it ultimately leads to higher inflation without a permanent reduction in unemployment. The long-run Phillips Curve is vertical at the natural rate of unemployment, indicating that monetary policy cannot sustainably lower unemployment below this level. The errors in the solution steps highlight the complexity of understanding macroeconomic concepts and the importance of careful analysis. The chart and accompanying text are designed to illustrate a fundamental principle in economics – the trade-off between inflation and unemployment – and the limitations of monetary policy in achieving long-term economic goals. The positioning of points B and C on the same vertical line (approximately x=6) visually reinforces the idea that the unemployment rate returns to its initial level in the long run.
</details>
(c)
<details>
<summary>x17.png Details</summary>

### Visual Description
## Problem Solving Example: MathVision Data Example
### Overview
The image presents a math problem with a visual representation of addition and a step-by-step solution process. The problem involves arranging the digits 2, 3, 4, and 5 into two numbers to maximize their sum. The solution attempts to demonstrate the logic, but contains errors.
### Components/Axes
The image is structured into three main sections:
1. **Problem Statement:** Includes the question and a hint.
2. **Visual Representation:** Shows two empty boxes representing the numbers to be added.
3. **Solution:** A numbered list of steps detailing the solution process, with indications of whether each step is correct or incorrect.
### Content Details
**Problem Statement:**
* **Question:** "Please answer the question and provide the final answer at the end. Question: Each of the digits 2, 3, 4 and 5 will be placed in a square. There then will be two numbers, which will be added together. What is the biggest number that they could make?"
* **Hint:** "To maximize the sum of the two numbers, we should form the largest possible numbers by arranging the digits in descending order. Here's how:"
**Visual Representation:**
* Two empty rectangles are shown, separated by a "+" sign. These represent the two numbers to be formed and added.
**Solution:**
* **Step 0 (Correct):** "To maximize the sum of the two numbers, we should form the largest possible numbers by arranging the digits in descending order. Here's how:"
* **Step 1 (Incorrect):** "1. Arrange the digits \(2, 3, 4, 5\) in descending order: \(5, 4, 3, 2\) ."
* **Step 2 (Incorrect):** "2. Split them into two numbers such that the largest digits are in the tens place. This ensures the largest sum. - First number: \(54\) ) - Second number: \(32\) )"
* **Step 3 (Correct):** "3. Add the two numbers: \(\ln 54 + 32 = 86 \ln \)"
* **Step 4 (Incorrect):** "## Final Answer: **86**"
### Key Observations
* The solution attempts to maximize the sum by arranging the digits in descending order, which is a valid strategy.
* However, the splitting of the digits into 54 and 32 is not optimal. The largest possible sum is achieved by forming the numbers 52 and 43 (or 53 and 42).
* The final answer of 86 is incorrect. The correct answer is 95 (52 + 43 or 53 + 42).
* The use of `\ln` is unexplained and appears to be an error.
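The optimal pairing claimed above can be verified by brute force over all digit placements:

```python
from itertools import permutations

digits = (2, 3, 4, 5)
# Form two two-digit numbers (ab + cd) from each arrangement and take the max sum.
best = max(10 * a + b + 10 * c + d for a, b, c, d in permutations(digits))
print(best)  # 95, e.g. 52 + 43 or 53 + 42
```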
### Interpretation
The image demonstrates a problem-solving approach to a mathematical puzzle. The solution process highlights a common strategy for optimization (arranging digits in descending order). However, it also reveals a potential error in the application of this strategy, leading to an incorrect final answer. The inclusion of "Correct" and "Incorrect" labels for each step suggests this is a learning example, intended to illustrate both successful and flawed reasoning. The presence of the `\ln` notation is anomalous and likely a typographical error. The problem is designed to test understanding of place value and maximizing sums. The incorrect answer suggests a misunderstanding of how to optimally distribute the digits to achieve the largest possible sum.
</details>
(d)
<details>
<summary>x18.png Details</summary>

### Visual Description
## Diagram: Quadratic Function Piece Identification
### Overview
The image presents a mathematical problem involving identifying which of five graph pieces does not belong to the same quadratic function. It includes five small graphs (A-E), a question, answer choices, and a step-by-step solution. The solution explains the properties of quadratic functions and analyzes each graph piece.
### Components/Axes
The image is divided into several sections:
* **Header:** "Data Example from MathVision"
* **Question Section:** Contains the question prompt and answer choices.
* **Graph Section:** Displays five graphs labeled A, B, C, D, and E. Each graph has x and y axes.
* **Solution Section:** Provides a step-by-step solution with annotations.
The graphs have the following axes:
* **x-axis:** Ranges from approximately -2 to 4 for graphs A, B, D, and E. Graph C ranges from -1 to 3.
* **y-axis:** Ranges from approximately -2 to 2 for all graphs.
### Content Details
**Question:**
"Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.
Question: Four of the following five pictures show pieces of the graph of the same quadratic function. Which piece does not belong?
Choices: (A) A (B) B (C) C (D) D (E) E"
**Graphs:**
* **(A):** A parabola opening upwards. The vertex is near (0, -1). The curve increases from left to right.
* **(B):** A parabola opening downwards. The vertex is near (0, 1). The curve decreases from left to right.
* **(C):** A straight line with a negative slope. It passes through approximately (0, 1) and (2, -1).
* **(D):** A parabola opening upwards. The vertex is near (2, -2). The curve increases from left to right.
* **(E):** A parabola opening downwards. The vertex is near (2, 1). The curve decreases from left to right.
**Solution:**
* **Step-0 (Neutral):** "To solve this, we analyze the graphs:"
* **Step-1 (Correct):** "- A quadratic function has a parabolic shape, either opening upwards or downwards. - The correct graphs must align with this property."
* **Step-2 (Correct):** "### Analysis:
**Option A**: A curve increasing from left to right, consistent with a quadratic function opening upwards.
**Option B**: A curve decreasing from left to right, consistent with a quadratic function opening downwards.
**Option C**: A line, not a curve, inconsistent with a quadratic function.
**Option D**: A curve increasing from left to right, consistent with a quadratic function opening upwards.
**Option E**: A curve decreasing from left to right, consistent with a quadratic function opening downwards."
* **Step-3 (Correct):** "### Conclusion:
The graph in **C** does not belong because it is not a parabola."
* **Step-4 (Correct):** "**Correct Answer: C**"
### Key Observations
The solution correctly identifies graph C as the outlier because it represents a linear function, not a quadratic function. Graphs A, B, D, and E all exhibit parabolic shapes, consistent with quadratic functions.
### Interpretation
The problem tests the understanding of quadratic functions and their graphical representation. A quadratic function produces a parabolic curve when graphed, and the key characteristic used to identify the outlier is the shape of the graph: a straight line cannot be part of a quadratic's graph. The step-by-step solution breaks the problem down by first stating the defining property of a quadratic function (parabolic shape) and then checking each graph piece against it. The "Neutral" label on Step-0 marks a preamble that is neither correct nor incorrect. The problem is designed to assess conceptual understanding rather than complex calculations.
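The parabola-versus-line distinction that drives this solution can also be checked numerically: on evenly spaced x-values, a quadratic has constant nonzero second differences, while a line's second differences are all zero. A minimal sketch (the sample functions are illustrative, not the exact curves in the figure):

```python
def is_quadratic(ys):
    """True if evenly spaced samples have constant, nonzero second
    differences (the discrete signature of a quadratic)."""
    d1 = [b - a for a, b in zip(ys, ys[1:])]
    d2 = [b - a for a, b in zip(d1, d1[1:])]
    return len(set(d2)) == 1 and d2[0] != 0

parabola = [x * x for x in range(-2, 5)]   # a piece of y = x^2
line = [1 - x for x in range(-2, 5)]       # a line, like graph C

print(is_quadratic(parabola), is_quadratic(line))  # -> True False
```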
</details>
(e)
<details>
<summary>x19.png Details</summary>

### Visual Description
## Diagram: Fish and Ring Problem Solution
### Overview
The image presents a visual problem involving fish swimming around a ring, along with a step-by-step solution process. The problem asks how many fish will have their heads pointing towards the ring when a curved line is straightened. The solution is presented as a series of steps, each annotated as "Neutral" (neither correct nor incorrect), "Correct", or "Incorrect".
### Components/Axes
The image is divided into three main sections:
1. **Visual Problem:** A depiction of several fish swimming around a ring.
2. **Question:** A text box containing the problem statement.
3. **Solution:** A numbered list detailing the steps to solve the problem, with annotations indicating the step's validity.
The question is: "How many fish will have their heads pointing towards the ring when we straighten the line?"
The hint is: "Please answer the question and provide the final answer at the end."
### Content Details
The solution is broken down into the following steps:
* **# Step-0 (Neutral):** "To determine how many fish will have their heads pointing towards the ring when we straighten the line, follow these steps: 1. Identify the fish with their heads pointing towards the ring in the current configuration. 2. Visualize straightening the line and see which fish’s heads will align with the ring."
* **# Step-1 (Correct):** "1. Identify the fish with their heads pointing towards the ring in the current configuration."
* **# Step-2 (Neutral):** "Step-by-step reasoning:"
* **# Step-3 (Incorrect):** "1. In the current configuration, the fish with their heads pointing towards the ring are: - The green fish on the left side."
* **# Step-4 (Incorrect):** "2. When the line is straightened, the fish with their heads pointing towards the ring will be those whose heads are aligned with the ring’s position in the straightened line."
* **# Step-5 (Incorrect):** "From the image, it appears that only the green fish at the far left is aligned with the ring’s position in the straightened line."
* **# Step-6 (Incorrect):** "Final answer: 3"
The image shows approximately 10 fish of various colors (green, orange, purple, and gray) swimming in a circular pattern around a golden ring. The line connecting some of the fish is curved.
### Key Observations
The solution process appears to be flawed. While the initial steps are logical, the identification of fish pointing towards the ring in the current configuration is incorrect. The solution only identifies the green fish on the left, while a visual inspection suggests more than one fish is initially aligned with the ring. The final answer of 3 is likely incorrect given the flawed reasoning, and it also contradicts Steps 3 and 5, which identify only a single fish. The "Neutral" and "Correct/Incorrect" annotations indicate a step-wise evaluation of a model-generated solution.
### Interpretation
This image demonstrates a problem-solving scenario where a visual task is combined with logical reasoning. The "Neutral" and "Incorrect" labels reflect a step-wise evaluation of a model-generated solution. The errors in the solution highlight the challenges models face in accurately interpreting visual information and applying logical steps. The problem tests spatial reasoning and the ability to visualize changes in configuration. That the generated solution fails suggests current models may struggle with tasks requiring nuanced visual understanding and accurate counting. The image serves as a case study for evaluating and improving model capabilities in visual problem-solving.
</details>
(f)
<details>
<summary>x20.png Details</summary>

### Visual Description
## Chart/Diagram Type: Number Plane with Inequality Shading & Solution Steps
### Overview
The image presents a number plane (Cartesian coordinate system) with a shaded region representing an inequality. Alongside the plane, a step-by-step solution process is provided to determine the inequality. The solution steps are marked as "Step-1 (Neutral)", "Step-2 (Incorrect)", "Step-3 (Incorrect)", and "Step-4 (Incorrect)". The question posed is: "According to the question shown in the image, please directly answer the question and provide the final value, e.g., 1, 2, 5, 300."
### Components/Axes
* **Number Plane:**
* X-axis: Ranges from approximately -15 to 15.
* Y-axis: Ranges from approximately -10 to 10.
* Origin: Located at (0, 0).
* Gridlines: Present for both axes, aiding in coordinate identification.
* **Shaded Region:** The area above a solid line is shaded.
* **Line:** A solid line passes through the points (-5, -5) and (0, 5).
* **Solution Steps:** A series of text blocks outlining the reasoning and calculations.
* **Question:** Text block at the top-left corner.
* **Header:** "Data Example from MathVerse" at the top-center.
### Detailed Analysis or Content Details
**Number Plane:**
* The line intersects the x-axis at approximately x = -5 and the y-axis at y = 5.
* The shaded region extends infinitely upwards from the line.
**Solution Steps (Transcribed):**
* **Step-1 (Neutral):**
* **Reasoning:**
* "The problem asks to find the inequality that represents the shaded region in the number plane."
* "The boundary line is shown as a solid line, indicating that the line itself is included in the inequality (≤ or ≥)."
* "The shaded region lies above the line."
* **Step-2 (Incorrect):**
* **Find the Equation of the Line:**
* "To derive the boundary line equation, two points on the line are identified:"
* "Point 1: (-5, -5)"
* "Point 2: (0, 5)"
* "Using the slope formula, calculate the slope (m):"
* `m = (y₂ - y₁) / (x₂ - x₁) = (5 - (-5)) / (0 - (-5)) = (10) / (5) = 2`
* "The line equation is in slope-intercept form (y = mx + b). Substitute one point to find b:"
* `5 = 2(0) + b implies b = 5`
* "The equation of the line is:"
* `y = 2x + 5`
* **Step-3 (Incorrect):**
* **Determine the Inequality:**
* "Since the shaded region is above the line, the inequality is y > (geq 2x + 5)."
* **Step-4 (Incorrect):**
* **Final Answer:**
* `y (geq 2x + 5)`
**Question:**
* "According to the question shown in the image, please directly answer the question and provide the final value, e.g., 1, 2, 5, 300."
### Key Observations
* The solution steps are labeled as "Incorrect" despite presenting a seemingly logical process.
* The final answer is transcribed as `y (geq 2x + 5)`, which is garbled LaTeX; it is intended to read `y ≥ 2x + 5`.
* The slope calculation and line-equation derivation are consistent with the two points read from the graph.
* The "Incorrect" labels therefore appear to reflect the step-level annotation rather than an arithmetic slip visible in the transcription.
### Interpretation
The image demonstrates a problem-solving approach to determining the inequality represented by a shaded region on a number plane. The solution finds the equation of the boundary line and then chooses the inequality direction from the shaded region's position relative to the line; the derivation is consistent with the points read from the graph, yet the steps are marked "Incorrect". The question is also a mismatch: it asks for a single numerical value while the problem calls for an inequality as its answer. The "Neutral" label on Step-1 marks the framing step, which is neither correct nor incorrect, and the subsequent "Incorrect" labels reflect the step-level annotation. The image serves as a case study in identifying errors in mathematical reasoning and the importance of careful attention to detail.
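As a numeric check on Step-2 of the transcription (using the two points read from the graph; the points themselves are this description's approximations):

```python
def line_through(p1, p2):
    """Slope m and intercept b of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    return m, b

m, b = line_through((-5, -5), (0, 5))  # m = 2.0, b = 5.0

def in_shaded_region(x, y):
    """Solid boundary with shading above the line: y >= m*x + b."""
    return y >= m * x + b
```

With these points the boundary is y = 2x + 5, matching the transcribed derivation.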
</details>
(g)
<details>
<summary>x21.png Details</summary>

### Visual Description
## Diagram: Geometry Problem Solution
### Overview
The image presents a geometry problem with a diagram, a question, and a step-by-step solution. The problem involves finding the degree of an angle formed by a tangent to a circle. The solution is presented with correctness indicators for each step.
### Components/Axes
The image is divided into three main sections:
1. **Problem Statement & Diagram (Left):** Contains the geometric figure, the question, and answer choices.
2. **Solution Steps (Right):** A numbered list detailing the solution process. Each step is marked as "Correct" or "Incorrect".
3. **Textual Explanation:** Detailed explanation of each step.
The diagram shows a circle with center O. Point C is outside the circle, and a tangent line is drawn from C to the circle, touching the circle at point A. Point B is on the circle, and line segment CB intersects the circle. Angle CBD is labeled as 50°.
The question asks for the degree of angle A. The answer choices are: A:20°, B:25°, C:40°, D:50°.
### Detailed Analysis or Content Details
**Diagram Details:**
* Circle center: O
* Tangent line from C to A
* Angle CBD = 50°
* Line segments: CA, CB, CD (where D is a point on CB)
**Question:**
"According to the question shown in the image, please directly answer the question and provide the correct option letter, e.g., A, B, C, D."
**Solution Steps:**
* **Step-1 (Correct):** "Identify the key points: The arcs AC and BC. Angle A is subtended by these arcs at point O (the center of the circle)."
* **Step-2 (Correct):** "Understand the given information: CD is a tangent at point C. The angle between the tangent and the chord is equal to the angle subtended by the chord at the circumference. Therefore, angle CDB = 50°/0060."
* **Step-3 (Incorrect):** "Angle properties: The angle A is subtended by arc BC at the center. Since CD is a tangent, angle A = 50°/0060 (as the inscribed angle is half the central angle)."
* **Step-4 (Correct):** "Thus, the degree of angle A is 20°/0060."
* **Step-5 (Correct):** "Final answer: A"
### Key Observations
* Step 3 is marked as incorrect, despite presenting a seemingly logical application of the tangent-chord angle theorem; the calculation in Step 3 appears to be flawed.
* The final answer is A, corresponding to 20°.
* The notation "/0060" appears after some angle values, which is likely a formatting artifact or an error.
### Interpretation
The image demonstrates a geometry problem-solving process. The solution attempts to leverage the relationship between tangent-chord angles and central angles. However, there's an error in Step 3, where the application of the theorem seems incorrect, leading to a flawed calculation. Despite this error, the final answer is correctly identified as 20°. The inclusion of "Correct" and "Incorrect" labels for each step provides a self-assessment mechanism for the solution process. The "/0060" notation is anomalous and likely a data entry or rendering issue. The problem highlights the importance of careful application of geometric principles and accurate calculations. The question is presented in a slightly awkward phrasing, suggesting it might be automatically generated or translated.
</details>
(h)
<details>
<summary>x22.png Details</summary>

### Visual Description
## Mathematical Problem & Solution: Roots of Quadratic Functions
### Overview
The image presents a mathematical problem involving finding the common roots of two quadratic functions, f(x) and g(x). It includes a graph of f(x), the question, a step-by-step solution with reasoning, and multiple-choice answers. The solution attempts to find the roots of each function and then compare them. The steps are marked as "Correct" or "Incorrect", with the opening step labeled "Neutral".
### Components/Axes
* **Graph:** A Cartesian coordinate system with the x-axis ranging from -7 to 7 and the y-axis ranging from -5 to 7. The function f(x) = x² - x - 6 is plotted as a parabola. The parabola is labeled "y = f(x)".
* **Question:** "According to the question shown in the image, please directly answer the question and provide the correct option letter, e.g., A, B, C, D."
* **Problem Statement:** "f(x)=x²-x-6. How many roots do the functions have in common, f(x) and g(x)?"
* **Choices:**
* A: f and g share the same roots.
* B: f and g share one root in common but each have another root that is not shared.
* C: f and g share no roots in common.
* **Solution Steps:** Numbered from 0 to 4, each with a "Correct" or "Incorrect" label.
* **Reasoning:** Detailed explanations for each step.
### Detailed Analysis or Content Details
**1. Graph Analysis:**
The parabola y = f(x) = x² - x - 6 intersects the x-axis at approximately x = -2 and x = 3.
**2. Problem Statement:**
The problem asks for the number of common roots between f(x) = x² - x - 6 and g(x) (which is not explicitly defined, but implied to be a different quadratic function).
**3. Solution Step 0 (Neutral):**
This step is marked as "Neutral"; it contains no mathematical content.
**4. Solution Step 1 (Correct):**
* **Roots of f(x):** The solution states that f(x) = x² - x - 6 can be factored as (x-3)(x+2).
* **Roots:** The roots are identified as x = 3 and x = 2. *Note: There is a typo here. The root should be x = -2.*
**5. Solution Step 2 (Incorrect):**
* **Roots of g(x):** The solution states that observing the graph of g(x), the parabola intersects the x-axis at x = -3 and x = 4.
* **Roots:** The roots are identified as x = -3 and x = 4.
**6. Solution Step 3 (Incorrect):**
* **Comparison of Roots:**
* f(x) has roots x = -3 and x = 2. *Note: This is incorrect. The roots are x = -2 and x = 3.*
* g(x) has roots x = -3 and x = 4.
* The functions f(x) and g(x) share one root in common, x = -3, but they each have another root that is not shared (x = 2 for f(x) and x = 4 for g(x)).
**7. Solution Step 4 (Incorrect):**
* **Answer:** The correct option is B: f(x) and g(x) share one root in common but each have another root that is not shared.
### Key Observations
* There are several errors in the solution. The roots of f(x) are incorrectly stated in multiple steps.
* The function g(x) is not explicitly defined, making it difficult to verify the solution.
* The "Neutral" step contains no mathematical content.
* The solution incorrectly identifies the roots of f(x) as -3 and 2, when they are actually -2 and 3.
### Interpretation
The image presents a flawed solution to a mathematical problem. The errors in identifying the roots of f(x) lead to an incorrect conclusion about the common roots of f(x) and g(x). The overall quality of the solution is poor due to the mathematical errors and the missing definition of g(x). The problem demonstrates the importance of accurate root finding and careful comparison when determining common roots of functions. The solution's graphical reading of g(x) is unreliable without an explicit definition, and its final answer follows only from the incorrectly stated roots, not from the true ones.
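The disputed roots of f(x) = x² − x − 6 are easy to verify with the quadratic formula; a short check:

```python
import math

def real_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0, sorted ascending."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    s = math.sqrt(disc)
    return sorted([(-b - s) / (2 * a), (-b + s) / (2 * a)])

print(real_roots(1, -1, -6))  # -> [-2.0, 3.0], i.e. (x + 2)(x - 3)
```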
</details>
(i)
<details>
<summary>x23.png Details</summary>

### Visual Description
## Diagram: Protractor Angle Measurement Example
### Overview
The image presents a math problem from "Data Example from DynaMath" involving the measurement of an angle using a protractor. It includes a diagram of a protractor with a blue angle indicated, a question asking for the angle's measurement, and a step-by-step solution. The solution attempts to guide the user through the process, but is marked as incorrect at multiple steps.
### Components/Axes
* **Protractor Diagram:** A semi-circular protractor with markings from 0 to 180 degrees. The scale is present on both the inner and outer arcs.
* **Blue Angle:** A blue line segment originating from the center of the protractor, extending to the 40-degree mark.
* **Question:** "Based on the measurement results shown in the diagram, this blue angle is ( ) °."
* **Solution Steps:**
* Step-0 (Neutral)
* Step-1 (Incorrect)
* Step-2 (Incorrect)
* Step-3 (Incorrect)
* **Header:** "Data Example from DynaMath"
### Detailed Analysis or Content Details
The protractor diagram shows an angle starting at 0 degrees and extending to approximately 40 degrees. The protractor has two scales: one increasing clockwise from 0 to 180, and another increasing counter-clockwise from 0 to 180.
The solution steps are as follows:
* **Step-0 (Neutral):** "To determine the measure of the blue angle in the diagram, follow these steps:"
* **Step-1 (Incorrect):**
1. "Observe the protractor: The protractor measures angles from 0° to 180°."
2. "Identify the measurement: The blue angle is indicated by a line pointing to a specific degree on the protractor."
3. "Locate the angle: The blue line points to the number 40° on the protractor."
* **Step-2 (Incorrect):** "Thus, the measure of the blue angle is 40°."
* **Step-3 (Incorrect):** "Final answer: 40"
### Key Observations
The solution steps, despite reading the angle as approximately 40 degrees, are all marked as "Incorrect". A plausible explanation is that the blue angle should be read on the protractor's other scale (giving 140°), or that the expected answer format differs; the transcription alone does not settle which. The protractor diagram is clear and depicts a ray at the 40-degree mark.
### Interpretation
The image demonstrates a basic angle measurement exercise using a protractor. The "Incorrect" labels on every step suggest either a wrong-scale reading or a flawed problem setup. The problem aims to test the ability to read a protractor and determine the angle's measure, and the step labels emphasize checking the reasoning process rather than only the final number.
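One way the 40° reading could be wrong is a wrong-scale error: a protractor's two scales run in opposite directions, so the same ray reads r on one scale and 180 − r on the other. A minimal sketch of that relationship (whether this is the intended error here is an assumption, not stated in the figure):

```python
def other_scale(reading):
    """Equivalent reading on a protractor's opposite scale;
    the two readings of one ray always sum to 180 degrees."""
    return 180 - reading

print(other_scale(40))  # -> 140
```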
</details>
(j)
<details>
<summary>x24.png Details</summary>

### Visual Description
## Data Table: Sales Numbers by City with Solution Steps
### Overview
The image presents a data table listing sales numbers for different cities, followed by a question about the range of these numbers and a step-by-step solution that calculates it. Each solution step is labeled with an assessment of correctness ("Neutral" or "Incorrect").
### Components/Axes
The image contains the following components:
* **Header:** "Data Example from DynaMath"
* **Data Table:** Two columns labeled "city" and "Sales number".
* **Question:** "A real estate agent looked into how many houses were sold in different cities. What is the range of the numbers?"
* **Solution:** A numbered list of steps with explanations and calculations.
* **Step Labels:** Each step is numbered ("Step-0", "Step-1", ...) with an assessment of correctness.
### Content Details
The data table contains the following information:
| City | Sales number |
|---------------|--------------|
| Briggs Corner | 83 |
| New Humburg | 83 |
| Melville | 86 |
| Fremont | 81 |
| Liberty City | 92 |
| Charles Falls | 84 |
| Pleasant Town | 93 |
| Penny Town | 97 |
The question asks for the range of the sales numbers.
The solution steps are as follows:
* **Step-0 (Neutral):** "Let me help you find the range of the sales numbers."
* **Step-1 (Incorrect):** "The range is calculated by subtracting the minimum value from the maximum value."
* **Step-2 (Incorrect):** "Minimum sales number: 81 (Fremont); Maximum sales number: 97 (Penny Town)"
* **Step-3 (Incorrect):** "Range = Maximum − Minimum; Range = 97 − 81 = 16"
* **Step-4 (Incorrect):** "Therefore, the range of house sales numbers across these cities is 16."
### Key Observations
The solution steps correctly state the formula for calculating the range (maximum - minimum). However, the steps are labeled as "Incorrect" despite arriving at the correct answer (range = 16). There appears to be an error in the labeling of the steps. The solution correctly identifies the minimum sales number as 81 (Fremont) and the maximum sales number as 97 (Penny Town).
### Interpretation
The image demonstrates a problem-solving scenario in a mathematical context (likely statistics or data analysis). The data table provides a set of sales numbers for different cities, and the question tests understanding of the statistical concept of "range". The solution walks through the calculation, but the "Incorrect" labels on the steps are surprising given that the final answer (16) follows correctly from the transcribed table; either the assessment disagrees with the transcription or the labels are meant to highlight the process rather than the outcome. The "Neutral" label on Step-0 marks a preamble with no calculation. The stray "n" characters visible in steps 2 and 3 of the image are likely typographical artifacts.
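The range computation transcribed above can be reproduced directly from the table:

```python
# Sales numbers transcribed from the data table.
sales = {
    "Briggs Corner": 83, "New Humburg": 83, "Melville": 86,
    "Fremont": 81, "Liberty City": 92, "Charles Falls": 84,
    "Pleasant Town": 93, "Penny Town": 97,
}

# Range = maximum - minimum.
value_range = max(sales.values()) - min(sales.values())
print(value_range)  # -> 16 (97 - 81)
```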
</details>
(k)
<details>
<summary>x25.png Details</summary>

### Visual Description
\n
## Radar Chart: Data Example from DynaMath
### Overview
The image presents a radar chart alongside a question, multiple-choice answers, and a step-by-step solution. The radar chart visually represents data for five categories (A, B, C, D, and E) on a scale from 0 to 100. The question asks which category is second highest. The solution works through the chart in a series of steps, which are annotated as incorrect.
### Components/Axes
* **Chart Type:** Radar Chart (also known as Spider Chart or Star Chart)
* **Categories:** A, B, C, D, E
* **Scale:** 0 to 100, marked at intervals of 20 (20, 40, 60, 80, 100)
* **Axes:** Five radial axes, one for each category, emanating from the center.
* **Question:** "The radar chart below represents different categories. Which category is second highest?"
* **Choices:** (A) A, (B) B, (C) C, (D) D, (E) E
* **Solution Steps:**
* Step 0 (Neutral): "Looking at the radar chart, I can see that there are 5 categories (A, B, C, D, and E) plotted on a scale from 0 to 100. After analyzing the values:"
* Step 1 (Incorrect): "Category C appears to be the highest. Category D shows the second highest value. Categories A, B, and E have lower values."
* Step 2 (Incorrect): "Therefore, category D (choice D) is the second highest in the radar chart."
* Step 3 (Incorrect): "The answer is (D)."
### Detailed Analysis or Content Details
The radar chart shows the following approximate values:
* **Category A:** ~30
* **Category B:** ~10
* **Category C:** ~90
* **Category D:** ~70
* **Category E:** ~20
The lines representing each category extend from the center of the chart to the corresponding value on its axis. The chart is positioned in the top-left corner of the image. The question and solution are positioned to the right of the chart.
### Key Observations
* Category C has the highest value.
* Categories B and E have the lowest values.
* Based on the approximate readings above, Category D is the second highest, which is also what the solution concludes; the "Incorrect" labels therefore conflict with these approximate readings, suggesting the values read off the chart here may be imprecise.
* The solution steps are labeled with their correctness (Neutral, Incorrect).
### Interpretation
The radar chart is used to visually compare the values of five categories, and the question tests the ability to read off the second highest. Although the solution's conclusion (Category D) agrees with the approximate values transcribed above, its steps are annotated as "Incorrect", indicating that the annotators judged the reading of the chart to be wrong. The example illustrates how small errors in visually estimating values can flip a ranking, and the step labels suggest this is a learning example designed to surface such mistakes in data interpretation.
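Using the approximate values read from the chart in the analysis above, the ranking can be checked mechanically:

```python
# Approximate readings from the radar chart (this description's estimates).
values = {"A": 30, "B": 10, "C": 90, "D": 70, "E": 20}

ranked = sorted(values, key=values.get, reverse=True)
print(ranked[0], ranked[1])  # -> C D  (highest, second highest)
```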
</details>
(l)
<details>
<summary>x26.png Details</summary>

### Visual Description
## Problem Solution: Conical Container Radius Calculation
### Overview
This image presents a math problem and its step-by-step solution, likely from an educational platform (WeMath). The problem involves calculating the radius of the base of a conical water container given the circumference of its upper edge. A diagram of a cone is included, with a height labeled as 20cm. The solution is broken down into numbered steps, each marked as "Correct".
### Components/Axes
The image contains the following components:
* **Question:** A textual problem statement.
* **Multiple Choice Answers:** A. 8, B. 9, C. 10, D. Cannot be determined, E. No correct answer.
* **Solution:** A series of numbered steps explaining the calculation.
* **Diagram:** A 2D representation of a cone with a labeled height.
### Detailed Analysis or Content Details
The problem statement is: "Given that the circumference of the upper edge of a conical water container is 62.8 cm, what is the radius of its base circle in cm?"
The multiple choice answers are: A. 8, B. 9, C. 10, D. Cannot be determined, E. No correct answer.
The solution steps are as follows:
1. **Step-1 (Correct)**: To determine the *radius* of the base of the cone:
The formula for the circumference of a circle is:
`C = 2πr`
where `C` is the circumference and `r` is the radius.
2. **Step-2 (Correct)**: We are given `C = 62.8 \text{ cm}`. Substitute this value into the formula:
`62.8 = 2πr`
3. **Step-3 (Correct)**: Solve for `r`:
`r = \frac{62.8}{2π}`
4. **Step-4 (Correct)**: Use the approximation `π \approx 3.14`:
`r = \frac{62.8}{2 \times 3.14} = \frac{62.8}{6.28} = 10 \text{ cm}`
5. **Step-5 (Correct)**: Thus, the radius of the base circle is *10 cm*.
6. **Step-6 (Correct)**: Final Answer: **C. 10**
The diagram shows a cone with a height of 20cm. The base radius is not labeled.
### Key Observations
* The solution is presented in a clear, step-by-step manner.
* Each step is explicitly marked as "Correct", indicating a verified solution.
* The problem utilizes the standard formula for the circumference of a circle.
* The final answer corresponds to option C in the multiple-choice selection.
* The diagram provides a visual aid but doesn't directly contribute to the numerical solution.
### Interpretation
The image demonstrates a typical math problem-solving approach. It presents a real-world scenario (a conical container) and requires the application of a geometric formula to find the unknown radius. The step-by-step solution breaks down the problem into manageable parts, making it easier to understand and follow. The use of the approximation `π ≈ 3.14` simplifies the calculation. The diagram serves as a visual representation of the problem, aiding in comprehension. The problem is designed to test the student's understanding of circumference and radius relationships. The final answer of 10 cm is consistent with the given circumference of 62.8 cm and the approximation of pi.
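The circumference-to-radius step can be reproduced in one line; as in the transcribed solution, π is approximated by 3.14:

```python
def base_radius(circumference, pi=3.14):
    """Invert C = 2*pi*r to get r = C / (2*pi)."""
    return circumference / (2 * pi)

print(round(base_radius(62.8), 2))  # -> 10.0 (cm), matching answer C
```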
</details>
(m)
<details>
<summary>x27.png Details</summary>

### Visual Description
## Diagram: Geometry Problem Solution
### Overview
The image presents a geometry problem from WeMath, along with its solution. The problem involves a quadrilateral ABCD with EF as the axis of symmetry, and asks for the perimeter of parallelogram ABFD given AD = 3cm and CD = 5cm. The solution is presented in a step-by-step format, with correctness indicators (# Correct) and (# Incorrect) next to each step. A diagram of the quadrilateral is included.
### Components/Axes
The image consists of the following components:
* **Question:** Text describing the geometry problem.
* **Solution:** A numbered list of steps explaining the solution process.
* **Diagram:** A visual representation of the quadrilateral ABCD with EF as the axis of symmetry.
* **Correctness Indicators:** Labels indicating whether each step in the solution is correct or incorrect.
The diagram includes the following labels:
* A, B, C, D: Vertices of the quadrilateral.
* E, F: Points on sides AD and BC respectively, defining the axis of symmetry EF.
* 3cm: Length of AE and ED.
* 5cm: Length of BF and FC.
* EF: Axis of symmetry (dashed line).
### Content Details
**Question:**
"As shown in the diagram, EF is the axis of symmetry for quadrilateral ABCD. Given that CD = 5 cm and AD = 3 cm, what is the perimeter of parallelogram ABFD in cm?"
Options: A. 12 B. 10 C. 14 D. 16 E. No correct answer
**Solution:**
* **Step-1 (# Correct):** "To find the perimeter of parallelogram ABFD, we need to use the given information about the quadrilateral ABCD and its symmetry with respect to EF."
* **Step-2 (# Incorrect):** "Since EF is the axis of symmetry and AD = 3 cm, AD = DF."
* **Step-3 (# Incorrect):** "CD is given as 5 cm. Since EF is the axis of symmetry, CD = BF."
* **Step-4 (# Correct):** "Now, let's calculate the perimeter of the parallelogram ABFD: - AD + DF + BF + AB"
* **Step-5 (# Incorrect):** "Since AD = DF and CD = BF, we can substitute the measurements: - AD = 3 cm - BF = CD = 5 cm"
* **Step-6 (# Incorrect):** "Therefore, the perimeter is: - AD + DF + BF + AB = 3 cm + 3 cm + 5 cm + 5 cm = 16 cm"
* **Step-7 (# Correct):** "Final answer: D"
**Diagram Details:**
* Quadrilateral ABCD is shown with EF as a dashed line representing the axis of symmetry.
* AE = ED = 3cm
* BF = FC = 5cm
### Key Observations
* Steps 2, 3, 5, and 6 in the solution are marked as incorrect, despite the final answer being correct.
* The solution attempts to find the perimeter of ABFD by summing its sides.
* The diagram visually represents the given information about the quadrilateral and its symmetry.
* The solution incorrectly states AD = DF and CD = BF. It should be AB = DF and AD = BF.
### Interpretation
The problem tests understanding of symmetry in quadrilaterals and the properties of parallelograms. The solution attempts to leverage the symmetry to find the perimeter of ABFD. However, the intermediate steps contain logical errors in applying the symmetry property. The final answer is correct, but the reasoning is flawed. The incorrect steps highlight a misunderstanding of how the symmetry affects the side lengths of the parallelogram. The diagram is crucial for visualizing the problem and understanding the relationships between the different segments. The correctness indicators are useful for identifying the errors in the solution process. The solution demonstrates a common mistake of misapplying symmetry properties. The correct approach would be to recognize that AB = DF and AD = BF due to the symmetry.
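Whichever correspondence the symmetry gives, the perimeter of parallelogram ABFD depends only on its two distinct side lengths (opposite sides of a parallelogram are equal), so the arithmetic is a quick check:

```python
def parallelogram_perimeter(a, b):
    """Perimeter of a parallelogram with side lengths a and b."""
    return 2 * (a + b)

print(parallelogram_perimeter(3, 5))  # -> 16 (cm), matching answer D
```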
</details>
(n)
<details>
<summary>x28.png Details</summary>

### Visual Description
## Diagram: WeMath Problem - Book Color Identification
### Overview
The image presents a visual problem from WeMath, depicting two desk setups with stacks of books. The problem asks for the color of the bottommost book on Desk ①. A solution is provided, outlining a step-by-step visual analysis of the book stacks.
### Components/Axes
The image is divided into three main sections:
1. **Desk 1 (Left):** A desk with a stack of three books.
2. **Desk 2 (Right):** A desk with a stack of three books.
3. **Question & Solution (Right):** Textual question and a step-by-step solution.
The question provides multiple-choice answers:
A. Red
B. Green
C. Blue
D. Yellow
E. No correct answer
The solution is structured as a series of steps labeled "# Step-1 (Neutral)", "# Step-1 (Correct)", "# Step-2 (Correct)", and "# Step-3 (Correct)".
### Detailed Analysis or Content Details
**Desk 1 (Left):**
- The stack of books from top to bottom appears to be: Green, Red, Blue.
**Desk 2 (Right):**
- The stack of books from top to bottom appears to be: Yellow, Red, Green.
**Question:**
"As shown in the diagram, what is the color of the bottommost book placed on Desk ①?"
**Solution:**
- **# Step-1 (Neutral):** "To solve this question, we need to visually analyze the stack of books on Desk ① depicted in the diagram. Starting from the top and moving downward, the books are arranged in the following order:"
- **# Step-1 (Correct):**
1. "**Green**" (topmost book)
2. "**Red**" (second book)
3. "**Blue**" (third book, bottommost book)
- **# Step-2 (Correct):** "Thus, the color of the bottommost book on Desk ① is **Blue**."
- **# Step-3 (Correct):** "### Final Answer: **C. Blue**"
### Key Observations
- The problem focuses on visual deduction.
- The solution explicitly states the order of books on Desk ① (Green, Red, Blue).
- The final answer correctly identifies the bottommost book as Blue, corresponding to answer choice C.
- Desk ① is the desk on the left side of the image.
### Interpretation
The image demonstrates a simple problem-solving scenario in which visual information alone answers the question. The solution breaks the process into logical steps, emphasizing the importance of careful observation. The "Neutral" and "Correct" labels indicate an assessment context in which each reasoning step is judged individually. The problem is well-defined, with a single, unambiguous solution derived directly from the visual evidence presented.
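The visual deduction in this example reduces to simple list indexing: if each stack is modeled as a top-to-bottom list of colors, the bottommost book is the last element. A minimal sketch (variable names hypothetical; stack contents taken from the description above):

```python
# Hypothetical model of the two book stacks, listed top to bottom,
# following the description in the example above.
desk_1 = ["Green", "Red", "Blue"]    # Desk (1), left side of the image
desk_2 = ["Yellow", "Red", "Green"]  # right-hand desk

# The bottommost book is the last element of a top-to-bottom list.
bottommost = desk_1[-1]
print(bottommost)  # Blue -> answer choice C
```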
</details>
(o)
Figure 7: More data examples from VisualProcessBench.
<details>
<summary>x29.png Details</summary>

### Visual Description
## Diagram: Geometry Problem Solution
### Overview
The image presents a geometry problem with a diagram of triangle ABC, together with a step-by-step solution transcribed from VisualProcessBench. The problem asks for the measure of angle 1, given that triangle ABC is isosceles with ∠B = 25°. The solution works toward the angle through several reasoning steps, some of which are initially incorrect before the model self-corrects.
### Components/Axes
The image is divided into several sections:
* **Problem Statement:** Top-left corner, containing the question and multiple-choice answers.
* **Diagram:** Left side, showing triangle ABC with labeled angles and sides.
* **Solution Steps:** Right side, a numbered list detailing the reasoning process.
* **Ellipsis:** Representing omitted steps in the solution process.
The diagram includes:
* Triangle ABC
* Angle A = 130 degrees
* Angle B = 25 degrees
* Angle C = 25 degrees
* Exterior angle 1
* Angle adjacent to angle 1 = 35 degrees
The solution steps are labeled "Step-1", "Step-2", etc., and are marked as "Neutral", "Correct", or "Incorrect".
### Detailed Analysis or Content Details
**Diagram Details:**
* Triangle ABC has angle A = 130°, angle B = 25°, and angle C = 25°.
* Angle 1 is an exterior angle at vertex C.
* The angle adjacent to angle 1 is 35°.
**Problem Statement:**
* Question: "As shown in the figure, triangle ABC is an isosceles triangle, ∠B = 25°, then the degree of ∠1 is ()"
* Options: A. 95°, B. 145°, C. 120°, D. No correct answer
**Solution Steps (Transcribed):**
* Step-1 (Neutral): "Let's see. I have this problem here about an isosceles triangle. It says that in triangle ABC, angle B is 25 degrees, and it's an isosceles triangle. I need to find the measure of angle 1. The options are A. 95°, B. 145°, C. 120°, and D. no correct answer."
* Step-2 (Correct): "First, I should recall that in an isosceles triangle, two sides are equal in length, and the angles opposite those sides are also equal. So, in triangle ABC, if angle B is 25 degrees, then either angle A or angle C is also 25 degrees. But wait, looking at the figure, it seems like angle C is also 25 degrees because there's a mark indicating that C is 25 degrees."
* Step-3 (Incorrect): "So, if angle 1 is an exterior angle at point C, then its measure should be equal to the sum of the measures of the angles at points A and B. Given that angle A is 130 degrees and angle B is 25 degrees, then angle 1 should be 130° + 25° = 155°."
* Step-4 (Correct): "But that's not one of the options. Option D says "no correct answer," but maybe I'm missing something. Wait, perhaps the figure is not exactly as I described. Let me look again."
* Step-12 (Correct): "Looking back at the figure, if angle 1 is the exterior angle at C, which is 155 degrees, and there's a 35-degree angle adjacent to it on the straight line, then the angle between them would be 155° - 35° = 120°. Therefore, angle 1 is 120 degrees. Yes, that makes sense. So, the answer should be option C. 120°. Final Answer: \(\boxed{C}\)"
### Key Observations
* The initial attempt to calculate angle 1 (Step-3) is marked incorrect: the exterior angle theorem correctly gives 155° for the full exterior angle at C, but the step misidentifies angle 1 as that entire exterior angle.
* The solution process demonstrates a "think-aloud" approach, with the system verbalizing its reasoning and correcting its mistakes.
* The final answer (120°) is derived by subtracting the adjacent 35° angle from the 155° exterior angle at C.
### Interpretation
The image illustrates a model's problem-solving process as captured in VisualProcessBench. The model initially misidentifies angle 1 as the full exterior angle at C, then revisits the diagram, corrects its reasoning, and arrives at the correct answer, demonstrating reflection: the ability to detect and repair its own mistakes mid-solution. The "Neutral", "Correct", and "Incorrect" labels record the step-wise correctness judgments assigned to the reasoning trace. The step-by-step breakdown is valuable for understanding how careful diagram analysis and accurate application of geometric principles lead to the final answer of 120 degrees, which is consistent with the given information and the properties of exterior angles.
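The angle arithmetic in the solution can be verified with a short sketch (variable names hypothetical; the 130°, 25°, and 35° values come from the diagram description above):

```python
# Hypothetical check of the angle arithmetic in the example above.
A, B = 130, 25            # interior angles at A and B, in degrees
C = 180 - A - B           # angle sum of a triangle gives the 25-degree base angle

# The exterior angle at C equals the sum of the two remote interior angles,
# and equivalently 180 minus the interior angle at C.
exterior_C = A + B
assert exterior_C == 180 - C   # both expressions give 155

# Angle 1 is what remains after removing the adjacent 35-degree angle.
angle_1 = exterior_C - 35
print(angle_1)  # 120 -> option C
```

The assertion makes explicit why Step-3's 155° value is itself correct: the error lay only in equating angle 1 with the full exterior angle rather than its 120° portion.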
</details>
Figure 8: Data example with model reflection from VisualProcessBench. Red highlights the incorrect answer, orange highlights the reflection words, and green highlights the correct answer.