# VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Abstract
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families under Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs across four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model outperforms Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct VisualPRM400K, a multimodal process supervision dataset built with an automated data pipeline. To evaluate multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, which measures the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire future research and contribute to the development of MLLMs. Our model, data, and benchmark are publicly released.

🖂 Corresponding Author: wangwenhai@pjlab.org.cn
1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Bar Chart: Overall Performance Comparison of Policy Models
### Overview
The image is a bar chart comparing the overall performance of different policy models. The y-axis represents "Overall Performance," ranging from 25 to 55. The x-axis represents "Policy Model," with several models listed. There are three data series, each represented by a different color bar: "Pass@1" (green), "BoN with InternVL2.5-8B" (red), and "BoN with VisualPRM-8B (ours)" (blue).
### Components/Axes
* **Y-axis:** "Overall Performance," with a scale from 25 to 55 in increments of 5.
* **X-axis:** "Policy Model," listing the following models: MiniCPM-V2.6, QwenVL2.5-7B, InternVL2.5-8B, InternVL2.5-26B, InternVL2.5-38B, InternVL2.5-78B.
* **Legend:** Located at the top-left of the chart, it identifies the three data series:
* Green: "Pass@1"
* Red: "BoN with InternVL2.5-8B"
* Blue: "BoN with VisualPRM-8B (ours)"
### Detailed Analysis
Here's a breakdown of the performance for each policy model across the three data series:
* **MiniCPM-V2.6:**
* Pass@1 (Green): 29.5
* BoN with InternVL2.5-8B (Red): 28.6
* BoN with VisualPRM-8B (Blue): 37.5
* **QwenVL2.5-7B:**
* Pass@1 (Green): 41.4
* BoN with InternVL2.5-8B (Red): 41.6
* BoN with VisualPRM-8B (Blue): 45.1
* **InternVL2.5-8B:**
* Pass@1 (Green): 32.8
* BoN with InternVL2.5-8B (Red): 33.2
* BoN with VisualPRM-8B (Blue): 41.2
* **InternVL2.5-26B:**
* Pass@1 (Green): 36.9
* BoN with InternVL2.5-8B (Red): 39.1
* BoN with VisualPRM-8B (Blue): 45.8
* **InternVL2.5-38B:**
* Pass@1 (Green): 44.4
* BoN with InternVL2.5-8B (Red): 44.9
* BoN with VisualPRM-8B (Blue): 50.7
* **InternVL2.5-78B:**
* Pass@1 (Green): 46.0
* BoN with InternVL2.5-8B (Red): 46.4
* BoN with VisualPRM-8B (Blue): 51.9
### Key Observations
* The "BoN with VisualPRM-8B (ours)" (blue) consistently outperforms the other two models across all policy models.
* The performance of all three data series generally increases as the "InternVL2.5" model number increases (e.g., from 26B to 78B).
* The "Pass@1" (green) and "BoN with InternVL2.5-8B" (red) models have very similar performance for each policy model, with the red bar being slightly higher in most cases.
### Interpretation
The chart demonstrates that "BoN with VisualPRM-8B (ours)" achieves the highest overall performance compared to "Pass@1" and "BoN with InternVL2.5-8B" across the tested policy models. The trend suggests that increasing the size or complexity of the "InternVL2.5" model generally leads to improved performance for all three configurations. The consistent outperformance of the "VisualPRM-8B" model indicates its effectiveness in the given task or benchmark. The similar performance between "Pass@1" and "InternVL2.5-8B" might suggest a correlation or similar underlying mechanism in their approach.
</details>
Figure 1: The overall Best-of-8 evaluation results across seven multimodal reasoning benchmarks with different critic models. Our VisualPRM greatly enhances the overall performance, while InternVL2.5-8B struggles to be an effective critic model.
With the remarkable success of Large Language Models (LLMs) [75, 76, 21, 5, 71, 10, 9, 1, 4] in natural language processing, Multimodal Large Language Models (MLLMs) [80, 36, 41, 42, 6, 84, 83, 14, 16, 15, 58, 61, 89] have also achieved significant advancements across various vision-language tasks. Despite their strong performance in perception and recognition, a large gap remains in reasoning capabilities between open-source and proprietary models. A series of studies have explored methods to enhance reasoning abilities, focusing on the perspectives of data collection and construction [57, 74, 38, 43], offline preference optimization [59, 82, 32], and online reinforcement learning [66, 24, 2, 26]. Additionally, another line of research [68, 20, 94, 79] investigates utilizing Test-Time Scaling (TTS) to enhance the reasoning abilities of LLMs. This approach requires the policy model to generate multiple response candidates and select the best one, based on the quality estimation of a critic model, thereby improving the response quality at the cost of higher inference time. However, TTS for MLLMs remains largely unexplored.
This work investigates the application of TTS for MLLMs, focusing on Best-of-N (BoN) evaluation strategies. The challenges of adapting TTS for MLLMs involve: (1) Lack of effective critic models. In BoN evaluation, a critic model is required to estimate the quality of each response candidate. However, as shown in Figure 1, existing open-source MLLMs struggle to serve as critic models, leading to marginal improvements compared to models without TTS. This limitation stems from the lack of sufficient critic data in their training corpus. (2) Lack of evaluation benchmarks for multimodal critic models. The effectiveness of TTS heavily depends on the performance of the critic model. However, directly evaluating critics under BoN settings poses two key issues. First, BoN evaluation is expensive: although the focus is on the performance of critic models, the policy model must generate $N$ reasoning processes, and the majority of the computational cost arises from the policy model. Second, BoN performance is also affected by the policy model, making it difficult to compare critic models that are paired with different policy models.
To solve these challenges, we first introduce VisualPRM400K, a dataset comprising approximately 400K multimodal process supervision data. Each sample includes an image, a question, a step-by-step solution, and correctness annotations for each step. Specifically, we collect question prompts from MMPR v1.1 [82] and then generate process correctness using an automatic data pipeline [79]. This pipeline samples multiple continuations starting from a certain step and computes the expected accuracy of that step as the average accuracy of its continuations.
To facilitate the evaluation of multimodal critic models, we introduce VisualProcessBench, a benchmark for evaluating PRMs and MLLMs in detecting erroneous steps in multimodal reasoning tasks. This benchmark includes 2,866 samples with 26,950 human-annotated step-wise correctness labels. Each sample includes a multimodal reasoning question, a step-by-step solution, and correctness labels for each step. To ensure annotation accuracy, we employ human experts with at least a university degree to manually assess the correctness of each step. Unlike prior benchmarks [96, 39], which require identifying only the first erroneous step, VisualProcessBench challenges models to detect all errors within a given solution. This adjustment aligns with recent advancements in model reflection abilities, helping to reduce false negatives in evaluations. Evaluation results reveal that existing open-source MLLMs struggle to accurately assess step-wise correctness, highlighting the need for improved multimodal critic models.
Building upon the dataset and benchmark, we develop VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, to serve as the critic model in BoN evaluation. Each training sample is formulated as a multi-turn chat. The first turn includes the image, the question, and the first solution step, while each subsequent turn presents a new step. The model is trained to predict the correctness of the given step at each turn. Experimental results demonstrate that VisualPRM enhances MLLM reasoning across different model families and scales. Specifically, VisualPRM improves the overall reasoning performance of MiniCPM-V2.6, Qwen2.5-VL-7B, InternVL2.5-8B, and InternVL2.5-78B by 8.0, 3.7, 8.4, and 5.9 points, respectively, across seven multimodal reasoning benchmarks [90, 50, 78, 93, 99, 60, 87]. Additionally, we compare PRMs with Outcome Reward Models and Self-Consistency in BoN evaluation, finding that PRMs consistently outperform both approaches.
In summary, our main contributions are as follows:
(1) We introduce VisualPRM400K, a dataset comprising approximately 400K multimodal process supervision data. Building upon this dataset, we develop VisualPRM, an advanced multimodal PRM to serve as the critic model in the BoN evaluation.
(2) We construct VisualProcessBench, a benchmark designed to measure the abilities of PRMs and MLLMs to identify erroneous steps in multimodal reasoning tasks. This benchmark comprises 2,866 samples with a total of 26,950 human-annotated step-wise correctness labels.
(3) Through extensive experiments, we demonstrate that PRMs can serve as effective critic models for test-time scaling of MLLMs. Specifically, VisualPRM enhances the overall multimodal reasoning performance of MiniCPM-V2.6, Qwen2.5-VL-7B, InternVL2.5-8B, and InternVL2.5-78B by 8.0, 3.7, 8.4, and 5.9 points, respectively, across seven multimodal reasoning benchmarks. Furthermore, our results show that PRMs consistently outperform both ORMs and SC in BoN evaluation. Additionally, experiments on VisualProcessBench reveal that existing open-source MLLMs struggle to accurately assess the correctness of each step.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Math Problems with Solutions
### Overview
The image presents two distinct math problems, each with a question and a step-by-step solution. The first problem involves finding the measure of an angle in a quadrilateral, while the second involves finding the measure of an angle formed by intersecting lines. Each step in the solution is annotated with a label indicating correctness or a confidence score.
### Components/Axes
**Problem 1 (VisualPRM400K):**
* **Question:** Find m ∠S.
* **Diagram:** A quadrilateral labeled Q, R, S, and T.
* ∠R = x°
* ∠T = x°
* ∠Q = (2x + 5)°
* ∠S = (2x + 7)°
* **Solution Steps:**
* Step 0: (mc=0.75)
* Step 1: (mc=0.5)
* Step 2: (mc=0.5625)
* Step 3: (mc=0.5)
* Step 4: (mc=0.0)
* Step 5: (mc=0.0)
**Problem 2 (VisualProcessBench):**
* **Question:** As shown in the figure, AB parallel CD, EG bisects angle BEF, then angle 2 is equal to (). Choices: A:50° B:60° C:65° D:90°
* **Diagram:** Two parallel lines AB and CD intersected by a transversal EG.
* ∠1 is labeled as 50°.
* EG bisects ∠BEF, creating ∠BEG and ∠GEF.
* ∠2 is formed by EG and CD.
* **Solution Steps:**
* Step 0: (Neutral)
* Step 1: (Correct)
* Step 2: (Correct)
* Step 3: (Correct)
* Step 4: (Incorrect)
* Step 5: (Correct)
* Step 6: (Incorrect)
* Step 7: (Incorrect)
### Detailed Analysis
**Problem 1 (VisualPRM400K):**
* **Solution:**
* The solution uses the property that the sum of interior angles of a quadrilateral is 360°.
* Step 1: (2x + 5)° + x° + (2x + 7)° + x° = 360°
* Step 2: 2x + 5 + x + 2x + 7 + x = 360 => 6x + 12 = 360
* Step 3: 6x = 360 - 12 => 6x = 348 => x = 58
* Step 4: m ∠S = 2x + 7 = 2(58) + 7 = 116 + 7 = 123
* Final Answer: 58 (This is incorrect based on the previous step. It should be 123)
**Problem 2 (VisualProcessBench):**
* **Solution:**
* The solution uses properties of parallel lines and angle bisectors.
* Given: AB || CD, EG bisects ∠BEF, ∠1 = 50°
* Since AB || CD, the transversal EG creates alternate interior angles and corresponding angles.
* EG bisects ∠BEF, meaning ∠BEG = ∠GEF.
* ∠BEF = ∠BEG + ∠GEF
* Since EG bisects ∠BEF, ∠BEG = ∠GEF = 50°
* ∠2 and ∠GEF are congruent because AB || CD and EG is a transversal.
* Therefore, ∠2 = ∠GEF = 50°
* Final Answer: The correct option is A.
### Key Observations
* In Problem 1, the final answer provided (58) is the value of 'x', not the measure of angle S, which should be 2x+7 = 123.
* In Problem 2, the steps are labeled as "Correct" or "Incorrect," providing feedback on the solution process.
### Interpretation
The image presents two geometry problems with detailed solutions. Problem 1 demonstrates how to find an unknown angle in a quadrilateral using the property of the sum of interior angles. However, the final answer is incorrect, as it provides the value of 'x' instead of the measure of angle S. Problem 2 demonstrates how to find an angle formed by intersecting parallel lines using properties of angle bisectors and transversals. The steps are labeled with correctness, which is useful for understanding the reasoning process.
</details>
Figure 2: Data examples in VisualPRM400K and VisualProcessBench. For VisualPRM400K, we generate the data using an automatic data pipeline. The key idea is to estimate the expected accuracy $mc_{i}$ of the given step $s_{≤ i}$ based on Monte Carlo sampling and consider the step correct if $mc_{i}>0$ . During the training process of VisualPRM, the data is formulated as multi-turn conversations and the model is required to predict the correctness of each step conditioned on the image, question, and previous steps. For VisualProcessBench, we collect questions from existing multimodal reasoning benchmarks [90, 78, 93, 99, 60] and generate the solutions using leading MLLMs [58, 4, 15, 82, 72]. Based on these questions and solutions, we employ a team of human experts with at least a university degree to manually annotate the correctness of each step in the solutions.
2 Related Work
Multimodal Large Language Models. A wide range of efforts has been made to advance the development of MLLMs, including improvements in model architecture, data construction, and training algorithms. From an architectural perspective, several studies [41, 42, 89, 15, 16, 84, 83, 45, 85, 7, 89] employ connectors to align visual embeddings from Vision Foundation Models (VFMs) [14, 91] with the latent space of LLMs [5, 75, 76, 71], achieving promising performance. Another series of works [3, 21, 73, 81] extends pre-trained LLMs with additional layers to fuse visual features, reducing the number of required visual tokens while introducing extra training cost. In terms of data construction, recent studies have made significant progress [63, 98, 34, 38, 43, 82, 95]. For example, OmniCorpus [38] offers a noisy but large-scale multimodal corpus for pre-training, while MMInstruct [43] provides an open-source, high-quality instruction-tuning dataset. Additionally, MMPR [82] constructs a preference dataset focusing on multimodal reasoning abilities. Regarding training algorithms, the InternVL2.5 series [15, 82] proposes square loss and Mix Preference Optimization to enhance MLLM capabilities. Despite these advancements, existing works primarily focus on the training process of MLLMs, leaving Test-Time Scaling (TTS) for MLLMs largely underexplored. In this work, we investigate TTS applications for MLLMs, specifically focusing on the Best-of-N evaluation to improve multimodal reasoning performance.
Process Reward Models. Reward models play a crucial role in Reinforcement Learning (RL) [64, 66, 2, 26] and TTS [68, 79, 20, 51]. Outcome Reward Models (ORMs) [55, 92, 77] directly assign an overall score to the given response. In contrast, Process Reward Models (PRMs) first estimate the quality of each step in the given response and then aggregate them into a final score. PRM800K [39] is the first open-source process supervision dataset, entirely annotated by human experts. To reduce annotation costs, MathShepherd [79] and OmegaPRM [51] introduce a Monte Carlo sampling-based data pipeline to automatically estimate the quality of each step. Despite these advancements in natural language processing, multimodal PRMs remain largely underexplored. In this work, we introduce VisualPRM400K, the first multimodal process supervision dataset, and develop VisualPRM, a multimodal PRM trained on this dataset.
Benchmarks for Reward Models. The evaluation of reward models (RMs) is a crucial research topic. A series of benchmarks [33, 37, 44] have been proposed to assess the effectiveness of RMs, typically formulated as a binary preference judgment task. Building on this, subsequent work [97] extends the evaluation settings and includes both pairwise and Best-of-N evaluations, providing a more comprehensive evaluation of RM performance. With the rapid advancement of PRMs, a series of benchmarks [96, 69] have been introduced to evaluate their step-wise judgment capabilities. Despite these developments, there remains a lack of a multimodal process benchmark. To bridge this gap and support the development of multimodal PRMs, we introduce VisualProcessBench, a benchmark designed to evaluate the ability of PRMs and MLLMs to detect erroneous steps in multimodal reasoning tasks.
3 Method
<details>
<summary>x3.png Details</summary>

### Visual Description
## Flow Diagram: Math Problem Solving Steps
### Overview
The image is a flow diagram illustrating the steps to solve a geometry problem, specifically finding the measure of angle S in a quadrilateral. It shows the progression of steps, the expected accuracy (mc), and the performance of two different problem-solving methods: Value-based PRM and Advantage-based PRM.
### Components/Axes
* **Nodes:** Rectangular boxes representing steps, accuracy, and performance.
* **Arrows:** Indicate the flow of the problem-solving process.
* **Text Labels:** Describe the steps, accuracy values, and performance assessments.
* **Geometric Diagram:** A quadrilateral labeled with angle measures.
* **Legend (Right Side):**
* Question & Solution (Grey)
* Expected Accuracy (Light Blue)
* Value-based PRM (Light Green)
* Advantage-based PRM (Light Orange)
### Detailed Analysis
**1. Question & Initial Diagram (Top-Left):**
* **Question:** "Find m ∠S."
* **Diagram:** A quadrilateral labeled Q, R, S, and T.
* ∠Q = (2x + 5)°
* ∠R = x°
* ∠T = x°
* ∠S = (2x + 7)°
**2. Step-by-Step Solution (Top Row):**
* **Step-0:** "To find \(m \angle S \) ..."
* **Step-1:** "Write the equation for ..."
* **Step-4:** "Substitute \(x\) back ..."
* **Step-5:** "Final answer: 58"
**3. Expected Accuracy (Second Row):**
* **Step-0:** mc = 0.75
* **Step-1:** mc = 0.5
* **Step-4:** mc = 0.0
* **Step-5:** mc = 0.0
**4. Value-based PRM Performance (Third Row):**
* **Step-0:** Correct (+)
* **Step-1:** Correct (+)
* **Step-4:** Incorrect (-)
* **Step-5:** Incorrect (-)
**5. Advantage-based PRM Performance (Bottom Row):**
* **Step-0:** Good (+)
* **Step-1:** Bad (-)
* **Step-4:** Bad (-)
* **Step-5:** Tie (=)
**6. Ellipsis:**
* There is an ellipsis (...) between Step-1 and Step-4, indicating that some steps are omitted for brevity.
### Key Observations
* The expected accuracy (mc) decreases as the problem progresses, reaching 0.0 in the final steps.
* The Value-based PRM starts with correct steps but ends with incorrect steps.
* The Advantage-based PRM starts with a good step but degrades to a tie in the final step.
### Interpretation
The diagram illustrates a problem-solving process for a geometry question, highlighting the performance of two different methods (Value-based PRM and Advantage-based PRM) at each step. The decreasing expected accuracy suggests that the problem becomes more complex or error-prone as it progresses. The performance of the two methods varies, indicating that their effectiveness depends on the specific step in the problem-solving process. The ellipsis indicates that the diagram is simplified and does not show all the intermediate steps.
</details>
Figure 3: Different modeling methods for PRMs. PRMs are developed to estimate the quality of each step in a given solution. For value-based PRMs, the quality of a certain step is determined by its expected accuracy $mc_{i}$ , where a step is considered correct if $mc_{i}>0$ . For advantage-based PRMs, the quality of a certain step is determined by the improvement of $mc_{i}$ over $mc_{i-1}$ , where a step is considered good if $mc_{i}-mc_{i-1}>0$ . During the training stage, the output space of PRMs is discretized into specific tokens, while during the inference stage, we compute the step score as the weighted sum of the generation probability for these discretized tokens.
During Best-of-N (BoN) evaluation, a critic model is required to estimate the quality of each response candidate. In this work, we formulate the critic model as a Process Reward Model (PRM). To develop a multimodal PRM, we first construct VisualPRM400K, a dataset comprising about 400K multimodal process supervision data, as detailed in Section 3.1. We then describe our PRM’s modeling approach in Section 3.2. Furthermore, to support the evaluation of critic models, we establish VisualProcessBench to measure the abilities of critic models to detect erroneous steps in multimodal reasoning, as introduced in Section 3.3.
3.1 VisualPRM400K
Definition. As shown in Figure 2, each data sample in our VisualPRM400K consists of an image $I\in\mathcal{I}$, a question $q\in\mathcal{Q}$, a step-by-step solution $s=\{s_{0},s_{1},\ldots,s_{n}\}\in\mathcal{S}$, and the expected accuracy annotations $mc=\{mc_{0},mc_{1},\ldots,mc_{n}\}$ with $mc_{i}\in\mathbb{R}_{\geq 0}$ for each step, where $n$ is the number of steps of a certain solution and $mc_{i}$ denotes the expected accuracy of step $s_{i}$. The image set $\mathcal{I}$ and question set $\mathcal{Q}$ are collected from MMPR v1.1 [82], while the step-by-step solutions $\mathcal{S}$ are sampled using InternVL2.5 series models [15, 82].
Process Supervision Generation. Given an image $I$, a question $q$, and a solution $s=\{s_{0},s_{1},\ldots,s_{n}\}$, we annotate the correctness of each step $s_{i}$ using an automatic data pipeline. The key idea is to estimate the expected accuracy of the given steps $s_{\leq i}$ based on Monte Carlo sampling. Specifically, the model is required to complete the solution as follows:
$$
\tilde{s}_{>i}\sim M(\tilde{s}_{>i}\mid I,q,s_{\leq i}), \tag{1}
$$
where $\tilde{s}_{>i}$ is the completion of $s_{\leq i}$. The expected accuracy of $s_{i}$ is then defined as:
$$
mc_{i}=\frac{\text{num(correct completions)}}{\text{num(sampled completions)}}. \tag{2}
$$
Notably, to reduce data construction costs, we cap the number of steps at 12 and evenly merge adjacent steps whenever a solution exceeds this threshold.
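This annotation pipeline can be sketched in Python as follows. Here `sample_completion` and `is_correct` are hypothetical hooks standing in for the policy model's continuation sampling and answer checking, and the even-merging rule is one plausible reading of the step-budget heuristic described above.

```python
from typing import Callable, List, Tuple


def annotate_steps(
    image,
    question: str,
    steps: List[str],
    sample_completion: Callable,
    is_correct: Callable[[str], bool],
    num_continuations: int = 16,
    max_steps: int = 12,
) -> Tuple[List[str], List[float]]:
    """Estimate the expected accuracy mc_i of each step prefix s_<=i
    via Monte Carlo sampling, following Eq. (1) and Eq. (2)."""
    # Evenly merge adjacent steps if the solution exceeds the step budget.
    if len(steps) > max_steps:
        group = -(-len(steps) // max_steps)  # ceiling division
        steps = ["\n".join(steps[i:i + group])
                 for i in range(0, len(steps), group)]
    mcs = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        # Sample continuations of the prefix and check their final answers.
        completions = [sample_completion(image, question, prefix)
                       for _ in range(num_continuations)]
        mcs.append(sum(map(is_correct, completions)) / num_continuations)
    return steps, mcs
```

In the actual pipeline the two hooks would call the policy MLLM and an answer verifier; the sketch only captures the control flow.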
Statistics. During the construction process, we sample $4$ solutions for each image-question pair and split each of them into at most $12$ steps. For each step, we sample $16$ continuations and compute $mc_{i}$ from these continuations. The resulting dataset comprises approximately 400K samples and 2 million steps with process supervision. Each response averages 126.9 words and 5.6 steps, while each step averages 22.6 words. About 10% of these steps are incorrect. Despite this imbalanced distribution of correct and incorrect steps, our PRM demonstrates promising performance, as shown in Section 4.
3.2 VisualPRM
Overview. During the training process, we formulate the process supervision problem as a multi-turn chat task so that we can effectively leverage the generation ability of MLLMs. The image $I$ , question $q$ , and the first step $s_{0}$ of the solution to this question are included in the first turn and a new step is presented in each subsequent turn. The model is required to predict the quality of the given step in each turn as follows:
$$
y_{i}\sim M(y_{i}\mid I,q,s_{\leq i}), \tag{3}
$$
where $y_{i}$ denotes the quality of $i$ -th step.
For value-based PRMs, the quality of a certain step is determined by its expected accuracy $mc_{i}$ , which is similar to the definition of the value function in reinforcement learning. Following Math-Shepherd [79, 20], we require the model to predict the correctness $c_{i}∈\{+,-\}$ of the given step, rather than the exact score of $mc_{i}$ . The $i$ -th step is considered correct if $mc_{i}>0$ . We also try to set a threshold to reduce false positive steps, but find that such a threshold negatively impacts the PRM performance, as shown in Section 7. Notably, unlike previous works [79, 39, 20], which choose to supervise only up to the first incorrect step, we always supervise all steps.
For advantage-based PRMs, the quality of a certain step is determined by the improvement of $mc_{i}$ over $mc_{i-1}$ , which is analogous to the definition of the advantage function in reinforcement learning. Similar to value-based PRMs, the quality space is discretized into predefined values $\{+,=,-\}$ , meaning that the $i$ -th step $s_{i}$ results in a superior, comparable, or inferior situation.
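A minimal sketch of the two labeling schemes in Python; the `baseline` against which the first step is judged in the advantage-based variant is an illustrative assumption:

```python
from typing import List


def value_labels(mcs: List[float]) -> List[str]:
    """Value-based PRM labels: a step is correct ('+') if mc_i > 0."""
    return ['+' if mc > 0 else '-' for mc in mcs]


def advantage_labels(mcs: List[float], baseline: float = 0.0) -> List[str]:
    """Advantage-based PRM labels: judge each step by the change
    mc_i - mc_{i-1} ('+' superior, '=' comparable, '-' inferior).
    The first step is compared against `baseline` (assumed here)."""
    labels = []
    prev = baseline
    for mc in mcs:
        if mc > prev:
            labels.append('+')
        elif mc == prev:
            labels.append('=')
        else:
            labels.append('-')
        prev = mc
    return labels
```

For the solution in Figure 3 (mc = 0.75, 0.5, ..., 0.0, 0.0), the value-based labels mark the final steps incorrect, while the advantage-based labels mark the drop from 0.75 to 0.5 as a bad step.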
During the inference stage, we first compute the score of each step and then merge the step scores to obtain the response score. Specifically, the score for each step is defined as the weighted sum of the generation probabilities of the discretized score tokens. For value-based PRMs, the weights for $\{+,-\}$ are $\{1,0\}$. For advantage-based PRMs, the weights for $\{+,=,-\}$ are $\{1,0,-1\}$. Unless otherwise specified, the step scores are averaged to form the response score.
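A minimal sketch of this scoring rule, assuming the per-token generation probabilities for each step are already available as dictionaries:

```python
from typing import Dict, List


def step_score(token_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the generation probabilities of the discretized
    score tokens, e.g. weights={'+': 1, '-': 0} for value-based PRMs."""
    return sum(weights[t] * p for t, p in token_probs.items() if t in weights)


def response_score(per_step_probs: List[Dict[str, float]],
                   weights: Dict[str, float]) -> float:
    """Average the step scores to obtain the response score."""
    scores = [step_score(p, weights) for p in per_step_probs]
    return sum(scores) / len(scores)
```

With advantage-based weights `{'+': 1, '=': 0, '-': -1}`, the same functions yield signed step scores in [-1, 1].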
3.3 VisualProcessBench
| Statistic | Value |
| --- | --- |
| Total Samples | 2866 |
| - MMMU [90] | 267 |
| - MathVision [78] | 712 |
| - MathVerse [93] | 1026 |
| - DynaMath [99] | 570 |
| - WeMath [60] | 291 |
| Source Solutions | 2866 |
| - GPT-4o [58] | 870 |
| - Claude-3.5-Sonnet [4] | 865 |
| - QvQ-72B-Preview [72] | 825 |
| - InternVL2.5-78B [15] | 306 |
| Total Steps | 26950 |
| - Correct Steps | 16585 |
| - Incorrect Steps | 7691 |
| - Neutral Steps | 2674 |
| Query Word Length Quartiles | (22, 24, 50) |
| Response Word Length Quartiles | (137, 193, 552) |
| Step Word Length Quartiles | (13, 31, 67) |
| Number of Steps per Solution | 9.4 |
Table 1: Statistics of VisualProcessBench.
Definition. Each sample in our benchmark consists of a multimodal reasoning question, a step-by-step solution, and correctness annotations for each step. Considering that recent models have begun to demonstrate reflection abilities that allow them to rectify their own reasoning processes, the evaluation setting used in previous works [96, 39], which only requires the model to find the first erroneous step, may lead to false negative estimations. Therefore, our benchmark requires the model to identify all erroneous steps in the given solution rather than only the first one.
Data Source. Our benchmark focuses on multimodal reasoning tasks, collecting images and questions from existing representative multimodal reasoning benchmarks, including MMMU [90], MathVision [78], MathVerse [93], DynaMath [99], and WeMath [60]. Given these questions, we generate step-by-step solutions using leading MLLMs, including GPT-4o [58], Claude-3.5-Sonnet [4], Gemini-2.0-Flash [70], QvQ-72B-Preview [72], and InternVL2.5-78B [15]. The solutions are sampled from different MLLMs to ensure their diversity.
Step Correctness Annotation. We employ a team of human experts with at least a university degree to manually annotate the correctness of each step in the solutions. Specifically, 13 people worked for 3 days, resulting in a workload of 39 person-days. The cost per person-day is approximately 37 dollars. During the annotation process, annotators are provided with the image, question, ground truth answer, and each step of the solution. Their task is to assign each step in the solution a label of positive, negative, or neutral, as illustrated in Figure 2. A positive label indicates that the step is correct, while a negative label signifies an incorrect step. The neutral label is assigned to steps that do not involve any reasoning process or provide no additional information. To ensure the annotation quality, annotators are permitted to skip questions they do not understand. During the annotation process, our dataset is divided into 10 splits, each containing approximately 300 samples. For each split, the authors of this paper manually review about 10% of the samples. Splits with erroneous annotations are sent back for re-annotation. See Section 10 for more data examples.
Statistics. As shown in Table 1, our benchmark comprises 2866 samples. To enhance the diversity of our evaluation samples, we gather questions and solutions from a wide range of benchmarks and models while carefully regulating their distribution. The statistics of step distribution are presented in Section 8.
Metrics. In this work, we use macro F1 scores to compare model performance, aiming to mitigate the impact of the imbalanced distribution between correct and incorrect steps. Specifically, we first compute the F1 scores separately for correct and incorrect steps and then take their average to obtain the overall score.
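A minimal sketch of this metric, assuming neutral steps have already been filtered out and steps are represented as booleans (True = correct):

```python
from typing import List


def macro_f1(preds: List[bool], labels: List[bool]) -> float:
    """Macro F1 over step correctness: compute the F1 score separately
    for the correct class and the incorrect class, then average them."""
    def f1(positive: bool) -> float:
        tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
        fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
        fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    return (f1(True) + f1(False)) / 2
```

A judge that labels every step correct scores at most 0.5 under this metric, which is why it is less sensitive to the imbalance between correct and incorrect steps than plain accuracy.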
4 Experiments
In this section, we first employ VisualPRM to evaluate various MLLMs using BoN evaluation strategies in Section 4.1, demonstrating that PRMs can significantly enhance the reasoning abilities of MLLMs. Next, we evaluate our VisualPRM and other leading MLLMs on VisualProcessBench in Section 4.2. Finally, the ablation studies are presented in Section 4.3 and Section 7.
| Model | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proprietary Models | | | | | | | | |
| GPT-4o [58] | 70.7 | 60.0 | 31.2 | 40.6 | 34.5 | 45.8 | 52.8 | 47.9 |
| Gemini-2.0-Flash [70] | 69.9 | 70.4 | 43.6 | 47.8 | 42.1 | 47.4 | 52.3 | 53.4 |
| Claude-3.5-Sonnet [4] | 66.4 | 65.3 | 35.6 | 46.3 | 35.7 | 44.0 | 60.4 | 50.5 |
| Open-source Models | | | | | | | | |
| MiniCPM-V2.6-8B [89] | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| +VisualPRM | 56.8 | 65.7 | 24.7 | 35.8 | 11.2 | 31.0 | 37.4 | 37.5 |
| Δ | +7.0 | +4.9 | +1.3 | +16.9 | +1.4 | +14.6 | +9.8 | +8.0 |
| Qwen2.5-VL-7B [7] | 55.0 | 67.8 | 25.4 | 41.1 | 21.0 | 35.2 | 44.1 | 41.4 |
| +VisualPRM | 58.6 | 70.3 | 31.3 | 44.3 | 23.0 | 39.8 | 48.3 | 45.1 |
| Δ | +3.6 | +2.5 | +5.9 | +3.2 | +2.0 | +4.6 | +4.2 | +3.7 |
| InternVL2.5-8B [15] | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| +VisualPRM | 60.2 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.2 |
| Δ | +4.0 | +4.0 | +8.7 | +13.0 | +8.6 | +13.0 | +7.8 | +8.4 |
| InternVL2.5-26B [15] | 60.7 | 68.2 | 23.4 | 24.0 | 11.4 | 30.9 | 39.6 | 36.9 |
| +VisualPRM | 63.9 | 73.1 | 29.6 | 39.1 | 23.2 | 40.8 | 51.0 | 45.8 |
| Δ | +3.2 | +4.9 | +6.2 | +15.1 | +11.8 | +9.9 | +11.4 | +8.9 |
| InternVL2.5-38B [15] | 63.9 | 71.9 | 32.2 | 36.9 | 20.0 | 38.3 | 47.9 | 44.4 |
| +VisualPRM | 69.0 | 73.9 | 35.2 | 46.7 | 30.5 | 46.2 | 53.7 | 50.7 |
| Δ | +5.1 | +2.0 | +3.0 | +9.8 | +10.5 | +7.9 | +5.8 | +6.3 |
| InternVL2.5-78B [15] | 70.0 | 72.3 | 32.2 | 39.2 | 19.2 | 39.8 | 49.0 | 46.0 |
| +VisualPRM | 70.7 | 75.1 | 35.9 | 47.1 | 31.3 | 49.1 | 53.9 | 51.9 |
| Δ | +0.7 | +2.8 | +3.7 | +7.9 | +12.1 | +9.3 | +4.9 | +5.9 |
Table 2: Results on seven multimodal reasoning benchmarks. MMMU [90] is a multidisciplinary reasoning benchmark. MathVista [50], MathVision [78], MathVerse [93], DynaMath [99], and WeMath [60] are mathematics benchmarks. For MathVerse, we report the performance on Vision-Only (VO) split. LogicVista [87] is a logical reasoning benchmark. Part of the results are collected from the OpenCompass leaderboard [19]. The overall score is the average score of the above benchmarks. By using VisualPRM as the critic model, existing open-source MLLMs achieve significant improvements in reasoning ability under the Best-of-8 evaluation strategy.
4.1 Results with Best-of-N evaluation
Benchmarks. We evaluate the reasoning abilities of MLLMs across seven benchmarks, including MMMU [90], MathVista [50], MathVision [78], MathVerse [93], DynaMath [99], WeMath [60], and LogicVista [87]. The evaluation samples include subject-based, mathematical, and logical reasoning problems. We report the worst-case accuracy for DynaMath and the overall accuracy for the remaining benchmarks. For MathVerse, we report the performance on the Vision-Only split.
Settings. Unless otherwise specified, we use VisualPRM as the critic model for BoN evaluation and set $N$ to $8$ by default. The policy model is required to generate $N$ distinct step-by-step Chain-of-Thought (CoT) reasoning processes with a temperature of 0.7. The response with the highest score is then selected, and its correctness determines the final result.
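As a concrete illustration, the BoN procedure described above can be sketched as follows. Note that `policy.generate` and `prm.score` are hypothetical interfaces standing in for the policy MLLM and VisualPRM, not real APIs:

```python
def best_of_n(policy, prm, question, image, n=8, temperature=0.7):
    """Best-of-N selection sketch: sample N CoT responses from the
    policy model, score each one with the PRM, and keep the response
    with the highest score. `policy` and `prm` are assumed interfaces."""
    candidates = [
        policy.generate(question, image, temperature=temperature)
        for _ in range(n)
    ]
    scores = [prm.score(question, image, resp) for resp in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```

The selected response is then checked against the ground truth to compute benchmark accuracy.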
Results. As shown in Table 2, VisualPRM greatly enhances the reasoning abilities of MLLMs across different model scales and families. Specifically, for models with fewer than 10 billion parameters, the overall performance of InternVL2.5-8B, MiniCPM-V2.6-8B, and Qwen2.5-VL-7B improves by 8.4, 8.0, and 3.7 points, respectively, demonstrating the effectiveness of test-time scaling (TTS) across different model families. For larger models, InternVL2.5-26B, InternVL2.5-38B, and InternVL2.5-78B also achieve substantial performance gains over their counterparts without TTS, further validating the scalability and effectiveness of TTS across different model sizes.
4.2 Results on VisualProcessBench
Settings. For the evaluation of PRMs, a step is considered correct if the probability of outputting “+” exceeds that of outputting “−” by a certain threshold. For the evaluation of MLLMs, the model is prompted to analyze each step and determine its correctness, classifying it as either correct or incorrect. When computing the F1 score, we exclude steps labeled as neutral by human annotators in Section 3.3.
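The step-correctness rule above can be made concrete with a small sketch. Renormalising the softmax over only the “+” and “−” token logits is an illustrative assumption, not necessarily the paper's exact computation:

```python
import math

def step_is_correct(logit_plus, logit_minus, threshold=0.0):
    """Decide step correctness from the logits of the '+' and '-'
    tokens: the step counts as correct when the probability of '+'
    exceeds that of '-' by `threshold` (two-token softmax sketch)."""
    p_plus = math.exp(logit_plus) / (math.exp(logit_plus) + math.exp(logit_minus))
    p_minus = 1.0 - p_plus
    return (p_plus - p_minus) > threshold
```

With `threshold=0.0` this reduces to simply checking whether “+” is more likely than “−”.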
Results. As shown in Table 3, most existing MLLMs struggle to accurately assess the correctness of each step. Specifically, the overall F1 score for random guessing is 50.0, while most open-source MLLMs achieve scores close to this baseline, highlighting their limitations as critic models. We manually check the judgments of these open-source MLLMs and observe that these models tend to provide positive analysis and label most steps as correct. For example, InternVL2.5-8B achieves an F1 score of 76.8 for positive steps, while its F1 score for negative steps is only 19.2, indicating that InternVL2.5-8B rarely identifies steps as incorrect. Furthermore, compared to proprietary models, our VisualPRM demonstrates competitive performance, achieving an overall F1 score of 62.0—outperforming GPT-4o and GPT-4o-Mini, and performing on par with Gemini-2.0-Flash. Notably, our model, with only 8 billion parameters, is more efficient than these proprietary counterparts.
| Model | MMMU | MathVision | MathVerse-VO | DynaMath | WeMath | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Random Guessing | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| Proprietary Models | | | | | | |
| GPT-4o-Mini [58] | 53.6 | 58.9 | 57.1 | 56.7 | 58.5 | 57.9 |
| GPT-4o [58] | 56.3 | 60.2 | 59.7 | 59.0 | 63.3 | 60.3 |
| Gemini-2.0-Flash [61] | 58.5 | 60.1 | 62.8 | 66.7 | 58.7 | 62.3 |
| Open-source Models | | | | | | |
| MiniCPM-V2.6-8B [89] | 44.9 | 50.9 | 58.9 | 46.7 | 57.4 | 50.4 |
| LLaVA-OV-7B [35] | 45.7 | 43.0 | 42.2 | 44.7 | 52.5 | 44.4 |
| LLaVA-OV-72B [35] | 46.1 | 48.4 | 53.0 | 57.0 | 57.3 | 52.3 |
| Qwen2.5-VL-7B [7] | 53.1 | 51.8 | 47.8 | 51.3 | 54.2 | 51.0 |
| Qwen2.5-VL-72B [7] | 59.2 | 59.0 | 59.7 | 62.9 | 62.3 | 60.5 |
| InternVL2.5-8B [15] | 47.1 | 45.5 | 47.8 | 50.3 | 50.8 | 48.0 |
| InternVL2.5-26B [15] | 48.8 | 47.4 | 49.2 | 50.4 | 51.4 | 49.2 |
| InternVL2.5-38B [15] | 51.5 | 48.4 | 50.9 | 51.8 | 52.5 | 50.8 |
| InternVL2.5-78B [15] | 52.0 | 51.7 | 53.7 | 50.8 | 52.5 | 52.6 |
| VisualPRM (ours) | 58.5 | 62.1 | 61.0 | 62.7 | 61.8 | 62.0 |
Table 3: Results on VisualProcessBench. We report the macro F1 of the correct and incorrect steps. The overall score is the micro average of the score from different data sources. Our VisualPRM exhibits state-of-the-art performance among open-source models.
4.3 Ablation Studies
Effects of BoN. Here, we increase the number of response candidates sampled from InternVL2.5-8B and select the final response using Self-Consistency (SC) [86], Outcome Reward Model (ORM), and PRM. The training data for ORM are nearly identical to those used for PRM, except that all steps are concatenated into a single step and step-wise correctness annotations are converted into a single correctness label for the outcome. As shown in Figure 4, increasing the number of response candidates $N$ improves the reasoning performance of InternVL2.5-8B and MiniCPM-V2.6-8B when using SC, ORM, or PRM, with PRM yielding the most significant improvements. Specifically, when using InternVL2.5-8B as the policy model, PRM outperforms SC and ORM by 2.4 and 1.5 points, respectively, under the Best-of-8 evaluation setting. Moreover, this performance gap widens as $N$ increases, reaching 3.1 and 4.3 points when $N$ is set to $128$ . Notably, when using ORM as the critic model, although performance improves during Best-of-8 evaluation, further increasing $N$ does not lead to consistent gains for InternVL2.5-8B. For example, the Best-of-128 performance is inferior to the Best-of-64 performance. These results highlight the effectiveness of PRM in TTS.
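For reference, the Self-Consistency baseline used above reduces to a majority vote over the final answers extracted from the $N$ sampled responses. A minimal sketch, assuming the answers have already been extracted as strings:

```python
from collections import Counter

def self_consistency(answers):
    """Self-Consistency sketch: return the most frequent extracted
    answer among the N sampled responses (majority vote)."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer
```

Unlike ORM or PRM selection, this baseline needs no extra model, which is why it serves as the no-critic reference point in Figure 4.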
Effects of PRM modeling methods. Here, we compare the value-based PRM and the advantage-based PRM introduced in Section 3.2, along with different methods for aggregating step scores into a final score: averaging, or selecting the maximum or minimum value. The results are presented in Table 4. We find that value-based PRMs outperform advantage-based PRMs both in BoN evaluation and on VisualProcessBench. We attribute this to the inherent noise in our training data, which is generated through an automatic data pipeline, making it difficult to accurately determine whether a given step contributes to higher or lower expected accuracy. We also compare two training strategies: supervising all steps (i.e., w/o early stop) versus supervising only up to the first incorrect step (i.e., w. early stop) during training. Experimental results show that the former yields better performance. Regarding score aggregation, selecting the maximum value performs worse than averaging or taking the minimum value. Analyzing the generated scores reveals that most responses contain a high-scoring step, with a score close to $1$, at the beginning of the solution, likely because most erroneous steps appear in the middle of the solution; the statistics of VisualProcessBench presented in Section 8 further support this conclusion. Furthermore, averaging performs better than selecting the minimum value, likely because the latter relies on a single step's score, while averaging accounts for multiple steps and can be regarded as an ensemble, which benefits step-quality estimation.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: InternVL2.5-8B Overall Performance
### Overview
The image is a line chart comparing the overall performance of three different models: "Self-Consistency", "VisualORM-8B", and "VisualPRM-8B". The x-axis represents "Best of N", with values ranging from 1 to 128. The y-axis represents "InternVL2.5-8B Overall Performance", with values ranging from 32 to 44.
### Components/Axes
* **X-axis:** "Best of N" with markers at 1, 8, 16, 32, 64, and 128.
* **Y-axis:** "InternVL2.5-8B Overall Performance" with markers at 32, 34, 36, 38, 40, 42, and 44.
* **Legend:** Located at the bottom-right of the chart.
* Green line with square markers: "Self-Consistency"
* Red line with triangle markers: "VisualORM-8B"
* Blue line with diamond markers: "VisualPRM-8B"
### Detailed Analysis
* **Self-Consistency (Green):**
* Trend: Increases sharply from N=1 to N=8, then increases less sharply from N=8 to N=32, then remains relatively flat from N=32 to N=128.
* Data Points:
* N=1: ~32.7
* N=8: ~38.8
* N=16: ~39.7
* N=32: ~41.0
* N=64: ~40.9
* N=128: ~40.9
* **VisualORM-8B (Red):**
* Trend: Increases sharply from N=1 to N=8, then decreases slightly from N=8 to N=16, then increases from N=16 to N=64, then decreases from N=64 to N=128.
* Data Points:
* N=1: ~32.6
* N=8: ~39.6
* N=16: ~39.2
* N=32: ~39.9
* N=64: ~40.3
* N=128: ~39.5
* **VisualPRM-8B (Blue):**
* Trend: Increases sharply from N=1 to N=8, then increases steadily from N=8 to N=128.
* Data Points:
* N=1: ~32.8
* N=8: ~41.2
* N=16: ~42.0
* N=32: ~42.7
* N=64: ~43.3
* N=128: ~44.0
### Key Observations
* VisualPRM-8B consistently outperforms the other two models across all values of "Best of N".
* Self-Consistency and VisualORM-8B have similar performance at N=1, but diverge significantly as N increases.
* The performance of VisualORM-8B peaks at N=64 and then declines slightly.
* The performance of Self-Consistency plateaus after N=32.
### Interpretation
The chart demonstrates that increasing the "Best of N" parameter generally improves the overall performance of the InternVL2.5-8B model. However, the extent of improvement varies depending on the specific model configuration. VisualPRM-8B shows the most consistent and significant improvement with increasing N, suggesting it benefits the most from this parameter. The plateauing or slight decline in performance for Self-Consistency and VisualORM-8B at higher N values may indicate diminishing returns or potential overfitting.
</details>
(a)
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Chart: MiniCPM-V-2.6 Overall Performance
### Overview
The image is a line chart comparing the overall performance of three different models: Self-Consistency, VisualORM-8B, and VisualPRM-8B. The x-axis represents "Best of N," with values ranging from 1 to 128. The y-axis represents "MiniCPM-V-2.6 Overall Performance," with values ranging from 29 to 41.
### Components/Axes
* **X-axis Title:** Best of N
* **X-axis Scale:** 1, 8, 16, 32, 64, 128
* **Y-axis Title:** MiniCPM-V-2.6 Overall Performance
* **Y-axis Scale:** 29, 31, 33, 35, 37, 39, 41
* **Legend:** Located in the bottom-right corner.
* **Green Line with Square Markers:** Self-Consistency
* **Red Line with Triangle Markers:** VisualORM-8B
* **Blue Line with Diamond Markers:** VisualPRM-8B
### Detailed Analysis
* **Self-Consistency (Green Line):** The line slopes upward, indicating an increase in performance as "Best of N" increases.
* At Best of N = 1, Performance ≈ 29.5
* At Best of N = 8, Performance ≈ 33
* At Best of N = 16, Performance ≈ 34
* At Best of N = 32, Performance ≈ 34.5
* At Best of N = 64, Performance ≈ 35
* At Best of N = 128, Performance ≈ 35.5
* **VisualORM-8B (Red Line):** The line slopes upward, indicating an increase in performance as "Best of N" increases.
* At Best of N = 1, Performance ≈ 29.5
* At Best of N = 8, Performance ≈ 35.5
* At Best of N = 16, Performance ≈ 36
* At Best of N = 32, Performance ≈ 37
* At Best of N = 64, Performance ≈ 37.5
* At Best of N = 128, Performance ≈ 38.5
* **VisualPRM-8B (Blue Line):** The line slopes upward, indicating an increase in performance as "Best of N" increases.
* At Best of N = 1, Performance ≈ 29.5
* At Best of N = 8, Performance ≈ 37.5
* At Best of N = 16, Performance ≈ 39
* At Best of N = 32, Performance ≈ 39.5
* At Best of N = 64, Performance ≈ 39.5
* At Best of N = 128, Performance ≈ 40
### Key Observations
* VisualPRM-8B consistently outperforms the other two models across all values of "Best of N."
* Self-Consistency has the lowest performance among the three models.
* The performance of all three models increases as "Best of N" increases, but the rate of increase varies. VisualPRM-8B shows a steeper initial increase compared to the other two.
### Interpretation
The chart demonstrates the impact of increasing the "Best of N" parameter on the overall performance of three different models. VisualPRM-8B appears to benefit the most from increasing "Best of N," suggesting it is more effective at leveraging multiple samples to improve performance. Self-Consistency shows a more gradual improvement, indicating it may be less sensitive to the "Best of N" parameter. VisualORM-8B falls in between the other two models in terms of performance and sensitivity to "Best of N." The data suggests that VisualPRM-8B is the most effective model overall, while Self-Consistency may be a more stable but less performant option.
</details>
(b)
Figure 4: Overall Best-of-N results across seven multimodal reasoning benchmarks with different policy and critic models. VisualPRM consistently enhances reasoning performance of both InternVL2.5-8B and MiniCPM-V as $N$ increases and outperforms the improvement introduced by Self-Consistency and ORM, highlighting the effectiveness of PRM in Test-Time Scaling.
| Critic Model | BoN | VL-ProcessBench |
| --- | --- | --- |
| Pass@1 | 32.8 | - |
| Random Guessing | 33.0 | 50.0 |
| InternVL2.5-8B | 33.2 | 48.0 |
| InternVL2.5-78B | 34.2 | 52.6 |
| Advantage-based PRM | | |
| +Min | 36.8 | 55.0 |
| +Max | 36.9 | 55.0 |
| +Average | 37.4 | 55.0 |
| Value (w. early stop) | | |
| +Min | 40.3 | 61.6 |
| +Max | 37.0 | 61.6 |
| +Average | 40.6 | 61.6 |
| Value (w/o early stop) | | |
| +Min | 40.4 | 62.0 |
| +Max | 35.9 | 62.0 |
| +Average | 41.1 | 62.0 |
Table 4: Comparison of different critic models and score aggregation methods. Note that score aggregation methods do not affect performance on VisualProcessBench, which focuses on step correctness judgment. We find that supervising all steps (w/o early stop) during training performs slightly better than supervising only up to the first incorrect step (w. early stop).
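The three aggregation strategies compared in Table 4 can be sketched in a few lines. This is an illustrative sketch of the averaging/max/min variants, not the paper's exact implementation:

```python
def aggregate_step_scores(step_scores, method="average"):
    """Aggregate per-step PRM scores into one response-level score
    using one of the three strategies compared in Table 4."""
    if not step_scores:
        raise ValueError("expected at least one step score")
    if method == "average":
        return sum(step_scores) / len(step_scores)
    if method == "max":
        return max(step_scores)
    if method == "min":
        return min(step_scores)
    raise ValueError(f"unknown aggregation method: {method}")
```

Because early steps tend to score near 1, `max` is dominated by a single confident step, whereas `average` pools evidence from the whole solution.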
MLLM-as-a-Judge. Existing MLLMs can be prompted to serve as critic models. However, as shown in Table 4, the InternVL2.5 series yields only marginal improvements in BoN performance. Upon analyzing the generated scores, we find that these models assign similar scores to most solutions. Consistent with our observations in Section 4.2, the InternVL2.5 series tends to generate positive judgments for most steps, which hinders its ability to distinguish and select the truly superior response. Beyond their limited effectiveness as critic models, MLLMs also suffer from high inference latency: they generate judgments for each step in an autoregressive manner, which is time-consuming. In contrast, our VisualPRM computes scores for all steps in a single forward pass by using a “+” as a placeholder for model responses and interpreting its generation probability as the step score.
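The single-forward-pass scoring scheme can be illustrated with a toy sketch, where `token_probs` stands in for the next-token distributions produced by one forward pass over a prompt in which every step is followed by a “+” placeholder. All names and shapes are assumptions for illustration, not the paper's code:

```python
def score_steps_single_pass(token_probs, placeholder_positions, plus_token_id):
    """Read off one score per step from a SINGLE forward pass.

    `token_probs[i]` maps token ids to next-token probabilities at
    sequence position i; `placeholder_positions` are the indices of the
    '+' placeholders that follow each step. The probability assigned to
    the '+' token at each placeholder is interpreted as the step score."""
    return [token_probs[pos][plus_token_id] for pos in placeholder_positions]
```

This avoids generating a judgment autoregressively for every step, which is the source of the latency gap noted above.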
Results on text-only performance. To assess the effectiveness of VisualPRM on text-only inputs, we evaluate the Qwen2.5 series [88] and InternVL2.5 series [15] on three text reasoning benchmarks under BoN evaluation settings: GSM8K [18], MATH-500 [25], and GPQA-Diamond [62]. We report accuracy as the evaluation metric for these benchmarks. As shown in Table 5, our model enhances the text reasoning abilities of both the Qwen2.5 series and the InternVL2.5 series. Specifically, Qwen2.5-7B achieves improvements of 6.1 and 5.0 points on MATH-500 and GPQA-Diamond, respectively. Similarly, Qwen2.5-72B demonstrates gains of 2.1 and 6.6 points on these benchmarks. For the InternVL2.5 series, InternVL2.5-8B, InternVL2.5-38B, and InternVL2.5-78B achieve improvements of 9.4 and 5.0, 4.6 and 8.1, and 7.4 and 3.5 points, respectively, on MATH-500 and GPQA-Diamond. These results demonstrate the effectiveness of our VisualPRM in text-only scenarios.
| Model | GSM8K | MATH-500 | GPQA |
| --- | --- | --- | --- |
| Large Language Models | | | |
| Qwen2.5-7B [88] | 91.6 | 75.5 | 36.4 |
| +VisualPRM | 94.5 | 81.6 | 41.4 |
| | +2.9 | +6.1 | +5.0 |
| Qwen2.5-32B [88] | 95.9 | 83.1 | 49.5 |
| +VisualPRM | 96.1 | 85.4 | 53.5 |
| | +0.2 | +2.3 | +4.0 |
| Qwen2.5-72B [88] | 95.8 | 83.1 | 49.0 |
| +VisualPRM | 96.5 | 85.2 | 55.6 |
| | +0.7 | +2.1 | +6.6 |
| Multimodal Large Language Models | | | |
| InternVL2.5-8B [15] | 81.9 | 56.8 | 29.3 |
| +VisualPRM | 82.9 | 66.2 | 34.3 |
| | +1.1 | +9.4 | +5.0 |
| InternVL2.5-38B [15] | 94.6 | 75.4 | 44.4 |
| +VisualPRM | 95.6 | 80.0 | 52.5 |
| | +1.0 | +4.6 | +8.1 |
| InternVL2.5-78B [15] | 93.6 | 70.4 | 47.5 |
| +VisualPRM | 94.5 | 77.8 | 51.0 |
| | +0.9 | +7.4 | +3.5 |
Table 5: Results on text reasoning benchmarks. GSM8K and MATH-500 are mathematics benchmarks, while GPQA is a multidisciplinary reasoning benchmark. For GPQA, we report the performance on the Diamond split. Our VisualPRM enhances the text reasoning abilities of both LLMs and MLLMs under the Best-of-8 evaluation setting.
5 Conclusion
In this work, we construct VisualPRM400K, a dataset comprising about 400K multimodal process supervision data. Building upon this dataset, we develop VisualPRM, an advanced multimodal Process Reward Model (PRM) capable of estimating the value score of each step during the reasoning process. With the Best-of-N (BoN) evaluation strategies, our model improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation, highlighting the effectiveness of PRMs in Test-Time Scaling. To further facilitate the development of multimodal critic models, we construct VisualProcessBench, a benchmark designed to measure the abilities of PRMs and MLLMs to detect incorrect steps in multimodal reasoning tasks. Evaluation results show that existing open-source MLLMs struggle to effectively judge the correctness of each step. We hope that our work can inspire more future research and contribute to the development of MLLMs.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ahmadian et al. [2024] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.
- Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NIPS, 35:23716–23736, 2022.
- Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com, 2024.
- Bai et al. [2023a] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
- Bai et al. [2023b] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.
- Bai et al. [2025] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
- Biten et al. [2019] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NIPS, 2020.
- Cai et al. [2024] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
- Cao and Xiao [2022] Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In COLING, pages 1511–1520, 2022.
- Chang et al. [2022] Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022.
- Chen et al. [2024a] Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. arXiv preprint arXiv:2405.16473, 2024a.
- Chen et al. [2023] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
- Chen et al. [2024b] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024b.
- Chen et al. [2024c] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024c.
- Clark and Gardner [2018] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In ACL, pages 845–855, 2018.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Contributors [2023] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- Dong et al. [2024] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Gao et al. [2023] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023.
- Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Hu [2025] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025.
- Huang et al. [2019] Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019.
- Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019.
- Kafle et al. [2018] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, pages 5648–5656, 2018.
- Kazemi et al. [2023] Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023.
- Kembhavi et al. [2016] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016.
- Lai et al. [2024] Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024.
- Lambert et al. [2024] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024.
- Laurençon et al. [2024] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. NIPS, 36, 2024.
- Li et al. [2024a] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a.
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.
- Li et al. [2024b] Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vlrewardbench: A challenging benchmark for vision-language generative reward models. arXiv preprint arXiv:2411.17451, 2024b.
- Li et al. [2024c] Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024c.
- Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- Lindström and Abraham [2022] Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022.
- Liu et al. [2023a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NIPS, 36, 2023a.
- Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024a.
- Liu et al. [2024b] Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, et al. Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity. arXiv preprint arXiv:2407.15838, 2024b.
- Liu et al. [2024c] Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Rm-bench: Benchmarking reward models of language models with subtlety and style. arXiv preprint arXiv:2410.16184, 2024c.
- Liu et al. [2023b] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 2023b.
- Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Lu et al. [2021a] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021a.
- Lu et al. [2021b] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021b.
- Lu et al. [2022] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NIPS, 35:2507–2521, 2022.
- Lu et al. [2023] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- Luo et al. [2024] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2024.
- Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019.
- Masry et al. [2022] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pages 2263–2279, 2022.
- Mathew et al. [2022] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022.
- McAleese et al. [2024] Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike. Llm critics help catch llm bugs. arXiv preprint arXiv:2407.00215, 2024.
- Mishra et al. [2019] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019.
- Muennighoff et al. [2025] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- OpenAI [2024] OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/, 2024.
- Pang et al. [2024] Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733, 2024.
- Qiao et al. [2024] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024.
- Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Rein et al. [2024] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
- Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NIPS, 35:25278–25294, 2022.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Seo et al. [2015] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In EMNLP, pages 1466–1476, 2015.
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, pages 8317–8326, 2019.
- Snell et al. [2024] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
- Song et al. [2025] Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. Prmbench: A fine-grained and challenging benchmark for process-level reward models. arXiv preprint arXiv:2501.03124, 2025.
- Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Team [2023] InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
- Team [2024] Qwen Team. Qvq: To see the world with wisdom, 2024.
- Tian et al. [2024] Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, et al. Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208, 2024.
- Toshniwal et al. [2025] Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Gitman. Openmathinstruct-1: A 1.8 million math instruction tuning dataset. Advances in Neural Information Processing Systems, 37:34737–34774, 2025.
- Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Wang et al. [2024a] Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, 2024a.
- Wang et al. [2024b] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804, 2024b.
- Wang et al. [2023a] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2023a.
- Wang et al. [2023b] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. NIPS, 36, 2023b.
- Wang et al. [2023c] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023c.
- Wang et al. [2024c] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024c.
- Wang et al. [2024d] Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. arXiv preprint arXiv:2402.19474, 2024d.
- Wang et al. [2024e] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. In ICLR, 2024e.
- Wang et al. [2024f] Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, et al. Needle in a multimodal haystack. arXiv preprint arXiv:2406.07230, 2024f.
- Wang et al. [2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Xiao et al. [2024] Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024.
- Yang et al. [2024] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- Yao et al. [2024] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
- Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024.
- Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
- Zhang et al. [2024a] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240, 2024a.
- Zhang et al. [2024b] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024b.
- Zhang et al. [2025] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025.
- Zhao et al. [2025] Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, et al. Omnialign-v: Towards enhanced alignment of mllms with human preference. arXiv preprint arXiv:2502.18411, 2025.
- Zheng et al. [2024] Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559, 2024.
- Zhou et al. [2024] Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, et al. Rmb: Comprehensively benchmarking reward models in llm alignment. arXiv preprint arXiv:2410.09893, 2024.
- Zhu et al. [2024] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. NIPS, 36, 2024.
- Zou et al. [2024] Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024.
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Supplementary Material
6 Training Hyper-parameters
During the training of VisualPRM, we enable the data-packing strategy [15]. We employ the AdamW optimizer [46] with $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $0.05$. The learning rate is initialized to $1e\text{-}5$ and follows a linear warmup over the first 5% of training steps, after which it decays under a cosine schedule to a minimum learning rate of 0. We train for one epoch.
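As a concrete reference, the schedule described above (linear warmup over the first 5% of steps, then cosine decay from a peak of 1e-5 down to 0) can be sketched as follows. This is an illustrative reimplementation, not the actual training code:

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5,
               warmup_frac: float = 0.05,
               min_lr: float = 0.0) -> float:
    """Linear warmup over the first `warmup_frac` of steps,
    then cosine decay from `peak_lr` down to `min_lr`."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from near zero up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With 1,000 total steps, the learning rate reaches its peak at step 50 (5% of training) and decays to nearly zero by the final step.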
7 More Ablation Studies
7.1 Effects of Training Hyper-parameters
When training our value-based Process Reward Model (PRM) on VisualPRM400K, we define a step as correct if its expected accuracy exceeds 0. In this section, we analyze the impact of varying the expected-accuracy threshold used to determine step correctness. As shown in Table 8, increasing the threshold degrades both Best-of-8 evaluation performance and VisualProcessBench scores. These results are consistent with the observations for Qwen2.5-Math-PRM [94]. We therefore suggest setting the threshold to $0$ during training.
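Concretely, binarizing per-step expected accuracies under a threshold can be sketched as below. This is a minimal illustration; the function name and the '+'/'-' label encoding are ours, not the paper's:

```python
def step_labels(mc_scores, threshold=0.0):
    """Mark a step as correct ('+') when its Monte Carlo expected
    accuracy exceeds the threshold, otherwise incorrect ('-')."""
    return ["+" if mc > threshold else "-" for mc in mc_scores]

# With the default threshold of 0, any step that reaches the correct
# answer in at least one rollout is treated as correct.
labels = step_labels([0.5625, 0.0625, 0.0])
```

Raising the threshold (e.g., to 0.125) flips borderline steps such as 0.0625 into negatives, which matches the degradation trend reported in Table 8.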
7.2 Effects of Generation Hyper-parameters
In this section, we analyze the impact of the generation temperature on Best-of-8 evaluation. As shown in Table 8, as the temperature increases from 0.3 to 1.3, the overall performance of InternVL2.5-8B first improves and then declines. We attribute this to a trade-off between response diversity and accuracy: at low temperature (e.g., 0.3), the generated responses lack diversity, which limits the upper bound of Best-of-N performance, whereas at high temperature (e.g., 1.3), the responses become more random, reducing the accuracy of individual responses and again lowering the overall performance ceiling.
7.3 Effects of Best-of-N evaluation
In this section, we present Best-of-N evaluation results as $N$ increases, as shown in Table 6 and Table 7. Our results indicate that as $N$ grows, VisualPRM consistently enhances the reasoning abilities of InternVL2.5-8B [15] and MiniCPM-V2.6 [89]. Specifically, as $N$ increases from $8$ to $128$, the overall performance of InternVL2.5-8B improves from 41.2 to 44.0, while that of MiniCPM-V2.6 improves from 37.5 to 40.4, demonstrating the scalability of test-time scaling for MLLMs.
| Critic Model | N | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-Consistency | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 58.0 | 65.9 | 23.4 | 30.5 | 18.4 | 32.7 | 43.0 | 38.8 |
| | 16 | 58.6 | 65.8 | 26.3 | 32.1 | 19.4 | 33.0 | 43.4 | 39.8 |
| | 32 | 60.4 | 66.7 | 28.0 | 32.6 | 20.8 | 34.1 | 44.7 | 41.0 |
| | 64 | 59.7 | 66.7 | 26.6 | 33.2 | 20.6 | 35.8 | 43.4 | 40.9 |
| | 128 | 60.6 | 67.4 | 25.7 | 32.0 | 22.6 | 34.7 | 43.2 | 40.9 |
| VisualORM | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 60.2 | 67.0 | 25.3 | 32.5 | 16.4 | 35.0 | 41.8 | 39.7 |
| | 16 | 58.3 | 67.7 | 27.0 | 33.6 | 16.6 | 33.1 | 39.1 | 39.3 |
| | 32 | 58.6 | 67.9 | 26.3 | 33.6 | 17.4 | 34.4 | 42.1 | 40.0 |
| | 64 | 59.4 | 66.8 | 28.6 | 33.9 | 17.8 | 34.1 | 42.3 | 40.4 |
| | 128 | 59.4 | 66.6 | 28.3 | 33.5 | 16.8 | 32.3 | 40.9 | 39.7 |
| VisualPRM | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 60.2 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.2 |
| | 16 | 60.2 | 69.9 | 27.3 | 36.4 | 19.0 | 38.8 | 42.5 | 42.0 |
| | 32 | 60.3 | 70.4 | 29.6 | 37.8 | 17.2 | 40.3 | 43.4 | 42.7 |
| | 64 | 61.4 | 69.6 | 30.6 | 38.2 | 18.8 | 40.2 | 45.4 | 43.5 |
| | 128 | 61.7 | 70.8 | 30.3 | 39.3 | 19.4 | 40.9 | 45.4 | 44.0 |
Table 6: Overall Best-of-N results of InternVL2.5-8B across seven multimodal reasoning benchmarks with different critic models.
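The Self-Consistency rows above correspond to majority voting over the final answers of the N sampled responses, which can be sketched as follows (a minimal illustration; ties are broken by first occurrence):

```python
from collections import Counter

def self_consistency(final_answers):
    """Self-Consistency baseline: return the most frequent final
    answer among the N sampled responses (majority voting)."""
    return Counter(final_answers).most_common(1)[0][0]

# Three hypothetical sampled answers; the majority answer wins.
majority = self_consistency(["64.15", "64.15", "100.00"])
```

Unlike the critic-based methods, this baseline needs no reward model, but it only works for tasks with discrete, comparable final answers.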
| Critic Model | N | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-Consistency | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 51.8 | 58.9 | 21.7 | 31.5 | 10.0 | 22.6 | 35.6 | 33.2 |
| | 16 | 51.7 | 60.2 | 21.7 | 31.5 | 11.6 | 25.7 | 35.3 | 34.0 |
| | 32 | 52.2 | 60.1 | 24.3 | 33.1 | 11.4 | 24.3 | 36.0 | 34.5 |
| | 64 | 51.7 | 61.0 | 23.4 | 34.8 | 12.8 | 25.8 | 35.3 | 35.0 |
| | 128 | 53.2 | 61.7 | 25.7 | 33.5 | 13.0 | 25.6 | 35.6 | 35.5 |
| VisualORM | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 55.7 | 66.0 | 22.0 | 33.5 | 10.2 | 24.1 | 38.9 | 35.8 |
| | 16 | 56.4 | 65.3 | 24.0 | 32.1 | 10.4 | 27.3 | 36.5 | 36.0 |
| | 32 | 58.8 | 64.8 | 19.7 | 35.7 | 12.0 | 29.4 | 38.5 | 37.0 |
| | 64 | 58.2 | 67.3 | 22.7 | 35.5 | 11.0 | 30.1 | 37.6 | 37.5 |
| | 128 | 58.2 | 66.5 | 25.3 | 35.4 | 11.6 | 30.0 | 40.7 | 38.2 |
| VisualPRM | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 56.8 | 65.7 | 24.7 | 35.8 | 11.2 | 31.0 | 37.4 | 37.5 |
| | 16 | 58.8 | 68.6 | 24.0 | 37.3 | 12.4 | 32.7 | 39.8 | 39.1 |
| | 32 | 57.8 | 68.4 | 26.6 | 38.5 | 13.4 | 35.3 | 39.1 | 39.9 |
| | 64 | 58.6 | 69.4 | 25.3 | 39.7 | 12.2 | 38.2 | 36.9 | 40.0 |
| | 128 | 59.3 | 69.4 | 25.3 | 39.1 | 14.4 | 37.0 | 38.3 | 40.4 |
Table 7: Overall Best-of-N results of MiniCPM-V2.6 across seven multimodal reasoning benchmarks with different critic models.
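The Best-of-N procedure used with the critic models in Tables 6 and 7 can be sketched as follows. Note that aggregating per-step PRM scores by averaging is an illustrative assumption here; this section does not fix the aggregation rule:

```python
def response_score(step_scores):
    """Aggregate per-step PRM scores into one response-level score
    (averaging is an illustrative choice, not the paper's stated rule)."""
    return sum(step_scores) / len(step_scores)

def best_of_n(candidates, score_fn):
    """Best-of-N selection: keep the candidate the critic scores highest."""
    return max(candidates, key=score_fn)

# Two hypothetical candidate responses with per-step PRM scores.
cands = {"resp_a": [0.9, 0.8, 0.7], "resp_b": [0.95, 0.2, 0.3]}
best = best_of_n(list(cands), lambda r: response_score(cands[r]))
```

An Outcome Reward Model fits the same interface with a single score per response, which is why the two critics are directly comparable in the tables above.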
| Setting | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall | VisualProcessBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Threshold=0.00 | 59.3 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.1 | 62.0 |
| Threshold=0.0625 | 59.7 | 66.8 | 24.7 | 36.7 | 18.4 | 35.0 | 41.8 | 40.4 | 61.0 |
| Threshold=0.125 | 58.0 | 67.9 | 27.6 | 35.4 | 17.4 | 35.3 | 41.6 | 40.5 | 60.7 |
| Threshold=0.25 | 58.6 | 67.6 | 25.7 | 33.6 | 16.8 | 36.0 | 41.4 | 40.0 | 60.2 |
| Temperature=0.3 | 59.7 | 69.4 | 26.0 | 32.6 | 17.6 | 35.5 | 42.7 | 40.5 | - |
| Temperature=0.7 | 59.3 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.1 | - |
| Temperature=1.0 | 61.7 | 67.2 | 27.3 | 35.8 | 16.6 | 34.2 | 43.2 | 40.9 | - |
| Temperature=1.3 | 57.9 | 66.1 | 25.0 | 32.1 | 16.8 | 31.9 | 40.5 | 38.6 | - |
Table 8: Ablation studies on the effects of the expected-accuracy threshold and the generation temperature.
8 More Statistics for VisualProcessBench
<details>
<summary>x6.png Details</summary>

### Visual Description
## Step Plot: Step-Index Distribution across Benchmarks
### Overview
The image is a step plot showing the distribution of step indices across the five benchmark subsets of VisualProcessBench (MathVision, MathVerse, MMMU, DynaMath, and WeMath). The y-axis represents the percentage of steps, and the x-axis represents the step index. The plot shows how the share of steps at each index varies across benchmarks.
### Components/Axes
* **X-axis:** Step Index, ranging from 0 to 30. Axis markers are present at intervals of 5 (5, 10, 15, 20, 25, 30).
* **Y-axis:** Steps (%), ranging from 0 to 5. Axis markers are present at intervals of 1 (1, 2, 3, 4, 5).
* **Legend (Top-Right):**
* MathVision (Gray)
* MathVerse (Red)
* MMMU (Blue)
* DynaMath (Green)
* WeMath (Purple)
### Detailed Analysis or ### Content Details
**MathVision (Gray):**
* Trend: Initially increases, peaks around step index 14, then decreases.
* Step Index 4: ~1.7%
* Step Index 6: ~2.3%
* Step Index 8: ~2.4%
* Step Index 10: ~1.6%
* Step Index 12: ~2.1%
* Step Index 14: ~5.0%
* Step Index 16: ~0.3%
* Step Index 18: ~0.3%
* Step Index 20: ~0.2%
* Step Index 22: ~0.1%
* Step Index 24: ~0.1%
* Step Index 26: ~0.0%
* Step Index 28: ~0.0%
* Step Index 30: ~0.0%
**MathVerse (Red):**
* Trend: Initially increases sharply, peaks around step index 7, then decreases.
* Step Index 4: ~1.9%
* Step Index 6: ~3.9%
* Step Index 8: ~5.3%
* Step Index 10: ~4.2%
* Step Index 12: ~2.7%
* Step Index 14: ~1.4%
* Step Index 16: ~0.3%
* Step Index 18: ~0.3%
* Step Index 20: ~0.2%
* Step Index 22: ~0.2%
* Step Index 24: ~0.1%
* Step Index 26: ~0.1%
* Step Index 28: ~0.0%
* Step Index 30: ~0.1%
**MMMU (Blue):**
* Trend: Generally low, with a small peak around step index 12, then decreases.
* Step Index 4: ~0.8%
* Step Index 6: ~0.7%
* Step Index 8: ~1.2%
* Step Index 10: ~0.5%
* Step Index 12: ~1.2%
* Step Index 14: ~0.8%
* Step Index 16: ~0.1%
* Step Index 18: ~0.1%
* Step Index 20: ~0.1%
* Step Index 22: ~0.1%
* Step Index 24: ~0.1%
* Step Index 26: ~0.0%
* Step Index 28: ~0.0%
* Step Index 30: ~0.0%
**DynaMath (Green):**
* Trend: Relatively stable initially, peaks around step index 7, then decreases.
* Step Index 4: ~2.5%
* Step Index 6: ~2.3%
* Step Index 8: ~2.5%
* Step Index 10: ~1.5%
* Step Index 12: ~0.7%
* Step Index 14: ~2.7%
* Step Index 16: ~0.2%
* Step Index 18: ~0.1%
* Step Index 20: ~0.1%
* Step Index 22: ~0.0%
* Step Index 24: ~0.0%
* Step Index 26: ~0.0%
* Step Index 28: ~0.0%
* Step Index 30: ~0.0%
**WeMath (Purple):**
* Trend: Increases initially, peaks around step index 6, then decreases.
* Step Index 4: ~0.8%
* Step Index 6: ~1.6%
* Step Index 8: ~1.3%
* Step Index 10: ~0.5%
* Step Index 12: ~0.3%
* Step Index 14: ~0.2%
* Step Index 16: ~0.1%
* Step Index 18: ~0.1%
* Step Index 20: ~0.1%
* Step Index 22: ~0.0%
* Step Index 24: ~0.0%
* Step Index 26: ~0.0%
* Step Index 28: ~0.0%
* Step Index 30: ~0.0%
### Key Observations
* MathVerse has the highest percentage of steps at its peak, reaching approximately 5.3% at step index 8.
* MathVision has a later peak at step index 14, reaching approximately 5.0%.
* MMMU, DynaMath, and WeMath generally have lower percentages of steps compared to MathVision and MathVerse.
* All benchmarks show a decreasing share of steps as the step index increases beyond their respective peaks.
* The share of steps for all benchmarks converges toward 0 as the step index approaches 30.
### Interpretation
The step plot shows, for each benchmark subset of VisualProcessBench, the fraction of all annotated steps that fall at each step index. MathVerse and MathVision solutions contribute relatively more steps at early-to-middle indices, while MMMU, DynaMath, and WeMath contribute fewer. The convergence toward 0 at high indices reflects that few solutions contain more than roughly 15 steps.
</details>
(a)
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Incorrect Steps vs. Step Index
### Overview
The image is a line chart showing the percentage of incorrect steps at each step index for the five benchmark subsets of VisualProcessBench (MathVision, MathVerse, MMMU, DynaMath, and WeMath), with step indices ranging from 0 to 30. The chart displays how the per-index error rate varies across step positions.
### Components/Axes
* **X-axis:** "Step Index", ranging from 0 to 30 in increments of 5.
* **Y-axis:** "Incorrect Steps (%)", ranging from 0 to 100 in increments of 20.
* **Legend:** Located in the top-left corner, mapping colors to the benchmark subsets:
* Gray: MathVision
* Red: MathVerse
* Blue: MMMU
* Green: DynaMath
* Purple: WeMath
* **Gridlines:** Present in the background, aiding in value estimation.
### Detailed Analysis
* **MathVision (Gray):**
* Trend: Initially increases, peaks around step index 12, then decreases and stabilizes at 0 after step index 26.
* Data Points:
* Step 0: ~8%
* Step 5: ~36%
* Step 10: ~48%
* Step 12: ~53%
* Step 15: ~32%
* Step 20: ~22%
* Step 25: ~33%
* Step 27: ~0%
* **MathVerse (Red):**
* Trend: Increases initially, peaks around step index 12, then decreases and stabilizes at 0 after step index 26.
* Data Points:
* Step 0: ~5%
* Step 5: ~41%
* Step 10: ~40%
* Step 12: ~44%
* Step 15: ~28%
* Step 20: ~25%
* Step 25: ~58%
* Step 27: ~0%
* **MMMU (Blue):**
* Trend: Increases initially, peaks around step index 25, then stabilizes at 0 after step index 26.
* Data Points:
* Step 0: ~10%
* Step 5: ~30%
* Step 10: ~35%
* Step 12: ~30%
* Step 15: ~23%
* Step 20: ~33%
* Step 25: ~100%
* Step 27: ~0%
* **DynaMath (Green):**
* Trend: Increases initially, peaks around step index 22, then stabilizes at 0 after step index 26.
* Data Points:
* Step 0: ~15%
* Step 5: ~25%
* Step 10: ~30%
* Step 12: ~28%
* Step 15: ~25%
* Step 20: ~67%
* Step 25: ~2%
* Step 27: ~0%
* **WeMath (Purple):**
* Trend: Increases initially, peaks around step index 12, then decreases and stabilizes at 0 after step index 26.
* Data Points:
* Step 0: ~3%
* Step 5: ~20%
* Step 10: ~38%
* Step 12: ~39%
* Step 15: ~15%
* Step 20: ~18%
* Step 25: ~2%
* Step 27: ~0%
### Key Observations
* All methods eventually reach 0% incorrect steps by step index 27.
* MMMU has a spike in incorrect steps at step index 25, reaching 100%.
* WeMath generally has the lowest percentage of incorrect steps in the initial phase.
* MathVision and MathVerse have similar trends, peaking around step index 12.
* DynaMath shows a significant increase in incorrect steps around step index 20.
### Interpretation
The chart shows how the annotated error rate varies with step position in each benchmark subset. Error rates generally rise with the step index, consistent with the observation that problems requiring more reasoning steps tend to be harder.
The extreme values at high step indices (e.g., MMMU reaching 100% at index 25 and all curves dropping to 0% after index 26) arise because very few steps occur at those positions, so the estimates there are dominated by statistical fluctuations rather than meaningful trends.
</details>
(b)
Figure 5: Step Distribution of VisualProcessBench. The X-axis represents the step index. (a) The Y-axis indicates the proportion of steps at each index relative to the total number of steps, reflecting the distribution of step positions in solutions. (b) The Y-axis represents the error rate of steps at each index, showing the likelihood of errors occurring at different step positions.
The step-distribution statistics of VisualProcessBench are presented in Figure 5. We observe that most solutions consist of fewer than 15 steps, and among these, most contain about 7 or 13 steps. Regarding step correctness, the error rate is lower in the first three steps and then increases with the step index. We attribute this to the fact that problems requiring more reasoning steps tend to be more challenging, leading to a gradual rise in step error rates. Notably, from step 15 onward the error rate fluctuates sharply; the number of steps in this range is relatively small, resulting in significant statistical fluctuations.
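The two curves in Figure 5 can be computed directly from the step-level annotations. Below is a minimal sketch; the input format (a list of per-step correctness flags per solution) is our assumption:

```python
from collections import Counter

def step_statistics(solutions):
    """For each step index, compute (a) the share of all steps that
    occur at that index and (b) the error rate among those steps."""
    total = sum(len(sol) for sol in solutions)
    counts, errors = Counter(), Counter()
    for sol in solutions:
        for i, is_correct in enumerate(sol):
            counts[i] += 1
            if not is_correct:
                errors[i] += 1
    share = {i: counts[i] / total for i in counts}           # Figure 5(a)
    error_rate = {i: errors[i] / counts[i] for i in counts}  # Figure 5(b)
    return share, error_rate
```

Because few solutions reach high step indices, the denominator `counts[i]` becomes tiny there, which is exactly why the curves in Figure 5(b) fluctuate so strongly beyond step 15.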
9 More Data Examples in VisualPRM400K
In this section, we provide more data examples from VisualPRM400K in Figure 6, spanning different domains, including general visual question answering (VQA) [23, 28, 52, 48], science [31, 13, 49], charts [53, 29, 12], mathematics [40, 65, 11, 47, 30, 22], OCR [54, 67, 8, 27, 56], and documents [17].
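The per-step `mc` values visible in these examples (multiples of 1/16, e.g., 0.5625 and 0.8125) are expected accuracies estimated by Monte Carlo rollouts. A sketch of the estimation, where `rollout_fn` and `is_correct` are hypothetical interfaces standing in for the actual sampling and answer-checking pipeline:

```python
def expected_accuracy(prefix_steps, rollout_fn, is_correct, k=16):
    """Estimate a step's expected accuracy (the `mc` value) as the
    fraction of k completions, sampled from the solution prefix ending
    at this step, that reach the correct final answer. With k=16 the
    estimate is always a multiple of 1/16."""
    hits = sum(is_correct(rollout_fn(prefix_steps)) for _ in range(k))
    return hits / k
```

A step whose `mc` is 0.0 never led any sampled completion to the correct answer, which is why near-zero values flag the erroneous final steps in these examples.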
10 More Data Examples in VisualProcessBench
In this section, we provide more data examples in VisualProcessBench from different data sources. Specifically, we randomly choose three examples from our benchmark and visualize them in Figure 7. Additionally, in Figure 8, we provide an example where the model initially generates an incorrect answer and then autonomously corrects it.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Receipt Analysis and Solution
### Overview
The image presents a receipt from "Geoventure Taste Of The World Sdn Bhd" along with a question asking for the total amount of the receipt and a step-by-step solution to determine the answer.
### Components/Axes
**Receipt Header:**
* Company Name: Geoventure Taste Of The World Sdn Bhd
* Company ID: 965857-A
* Address: Lot 38, 39 101 Boulevard, Jalan Kenari 5, Bandar Puchong Jaya, 47170 Puchong, Selangor
* Phone: 03-8079 0255
* GST REG NO: 000136577024
* RESERVATION NO: 010-7745055
* Invoice Type: SIMPLIFIED TAX INVOICE
**Receipt Details:**
* Invoice No: 581355
* Cashier: NINI
* POS No: 1
* Pax: 0
* Date: 5/3/2018 22:52:33
* Table No: C38
**Item Table:**
* Columns: ITEM, QTY, U/P (RM), DISC(%), Amt+GST (RM)
* Item 1: S Tiger Bucket GTG02, QTY: 1, U/P (RM): 55.00, DISC(%): 0+0, Amt+GST (RM): 58.30
* Service Charge: Service Chrg (Incl GST 6%), RM 5.83
* Total Sales: Total Sales (Incl. GST 6%), RM 64.13
* Rounding Adjustment: RM 0.02
**Totals:**
* TOTAL: RM 64.15
* CASH: RM 100.00
* CHANGE: RM 35.85
**Other Details:**
* Main Item Count: 1
* Total Discount: RM 0.00
* GST Summary:
* S (6%): Amount (RM) 60.50, Tax (RM) 3.63
* Z (0%): Amount (RM) 0.00, Tax (RM) 0.00
**Question:**
* What is the total amount of this receipt?
**Solution Steps:**
The solution is presented as a series of steps, each labeled with "Step-N (mc=value)". The "mc" value is the step's expected accuracy estimated via Monte Carlo rollouts, which serves as the process supervision signal in VisualPRM400K.
* **Step-0 (mc=0.5625):** To determine the total amount on this receipt, follow these steps:
* **Step-1 (mc=0.8125):** 1. Identify the items and their prices on the receipt: "S Tiger Bucket" costs RM 55.00.
* **Step-2 (mc=0.5625):** 2. Add up the total cost of the items: Total cost of the items: RM 55.00.
* **Step-3 (mc=0.4375):** 3. Look for any additional charges (if any): There is a service charge (S.Chrg) including GST of RM 5.83.
* **Step-4 (mc=0.5625):** 4. Include the service charge in the total cost: Total with service charge: RM 55.00 + RM 5.83 = RM 60.83.
* **Step-5 (mc=0.8125):** 5. Total Sales (including GST): RM 64.15 is listed under "Total Sales (Incl GST 6%)". This includes the GST (which is calculated based on the service charge).
* **Step-6 (mc=0.5):** 6. Calculate the amount paid by the customer: Cash paid: RM 100.00.
* **Step-7 (mc=0.625):** 7. Subtract the total sales from the amount paid: Change received: RM 100.00 - RM 64.15 = RM 35.85.
* **Step-8 (mc=0.0625):** So, the total amount on this receipt is the customer's cash payment:
* **Step-9 (mc=0.0):** Final answer: 100.00
### Detailed Analysis or ### Content Details
* The receipt shows a single item, "S Tiger Bucket," costing RM 55.00.
* A service charge of RM 5.83 is added, which includes GST.
* The total sales, including GST, is RM 64.13.
* The customer paid RM 100.00 in cash.
* The change given is RM 35.85.
* GST is summarized as 6% on RM 60.50, resulting in RM 3.63 tax, and 0% on RM 0.00, resulting in RM 0.00 tax.
### Key Observations
* The solution incorrectly reports the customer's cash payment (RM 100.00) as the total amount of the receipt; the actual total is RM 64.15, and the near-zero mc values of the final steps reflect this error.
* The steps outline the process of calculating the total cost, including service charges and GST.
* The "mc" values associated with each step vary, potentially indicating the confidence level of the OCR in correctly identifying each step.
### Interpretation
The image presents a receipt and a step-by-step solution to the question about the total amount. The steps break down the calculation of the total cost, including the item price, service charge, and GST, but the final answer (RM 100.00) mistakes the customer's cash payment for the receipt total of RM 64.15. The near-zero mc values of the last two steps flag this error, illustrating how expected-accuracy annotations identify incorrect steps. The receipt itself provides a clear record of the transaction, including the items purchased, taxes, and payment details.
</details>
(a)
<details>
<summary>x9.png Details</summary>

### Visual Description
## Document Analysis: Date Extraction
### Overview
The image presents a document and a question regarding the date mentioned within it. The document is a letter, and the question asks for the date mentioned in the letter. A solution is provided, outlining the steps to identify the date and providing the final answer.
### Components/Axes
The image is divided into two main sections:
1. **Document:** Contains the text of a letter.
2. **Question/Solution:** Contains the question and the step-by-step solution.
### Detailed Analysis or ### Content Details
**1. Document:**
* The letter begins with "Mr. Place:".
* It mentions "Dudley's sister gave me his new address."
* The address is:
* PDW
* Company B - USNTS
* Ward Island, Corpus Christie, Texas.
* The letter states, "I wrote to him today, (11/20) and repeated my that you were busy in NYC today, Washington Monday."
* It also mentions, "This Island is a training base and supplements his book learning."
* The letter is signed "Charles."
* The source is given as: "Source: https://www.industrydocuments.ucsf.edu/docs/ffhl0226"
**2. Question/Solution:**
* **Question:** "What is the date mentioned in the document?"
* **Solution:** The solution is broken down into steps:
* **Step 0:** "To determine the date mentioned in the document, let's follow these steps:"
* **Step 1:**
1. "Look for any reference to a date in the text."
2. "Identify the date by its numerical form or any text that explicitly mentions a date."
* **Step 2:** "Steps broken down:"
* **Step 3:**
1. "In the document, the date "11/20" is mentioned in the text."
2. "This is likely an abbreviation of November 20, indicating a date."
* **Step 4:** "Final answer: 11/20"
### Key Observations
* The date "11/20" is explicitly mentioned in the document.
* The solution correctly identifies and extracts this date.
* The solution provides a logical step-by-step approach to finding the date.
### Interpretation
The image demonstrates a simple information retrieval task. The question asks for a specific piece of information (the date) from a given document. The solution provides a clear and concise method for extracting this information. The document itself is a letter containing the date in parentheses. The solution correctly identifies the date as "11/20" and interprets it as November 20.
</details>
(b)
<details>
<summary>x10.png Details</summary>

### Visual Description
## Mathematics Problem: Sphere Subtraction
### Overview
The image presents a mathematics problem involving the subtraction of red shiny spheres from a collection of objects. It includes a visual representation of the objects and a step-by-step solution to the problem.
### Components/Axes
* **Title:** Mathematics
* **Question:** Subtract all red shiny spheres. How many spheres are left?
* **Solution:** A step-by-step explanation leading to the final answer.
* **Visual Representation:** An image showing several objects, including a red sphere, a blue sphere, a yellow cube, a purple cube, a gray cylinder, and two blue cylinders.
### Detailed Analysis
The solution is broken down into the following steps:
* **Step-0 (mc=0.6875):** "To determine how many spheres are left after subtracting the red shiny sphere, let's break down the objects in the image step by step:"
* **Step-1 (mc=0.0):**
1. "There are two spheres visible in the image."
2. "One of these spheres is red and shiny."
3. "The other sphere is blue and shiny."
* **Step-2 (mc=0.0625):** "Since the question asks for the spheres after removing the red one, we need to subtract the red sphere from the total count of spheres."
* **Step-3 (mc=0.0):** "Final answer: 1"
### Key Observations
* The problem focuses on identifying and subtracting a specific type of object (red shiny spheres).
* The solution explicitly states the presence of two spheres, one red and one blue.
* The final answer indicates that after subtracting the red sphere, one sphere remains.
### Interpretation
The problem is a basic subtraction exercise framed within a visual context. The steps break down the problem into identifying the relevant objects (spheres) and then performing the subtraction based on the given criteria (red and shiny). The "mc" values attached to each step are not defined within the image itself, but are consistent with the Monte-Carlo-estimated expected accuracy that VisualPRM400K's automated pipeline assigns to each step. The image demonstrates a simple mathematical concept using visual aids and a step-by-step explanation.
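Although undefined in the image, the per-step "mc" scores match a Monte-Carlo rollout estimate of the kind used to build VisualPRM400K: sample several completions of the solution prefix ending at a step and record the fraction that reach the correct final answer (0.6875 = 11/16 suggests 16 rollouts). A minimal sketch; the function names and the stubbed sampler are illustrative:

```python
def mc_score(prefix_steps, sample_completion, is_correct, k=16):
    """Monte-Carlo expected accuracy of a partial solution.

    Samples k completions of the solution prefix and returns the
    fraction whose final answer is correct; scores such as
    0.6875 = 11/16 arise naturally from k = 16 rollouts.
    """
    hits = sum(
        1 for _ in range(k)
        if is_correct(sample_completion(prefix_steps))
    )
    return hits / k

# Toy usage with a stubbed sampler that succeeds 11 times out of 16:
outcomes = iter([True] * 11 + [False] * 5)
score = mc_score(
    prefix_steps=["Step-0: break down the objects in the image"],
    sample_completion=lambda steps: next(outcomes),
    is_correct=lambda answer: answer,
)
# score == 0.6875, matching Step-0 above
```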
</details>
(c)
<details>
<summary>x11.png Details</summary>

### Visual Description
## Footprint Analysis: Determining Foot Type from Footprint Diagrams
### Overview
The image presents a science question related to identifying foot types based on footprint diagrams. It includes four footprint diagrams labeled A, B, C, and D, along with a question asking which foot type Diagram D corresponds to (Normal, Flat, Rigid, or Big). A solution is provided, analyzing the characteristics of each foot type and concluding that Diagram D represents a normal foot.
### Components/Axes
* **Title:** Science (top-left)
* **Footprint Diagrams:** Four black footprint diagrams labeled A, B, C, and D with blue rectangles.
* Diagram A: Footprint showing a gap in the midfoot area.
* Diagram B: Footprint showing a relatively complete impression of the foot.
* Diagram C: Footprint showing a relatively complete impression of the foot, but smaller than B.
* Diagram D: Footprint showing a gap in the midfoot area, similar to A.
* **Question:**
* "Diagram D corresponds with what foot type?"
* Options: Normal, Flat, Rigid, Big
* "Please answer the question based on the options mentioned before."
* **Solution:**
* Steps 0-6 with associated 'mc' values (mc = 0.75, 0.9375, 1.0, 1.0, 0.9375, 0.0, 0.0)
* Descriptions of Normal Foot, Flat Foot, Rigid Foot, and Big Foot.
* Analysis of Diagram D's footprint.
* Final answer: Normal
### Detailed Analysis
* **Diagram A:** Shows a footprint with a distinct gap in the arch area.
* **Diagram B:** Shows a footprint with a full impression, indicating a flat foot.
* **Diagram C:** Shows a footprint with a full impression, similar to B, but smaller.
* **Diagram D:** Shows a footprint with a distinct gap in the arch area, similar to A.
* **Step-0:** mc=0.75
* **Step-1:** mc=0.9375
* **Step-2:** mc=1.0
* **Step-3:** mc=1.0
* **Step-4:** mc=0.9375
* **Step-5:** mc=0.0
* **Step-6:** mc=0.0
* **Normal Foot Description:** Evenly distributed pressure, well-defined arches.
* **Flat Foot Description:** Entire foot pressing against the ground, little to no arch visible.
* **Rigid Foot Description:** Harder to identify, shows little flexibility or bouncing when walking.
* **Big Foot Description:** Refers to the general size of the foot rather than its type based on pressure distribution.
* **Analysis of Diagram D:** Heel and front of the foot touching the ground, arch is distinctly visible.
### Key Observations
* Diagrams A and D both show footprints with a visible arch, suggesting a normal foot type.
* Diagrams B and C show footprints with a full impression, suggesting a flat foot type.
* The solution correctly identifies Diagram D as representing a normal foot based on the visible arch.
* The 'mc' values associated with the steps are not clearly explained in the context of the image.
### Interpretation
The image effectively demonstrates how to identify foot types based on footprint analysis. The solution provides clear descriptions of each foot type and accurately matches Diagram D to a normal foot. The 'mc' values are not explained, and their purpose is unclear without additional context. The image highlights the importance of observing the arch area in a footprint to determine foot type.
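The per-diagram analysis above reduces to a small rule on which parts of the sole leave an imprint; a minimal sketch, with illustrative predicate names:

```python
def classify_footprint(heel, forefoot, arch_gap):
    """Classify a footprint from which regions leave an imprint.

    A normal foot prints the heel and forefoot with a visible gap
    at the arch; a flat foot presses the whole sole to the ground.
    """
    if heel and forefoot and arch_gap:
        return "Normal"
    if heel and forefoot and not arch_gap:
        return "Flat"
    return "Indeterminate"

d_type = classify_footprint(heel=True, forefoot=True, arch_gap=True)   # Diagram D
b_type = classify_footprint(heel=True, forefoot=True, arch_gap=False)  # Diagram B
```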
</details>
(d)
<details>
<summary>x12.png Details</summary>

### Visual Description
## Diagram: Symmetry Question and Solution
### Overview
The image presents a question about symmetry, followed by a solution that explains why the answer is "yes." The image contains a stylized representation of a flying object within a square, with a dashed line indicating the axis of symmetry.
### Components/Axes
* **Title:** General VQA
* **Question:** Does this picture have symmetry?
* **Options:**
* A. yes
* B. no
* **Solution:** A step-by-step explanation of how to determine if the picture has symmetry.
* **Image:** A square containing a stylized blue object (described as a flying object) and a dashed line running diagonally from the top-left corner to the bottom-right corner.
* **Step Labels:**
* `## Step-0 (mc=0.625)`
* `## Step-1 (mc=0.0)`
* `## Step-2 (mc=0.0)`
* `## Step-3 (mc=0.0)`
### Detailed Analysis
* **Question:** The question asks whether the picture has symmetry.
* **Object Description:** The object is described as a stylized representation of a flying object.
* **Symmetry Explanation:** The solution explains that the object appears to be bilaterally symmetrical around a line that runs down the middle of the object.
* **Steps:**
1. Identify the object in the picture.
2. Look at the object's shape.
3. Demonstrate symmetry by the object appearing to be a mirror image of itself along the central vertical line.
* **Final Answer:** A (yes)
### Key Observations
* The dashed line in the image visually represents the axis of symmetry.
* The solution argues for bilateral symmetry about a central vertical line, although the dashed axis drawn in the image runs diagonally; the mc=0.0 scores on Steps 1-3 suggest those steps are unreliable.
* The `mc` values associated with each step are present but their meaning is not explicitly defined within the image.
### Interpretation
The image presents a simple visual question about symmetry together with a step-by-step solution. The stylized flying object is designed to be easily recognizable, making the question straightforward. The solution appeals to bilateral symmetry and how it can be visually identified, though its claimed vertical mirror line does not match the diagonal dashed axis in the image. The `mc` values are consistent with the Monte-Carlo-estimated expected accuracy used as process supervision in VisualPRM400K; the near-zero scores on Steps 1-3 flag those steps as unreliable.
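Bilateral symmetry about a vertical axis can be checked mechanically on a binary pixel grid; the grid below is an illustrative toy, not the figure's actual image:

```python
def is_mirror_symmetric(grid):
    """True if a binary pixel grid is symmetric about its central
    vertical axis, i.e. every row reads the same reversed."""
    return all(row == row[::-1] for row in grid)

# Toy 5x3 "flying object" silhouette
silhouette = [
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
]
symmetric = is_mirror_symmetric(silhouette)  # True
```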
</details>
(e)
<details>
<summary>x13.png Details</summary>

### Visual Description
## Chart: Democrats vs. Republicans on Key Issues
### Overview
The image presents a chart comparing the percentage of people who believe either the Democratic or Republican party is better able to handle various issues. It also includes a question about the ratio between Democratic and Republican performance in foreign policy, along with a step-by-step solution.
### Components/Axes
* **Title:** Democrats Hold Advantages on Environment, Abortion, Education; Neither Party Has Edge on the Economy
* **Subtitle:** % who say each party is better able to handle...
* **Y-axis:** Lists various issues: Gun control, Terrorist threat, Budget deficit, Taxes, Immigration, Economy, Foreign policy, Health care, Education, Abortion & contraception, Environment.
* **X-axis:** Percentage scale (implied, but not explicitly marked).
* **Data Series:**
* Democratic Party (represented by blue bars)
* Republican Party (represented by red bars)
* **Additional Column:** "Diff" indicating the difference between Democratic and Republican percentages.
* **Legend:**
* Democratic Party (Blue)
* Republican Party (Red)
* **Source:** Survey conducted July 14-20, 2015, PEW RESEARCH CENTER
### Detailed Analysis
**Chart Data:**
| Issue | Democratic Party | Republican Party | Diff |
| ------------------------ | ---------------- | ---------------- | ------ |
| Gun control | 36% | 48% | R+12 |
| Terrorist threat | 34% | 44% | R+10 |
| Budget deficit | 37% | 41% | R+4 |
| Taxes | 41% | 42% | R+1 |
| Immigration | 42% | 40% | D+2 |
| Economy | 44% | 41% | D+3 |
| Foreign policy | 41% | 38% | D+3 |
| Health care | 46% | 36% | D+10 |
| Education | 46% | 34% | D+12 |
| Abortion & contraception | 50% | 31% | D+19 |
| Environment | 53% | 27% | D+26 |
**Question and Solution:**
* **Question:** What is the ratio between Democratic and Republican Party in Foreign policy?
* **Solution Steps:**
* Step 1: Identify the percentage of Democrats who think the Democratic Party is better at handling foreign policy. In the "Foreign policy" row, the percentage is 41%.
* Step 2: Identify the percentage of Republicans who think the Republican Party is better at handling foreign policy. In the "Foreign policy" row, the percentage is 38%.
* Step 3: The ratio between the percentage of Democrats and Republicans is calculated by dividing the percentage of Democrats by the percentage of Republicans.
* Step 4: \[ \text{Ratio} = \frac{\text{Percentage of Democrats}}{\text{Percentage of Republicans}} \]
* Step 5: \[ \text{Ratio} = \frac{41}{38} \]
* Final answer: \(\frac{41}{38} \)
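The lookup-and-divide procedure in the steps above can be sketched directly; the dictionary simply transcribes the "Foreign policy" row of the chart:

```python
# Percentages transcribed from the Pew chart above
chart = {"Foreign policy": {"Democratic": 41, "Republican": 38}}

row = chart["Foreign policy"]
ratio = row["Democratic"] / row["Republican"]  # 41/38 ≈ 1.079
```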
### Key Observations
* Republicans are perceived to be better at handling Gun control, Terrorist threats, Budget deficits, and Taxes.
* Democrats are perceived to be better at handling Immigration, Economy, Foreign policy, Health care, Education, Abortion & contraception, and Environment.
* The largest differences favoring Democrats are in Environment, Abortion & contraception, and Education.
* The largest differences favoring Republicans are in Gun control and Terrorist threat.
### Interpretation
The chart illustrates public perception of which party is better equipped to handle specific issues. The data suggests a clear partisan divide, with Republicans seen as stronger on issues related to security and fiscal responsibility, while Democrats are favored on social and environmental issues. The provided question and solution demonstrate a simple calculation of the ratio between Democratic and Republican performance in a specific area (foreign policy), highlighting a practical application of the data presented in the chart. The step-by-step solution is designed to guide the user through the process of extracting and using the relevant information.
</details>
(f)
Figure 6: More data examples from VisualPRM400K.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Educational Material: Elements of Art Question and Solution
### Overview
The image presents an educational example from MMMU, focusing on the elements of art. It poses a question related to the formation of enclosed areas by lines and provides a step-by-step solution, defining key terms like space, shape, form, and line. The image is divided into two main sections: a visual representation of the elements of art and a question/solution section.
### Components/Axes
* **Title:** Data Example from MMMU
* **Elements of Art (Visual Section):**
* The visual section displays seven elements of art, each with a corresponding image:
* Line
* Shape
* Form
* Space
* Texture
* Value
* Color
* **Question:**
* Presents a multiple-choice question: "Based on <image 1>. When lines meet to form an enclosed area, this is formed."
* Answer choices: A. Space, B. Shape, C. Form, D. Line
* **Solution:**
* Provides a step-by-step breakdown to arrive at the correct answer.
* Each step is marked with "## Step-X (Correct/Neutral)".
* Definitions of Space, Shape, Form, and Line are provided.
* The final answer is stated as "B".
### Detailed Analysis
* **Visual Section:**
* **Line:** Image shows various lines, including straight, curved, and spiral lines.
* **Shape:** Image shows geometric shapes like squares, triangles, and circles.
* **Form:** Image shows 3D forms like cubes, cones, and spheres.
* **Space:** Image shows objects arranged to create positive and negative space.
* **Texture:** Image shows a textured surface created with various strokes.
* **Value:** Image shows a gradient of colors from light to dark.
* **Color:** Image shows a composition of different colors.
* **Solution Steps:**
* **Step-0 (Neutral):** Initial assessment step.
* **Step-1 (Correct):** Defines Space as the area between or around elements of art.
* **Step-2 (Correct):** Defines Shape as an enclosed area created by lines or other elements.
* **Step-3 (Correct):** Defines Form as the three-dimensional aspect of an object.
* **Step-4 (Correct):** Defines Line as a mark made by a moving point.
* **Step-5 (Correct):** States that when lines meet to form an enclosed area, it is called a "Shape."
* **Step-6 (Correct):** Concludes that the final answer is B (Shape).
### Key Observations
* The image effectively combines visual examples with textual explanations.
* The solution is presented in a clear, step-by-step manner.
* The definitions provided are concise and relevant to the question.
* The use of "## Step-X (Correct/Neutral)" provides a clear indication of the reasoning process.
### Interpretation
The image serves as an educational tool to teach the elements of art. It uses a question-and-solution format to reinforce the understanding of key concepts. The visual examples help to illustrate the definitions, making it easier for learners to grasp the differences between line, shape, form, space, texture, value, and color. The step-by-step solution demonstrates a logical approach to answering art-related questions, emphasizing the importance of understanding definitions and applying them to specific scenarios. The "Neutral" tag on Step-0 marks a framing step judged neither correct nor incorrect, made before the formal definitions are applied.
</details>
(a)
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Music Theory Example
### Overview
The image presents a music theory example, likely from a course or textbook. It includes a musical excerpt, a question asking for the correct key, and a step-by-step solution explaining how to determine the key of the melody.
### Components/Axes
* **Title:** "Data Example from MMMU" (top-left)
* **(a):** Label for the musical excerpt (top-left, next to the excerpt)
* **Musical Excerpt:** A short musical phrase written in standard notation. It includes a treble clef, a key signature with one sharp (F#), and several notes.
* **Question:** "Choose the correct key of each of the melody. <image 1>"
* **Answer Choices:**
* A. B major
* B. E major
* C. A minor
* D. E minor
* **Solution:** A step-by-step explanation of how to determine the key of the melody.
* **Steps:**
* Step-0 (Neutral)
* Step-1 (Correct)
* Step-2 (Correct)
* Step-3 (Correct)
* **Explanation:**
* "To determine the key of the melody, we analyze the key signature, accidentals, and the tonal center. Here's the breakdown:"
* "1. **Key Signature**: The melody has a key signature with one sharp (F#), which corresponds to either **G major** or **E minor**."
* "2. **Accidentals**: The melody contains a D# accidental, which is the leading tone in the key of E minor. This suggests the melody is in **E minor**."
* "3. **Tonal Center**: The melody revolves around E as the tonic, further confirming **E minor** as the key."
* "### Correct Answer: **D. E minor**"
### Detailed Analysis
The solution breaks down the process of identifying the key into three steps: analyzing the key signature, accidentals, and tonal center. The key signature of one sharp (F#) suggests either G major or E minor. The presence of a D# accidental, which is the leading tone in E minor, further supports the E minor key. Finally, the melody revolving around E as the tonic confirms E minor as the key. The correct answer provided is "D. E minor".
### Key Observations
* The solution uses a process of elimination and confirmation to arrive at the correct key.
* The explanation highlights the importance of considering multiple factors when determining the key of a melody.
* The steps are labeled "Correct" except for Step-0, which is labeled "Neutral".
### Interpretation
The example demonstrates a standard method for determining the key of a musical excerpt. It emphasizes the importance of analyzing the key signature, accidentals, and tonal center. The step-by-step solution provides a clear and logical explanation of the process. The "Neutral" label on Step-0 marks a framing step judged neither correct nor incorrect. The example suggests that a combination of theoretical knowledge and careful observation is necessary to accurately identify the key of a melody.
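The three checks in the solution (key signature, accidentals, tonal center) can be sketched as a small lookup; the tables and the function are illustrative, not from the paper:

```python
# Sharps in the key signature -> (major key, relative minor)
KEY_SIGNATURES = {0: ("C major", "A minor"),
                  1: ("G major", "E minor"),
                  2: ("D major", "B minor")}

# Raised leading tone that signals each relative minor
LEADING_TONES = {"A minor": "G#", "E minor": "D#", "B minor": "A#"}

def infer_key(num_sharps, accidentals, tonal_center):
    """Pick between a major key and its relative minor."""
    major, minor = KEY_SIGNATURES[num_sharps]
    # A raised leading tone (or a minor tonal center) points to the minor key
    if LEADING_TONES[minor] in accidentals or tonal_center == minor.split()[0]:
        return minor
    return major

key = infer_key(1, {"D#"}, "E")  # "E minor", as in the solution
```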
</details>
(b)
<details>
<summary>x16.png Details</summary>

### Visual Description
## Chart/Diagram Type: Phillips Curve Diagram
### Overview
The image presents a Phillips Curve diagram, illustrating the relationship between inflation and unemployment rates. It includes a downward-sloping curve (Curve 2) representing the short-run trade-off between inflation and unemployment, and a vertical line (Curve 1) representing the natural rate of unemployment. Several points (A, B, C, D, E) are marked on the diagram, and the image includes a question and solution related to the movement of the economy along the Phillips Curve in response to changes in the money supply growth rate.
### Components/Axes
* **Title:** Data Example from MMMU
* **X-axis:** Unemployment Rate (%) with scale from 0 to 10, incrementing by 1.
* **Y-axis:** Inflation Rate (%) with scale from 0 to 10, incrementing by 1.
* **Curve 2:** A downward-sloping curve representing the short-run Phillips Curve.
* **Curve 1:** A vertical line at approximately 6.5% unemployment rate, representing the natural rate of unemployment.
* **Points:** A (4.5, 5), B (6.5, 3), C (8.5, 1), D (6.5, 6), E (6.5, 2)
* **Question:** If the economy starts at B and the money supply growth rate increases, in the long run the economy: (A) moves to C, (B) moves to D, (C) stays at B.
* **Solution:** A step-by-step explanation of the movement along the Phillips Curve.
### Detailed Analysis
* **Curve 2 (Phillips Curve):**
* The curve slopes downward, indicating an inverse relationship between unemployment and inflation.
* Point A: Approximately (4.5, 5)
* Point B: Approximately (6.5, 3)
* Point C: Approximately (8.5, 1)
* **Curve 1 (Natural Rate of Unemployment):**
* A vertical line at approximately 6.5% unemployment.
* Point D: Approximately (6.5, 6)
* Point E: Approximately (6.5, 2)
* **Solution Steps:**
* **Step-0 (Correct):** General statement about the Phillips Curve.
* **Step-1 (Correct):** Identifies point B as the initial point (6% unemployment, 3% inflation).
* **Step-2 (Correct):** Explains the impact of increased money supply growth rate.
* **Step-3 (Correct):** Describes the Phillips Curve movement in the long run.
* **Step-4 (Incorrect):** Incorrectly identifies the new point.
* **Step-5 (Neutral):** Given these steps.
* **Step-6 (Incorrect):** Incorrect statement about the economy's movement.
* **Step-7 (Incorrect):** Incorrect statement.
* **Final answer:** A
### Key Observations
* The Phillips Curve (Curve 2) demonstrates the trade-off between inflation and unemployment.
* Curve 1 represents the natural rate of unemployment, where inflation does not affect unemployment in the long run.
* The solution explains how an increase in the money supply growth rate leads to a movement along the Phillips Curve.
### Interpretation
The diagram illustrates the economic concept of the Phillips Curve, which suggests an inverse relationship between inflation and unemployment in the short run. The vertical line represents the natural rate of unemployment, indicating that in the long run, changes in inflation do not affect the unemployment rate. The question and solution demonstrate how an increase in the money supply growth rate can lead to a movement along the Phillips Curve, resulting in higher inflation while maintaining the same natural rate of unemployment. The incorrect steps in the solution highlight potential misunderstandings of the long-run implications of monetary policy.
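The long-run adjustment described above can be sketched schematically; the natural rate and the size of the inflation shift are read off the figure and are illustrative:

```python
NATURAL_RATE = 6.5  # percent; Curve 1 in the figure

def long_run_point(start, inflation_shift):
    """In the long run, unemployment returns to the natural rate
    and only inflation moves (here, up by the given shift)."""
    _unemployment, inflation = start
    return (NATURAL_RATE, inflation + inflation_shift)

# Starting at B = (6.5, 3), faster money growth raises inflation;
# a +3-point shift lands on D = (6.5, 6).
new_point = long_run_point((6.5, 3), 3)
```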
</details>
(c)
<details>
<summary>x17.png Details</summary>

### Visual Description
## Diagram: Math Problem and Solution
### Overview
The image presents a math problem involving the addition of two two-digit numbers formed from the digits 2, 3, 4, and 5. It shows a visual representation of the addition problem with empty boxes, followed by the problem statement, and a step-by-step solution, indicating which steps are correct and incorrect.
### Components/Axes
* **Title:** Data Example from MathVision
* **Visual Representation:** Two sets of two empty squares separated by a "+" sign.
* **Question:** A text block posing the math problem.
* **Solution:** A step-by-step breakdown of the solution, with correctness indicated.
### Detailed Analysis
**1. Visual Representation:**
* Two sets of two empty squares are shown, separated by a plus sign. This visually represents the addition of two two-digit numbers.
**2. Question:**
* The question asks to find the largest possible sum of two numbers formed by placing the digits 2, 3, 4, and 5 in a square.
* The question provides a hint to answer the question and provide the final answer at the end.
**3. Solution:**
The solution is broken down into steps, with each step marked as either "Correct" or "Incorrect".
* **Step-0 (Correct):** To maximize the sum of the two numbers, we should form the largest possible numbers by arranging the digits in descending order. Here's how:
* **Step-1 (Incorrect):** Arrange the digits (2, 3, 4, 5) in descending order: (5, 4, 3, 2).
* **Step-2 (Incorrect):** Split them into two numbers such that the largest digits are in the tens place. This ensures the largest sum.
* First number: (54) - Second number: (32)
* **Step-3 (Correct):** Add the two numbers:
* 54 + 32 = 86
* **Step-4 (Incorrect):** Final Answer: **86**
### Key Observations
* The problem involves finding the maximum sum of two two-digit numbers formed using the digits 2, 3, 4, and 5.
* The solution attempts to guide the user through the process, but some steps are marked as incorrect.
* The final answer provided is 86.
### Interpretation
The image presents a math problem and a step-by-step solution. The "Incorrect" labels flag a genuine error in the reasoning: the split 54 + 32 places the digit 4 in a units position, giving 86, whereas placing the two largest digits (5 and 4) in the tens places yields the true maximum of 95 (e.g., 53 + 42). The final answer of 86 is therefore not maximal, and the labels correctly identify the flawed steps.
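The maximal sum can be verified by brute force over every placement of the four digits (a minimal sketch):

```python
from itertools import permutations

# Place 2, 3, 4, 5 into the four boxes of the two two-digit
# addends and take the largest resulting sum.
best = max(
    (10 * a + b) + (10 * c + d)
    for a, b, c, d in permutations([2, 3, 4, 5])
)
# best == 95, e.g. 53 + 42; the split 54 + 32 = 86 is not maximal.
```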
</details>
(d)
<details>
<summary>x18.png Details</summary>

### Visual Description
## Chart/Diagram Type: Quadratic Function Identification
### Overview
The image presents a multiple-choice question related to quadratic functions. It asks the user to identify which of five provided graphs does not belong with the others, as the other four represent pieces of the same quadratic function. The image includes the question, possible answers (A through E), and a detailed solution explaining the reasoning behind the correct answer.
### Components/Axes
* **Title:** "Data Example from MathVision"
* **Question:** Asks which of the five graphs does not belong with the others, as the other four represent pieces of the same quadratic function.
* **Choices:** (A) A, (B) B, (C) C, (D) D, (E) E
* **Graphs:** Five separate coordinate plane graphs labeled (A) through (E). Each graph displays a different line or curve.
* **Axes:** Each graph has an x-axis and a y-axis.
* **Graph A:** x-axis marked at 1 and 2.
* **Graph B:** y-axis marked at -2 and -1.
* **Graph C:** x-axis marked at 3 and 4.
* **Graph D:** y-axis marked at -2 and -1.
* **Graph E:** x-axis marked at -4 and -3.
* **Solution:** Provides a step-by-step explanation of how to solve the problem.
* **Step-0:** (Neutral)
* **Step-1:** (Correct)
* **Step-2:** (Correct)
* **Step-3:** (Correct)
* **Step-4:** (Correct)
* **Analysis:** Describes each option (A-E) and its consistency with a quadratic function.
* **Conclusion:** States that graph C does not belong because it is not a parabola.
* **Correct Answer:** C
### Detailed Analysis
* **Graph A:** A curve increasing from left to right.
* **Graph B:** A curve opening downwards.
* **Graph C:** A straight line with a negative slope.
* **Graph D:** A curve opening upwards.
* **Graph E:** A curve opening downwards.
The solution explains that quadratic functions have a parabolic shape (either opening upwards or downwards). It then analyzes each option:
* **Option A:** A curve increasing from left to right, consistent with a quadratic function opening upwards.
* **Option B:** A curve decreasing from left to right, consistent with a quadratic function opening downwards.
* **Option C:** A line, not a curve, inconsistent with a quadratic function.
* **Option D:** A curve increasing from left to right, consistent with a quadratic function opening upwards.
* **Option E:** A curve decreasing from left to right, consistent with a quadratic function opening downwards.
### Key Observations
* Graph C is the only graph that is a straight line, while the others are curves.
* The solution correctly identifies that a straight line is not a parabola.
### Interpretation
The image is a problem-solving exercise designed to test the understanding of quadratic functions and their graphical representation. The key takeaway is that quadratic functions are represented by parabolas, and any graph that is not a parabola (like a straight line) cannot be part of the same quadratic function. The solution clearly explains the reasoning, making it a useful educational tool.
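The line-versus-parabola distinction the solution relies on can be tested numerically on equally spaced samples; this finite-difference check is an illustrative sketch, not part of the paper:

```python
def classify(ys):
    """Classify equally spaced samples: a line has constant first
    differences, a quadratic has constant nonzero second differences."""
    d1 = [b - a for a, b in zip(ys, ys[1:])]
    d2 = [b - a for a, b in zip(d1, d1[1:])]
    if all(v == 0 for v in d2):
        return "line"
    if len(set(d2)) == 1:
        return "parabola"
    return "other"

line_kind = classify([1, 3, 5, 7])       # y = 2x + 1 -> "line"
curve_kind = classify([0, 1, 4, 9, 16])  # y = x^2    -> "parabola"
```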
</details>
(e)
<details>
<summary>x19.png Details</summary>

### Visual Description
## Diagram: Fish and Ring Problem
### Overview
The image presents a problem involving a series of fish connected by a line, with a ring at one end. The task is to determine how many fish will have their heads pointing towards the ring when the line is straightened. The image includes a visual representation of the fish and line, along with a textual question, hint, and a step-by-step solution, with some steps marked as "Incorrect" or "Neural".
### Components/Axes
* **Title:** Data Example from MathVision
* **Question:** How many fish will have their heads pointing towards the ring when we straighten the line?
* **Hint:** Please answer the question and provide the final answer at the end.
* **Solution Steps:**
* Step-0 (Neutral)
* Step-1 (Correct)
* Step-2 (Neutral)
* Step-3 (Incorrect)
* Step-4 (Incorrect)
* Step-5 (Incorrect)
* Step-6 (Incorrect)
* **Visual Representation:** A diagram showing multiple fish of different colors (green, blue, pink, yellow) connected by a winding line. A ring is present at one end of the line.
### Detailed Analysis
**Textual Content:**
* **Question:** How many fish will have their heads pointing towards the ring when we straighten the line?
* **Solution:**
* To determine how many fish will have their heads pointing towards the ring when we straighten the line, follow these steps:
1. Identify the fish with their heads pointing towards the ring in the current configuration.
2. Visualize straightening the line and see which fish's heads will align with the ring.
* Step-by-step reasoning:
1. In the current configuration, the fish with their heads pointing towards the ring are:
* The green fish on the left side.
2. When the line is straightened, the fish with their heads pointing towards the ring will be those whose heads are aligned with the ring's position in the straightened line.
* From the image, it appears that only the green fish at the far left is aligned with the ring's position in the straightened line.
* Final answer: 3
**Visual Elements:**
* The diagram shows approximately 10 fish connected by a line. The line is not straight but has several curves and overlaps.
* The fish are of various colors, including green, blue, yellow, and pink.
* The ring is located at the left end of the line.
### Key Observations
* The solution attempts to guide the user through the reasoning process.
* Several steps in the solution are marked as "Incorrect," suggesting that the provided reasoning is flawed.
* The final answer provided is "3".
* The diagram is crucial for visualizing the problem and determining the correct answer.
### Interpretation
The image presents a problem-solving exercise that combines visual reasoning with textual instructions. The "Incorrect" labels on several steps indicate that the provided solution is not entirely accurate, potentially serving as a learning opportunity to identify the flaws in the reasoning. The final answer of "3" does not follow from the solution's own reasoning, which identifies only a single green fish, an inconsistency consistent with the "Incorrect" labels on the later steps. The exercise highlights the importance of spatial reasoning and the ability to visualize transformations.
</details>
(f)
<details>
<summary>x20.png Details</summary>

### Visual Description
## Chart/Diagram Type: Inequality Graph and Solution Explanation
### Overview
The image presents a problem involving an inequality represented graphically on a coordinate plane, along with a step-by-step solution. The problem asks to state the inequality that describes the shaded region in the number plane. The solution provides a reasoning process to arrive at the correct inequality.
### Components/Axes
* **Coordinate Plane:** A standard Cartesian coordinate plane with x and y axes.
* **x-axis:** Ranges from approximately -10 to 10, with tick marks at integer values.
* **y-axis:** Ranges from approximately -10 to 10, with tick marks at integer values.
* **Line:** A solid black line passes through the points (-5, -5) and (0, 5).
* **Shaded Region:** The area above the line is shaded in gray, indicating the region that satisfies the inequality.
* **Question:** Text asking to state the inequality that describes the shaded region.
* **Solution:** A step-by-step explanation of how to determine the inequality.
* **Steps:** The solution is broken down into steps, labeled "Step-0 (Neutral)", "Step-1 (Incorrect)", "Step-2 (Incorrect)", "Step-3 (Incorrect)", and "Step-4 (Incorrect)".
* **Reasoning:** Each step includes reasoning to explain the process.
### Detailed Analysis
**Coordinate Plane Details:**
* The x-axis and y-axis are labeled.
* The grid lines are light gray.
* The line intersects the y-axis at approximately y = 5.
* The line has a positive slope.
**Solution Breakdown:**
* **Step-0 (Neutral):** This step is not elaborated upon.
* **Step-1 (Incorrect):**
* **Reasoning:**
* Understanding the problem: The problem asks to find the inequality that represents the shaded region. The boundary line is solid, indicating inclusion in the inequality. The shaded region is above the line.
* **Step-2 (Incorrect):**
* **Reasoning:**
* Finding the equation of the line: Two points on the line are identified: (-5, -5) and (0, 5).
* The slope (m) is calculated as:
```
m = (y2 - y1) / (x2 - x1) = (5 - (-5)) / (0 - (-5)) = 10 / 5 = 2
```
* The line equation is in slope-intercept form (y = mx + b). Substituting (0,5) to find b:
```
5 = 2(0) + b implies b = 5
```
* The equation of the line is:
```
y = 2x + 5
```
* **Step-3 (Incorrect):**
* **Reasoning:**
* Determining the inequality: Since the shaded region is above the line, the inequality is y ≥ 2x + 5.
* **Step-4 (Incorrect):**
* **Final Answer:**
```
y ≥ 2x + 5
```
### Key Observations
* The solution correctly identifies the key steps to solve the problem.
* The slope calculation and line equation derivation are accurate.
* The final inequality is correctly stated based on the shaded region.
* The steps are marked as "Incorrect", suggesting that the ground-truth step labels disagree with the solution as reconstructed here, for example because the boundary line or shaded region was read differently.
### Interpretation
The image demonstrates a problem-solving approach to finding the inequality represented by a shaded region on a coordinate plane. The solution breaks the problem into understanding the task, finding the equation of the line, and determining the inequality. The steps are marked as incorrect, which may indicate that the ground-truth labels assume a different reading of the boundary line or shaded region, or a requirement of the original problem not met by the provided solution. As read here, the derivation yields y ≥ 2x + 5, which represents all points on or above the line y = 2x + 5.
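The slope, intercept, and inequality direction derived in Steps 2-3 can be checked numerically from the two points read off the graph (a minimal sketch):

```python
# Two points read off the boundary line in the figure
(x1, y1), (x2, y2) = (-5, -5), (0, 5)

m = (y2 - y1) / (x2 - x1)  # slope: 10 / 5 = 2
b = y1 - m * x1            # intercept: -5 - 2 * (-5) = 5

def in_region(x, y):
    """True for points on or above the solid line y = 2x + 5."""
    return y >= m * x + b

above = in_region(0, 6)   # in the shaded region
below = in_region(0, 0)   # not in the shaded region
```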
</details>
(g)
<details>
<summary>x21.png Details</summary>

### Visual Description
## Geometric Problem: Circle and Tangent
### Overview
The image presents a geometry problem involving a circle, a tangent line, and the calculation of an angle. It includes a diagram, a question, and a step-by-step solution.
### Components/Axes
* **Title:** Data Example from MathVerse
* **Question:** According to the question shown in the image, please directly answer the question and provide the correct option letter, e.g., A, B, C, D.
* **Diagram:** A circle with center O. Points A, B, and C lie on the circumference. Line CD is tangent to the circle at point C. Angle CDB is labeled as 50°.
* **Choices:** A:20° B:25° C:40° D:50°
* **Solution Steps:**
* Step-0 (Correct)
* Step-1 (Correct)
* Step-2 (Correct)
* Step-3 (Incorrect)
* Step-4 (Correct)
* Step-5 (Correct)
### Detailed Analysis
**Diagram Details:**
* A circle with center labeled 'O'.
* Points 'A', 'B', and 'C' are on the circumference of the circle.
* Line segment 'AC' and 'BC' are drawn.
* Line 'CD' is tangent to the circle at point 'C'.
* Angle CDB is labeled as 50°.
**Solution Text:**
* To find the degree of angle A, we can use the property that the angle subtended by an arc at the center of the circle is twice the angle subtended at any point on the circle.
* **1. Identify the key points:**
* The arcs are AC and BC.
* Angle A is subtended by these arcs at point O (the center of the circle).
* **2. Understand the given information:**
* CD is a tangent at point C.
* The angle between the tangent and the chord is equal to the angle subtended by the chord at the circumference. Therefore, angle CDB = 50°.
* **3. Angle properties:**
* The angle A is subtended by arc BC at the center.
* Since CD is a tangent, angle A = 50° (as the inscribed angle is half the central angle).
* Thus, the degree of angle A is 20°.
* Final answer: A
### Key Observations
* The solution incorrectly states that angle A is 50 degrees and then concludes that the degree of angle A is 20 degrees.
* Step-3 is marked as incorrect.
### Interpretation
The image presents a geometry problem and its solution. The solution attempts to find the degree of angle A using circle and tangent properties, but Step-3 is marked as incorrect and the final answer rests on an internally inconsistent statement (angle A is first set to 50° and then declared to be 20°). Based on the diagram and given information, the intended answer is 25° (option B): angle CDB is 50°, and angle A is half of that.
</details>
(h)
<details>
<summary>x22.png Details</summary>

### Visual Description
## Chart/Diagram Type: Math Problem with Graph
### Overview
The image presents a math problem involving two functions, f(x) and g(x). It asks how many roots the functions have in common. The image includes the equation for f(x), a graph of g(x), and a step-by-step solution to the problem.
### Components/Axes
* **Title:** Data Example from MathVerse
* **Problem Statement:**
* f(x) = x² + x - 6. How many roots do the functions have in common, f(x) and g(x)?
* Choices: A: f and g share the same root(s). B: f and g share one root in common but each have another root that is not shared. C: f and g share no roots in common.
* **Graph:**
* Axes: x-axis and y-axis
* x-axis: Ranges from -7 to 7, with integer markings.
* y-axis: Ranges from -7 to 7, with integer markings.
* Curve: A parabola representing g(x), labeled as "y = g(x)". The parabola intersects the x-axis at x = -3 and x = 4. The vertex of the parabola appears to be around (0.5, -2.25).
* **Solution:**
* Step-by-step reasoning to find the roots of f(x) and g(x) and compare them.
* Final answer indicating that f(x) and g(x) share one root in common but each have another root that is not shared.
### Detailed Analysis
* **Function f(x):** f(x) = x² + x - 6
* **Roots of f(x):** The solution shows that f(x) can be factored into (x + 3)(x - 2), so the roots are x = -3 and x = 2.
* **Graph of g(x):** The parabola intersects the x-axis at x = -3 and x = 4. Therefore, the roots of g(x) are x = -3 and x = 4.
* **Comparison:** Both functions share the root x = -3. f(x) has a root at x = 2, while g(x) has a root at x = 4.
* **Solution Steps:**
* Step-0 (Neutral)
* Step-1 (Correct)
* Step-2 (Incorrect)
* Step-3 (Incorrect)
* Step-4 (Incorrect)
* **Answer:** The correct option is B. f and g share one root in common but each have another root that is not shared.
### Key Observations
* The graph of g(x) is a parabola opening upwards.
* The roots of f(x) are found algebraically, while the roots of g(x) are determined from the graph.
* As summarized here, the comparison is consistent: x = -3 is shared, with x = 2 unique to f(x) and x = 4 unique to g(x).
### Interpretation
The problem demonstrates how to find the roots of a quadratic function algebraically and graphically. By comparing the roots of f(x) and g(x), the solution determines the number of common roots and selects answer choice B. The "Incorrect" labels on steps 2-4, however, indicate that the actual solution in the figure goes wrong somewhere along this path; the summary above reflects the intended reasoning rather than the annotated mistakes.
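The root comparison above can be sketched in Python; the roots of g(x) are the approximate x-intercepts read off the graph, as described:

```python
import math

# Roots of f(x) = x^2 + x - 6 via the quadratic formula
# (equivalently, factoring as (x + 3)(x - 2)).
a, b, c = 1, 1, -6
disc = b * b - 4 * a * c
roots_f = {(-b - math.sqrt(disc)) / (2 * a), (-b + math.sqrt(disc)) / (2 * a)}

# Roots of g(x) read off the graph in the figure (assumed values).
roots_g = {-3.0, 4.0}

shared = roots_f & roots_g
print(sorted(roots_f))  # [-3.0, 2.0]
print(sorted(shared))   # [-3.0] -> one shared root, i.e. option B
```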
</details>
(i)
<details>
<summary>x23.png Details</summary>

### Visual Description
## Diagram: Protractor with Angle Measurement
### Overview
The image presents a protractor diagram used to measure an angle. It includes a question asking for the angle measurement and a solution section with steps to determine the angle. The steps are marked as "Incorrect" except for the first step, which is marked as "Neutral".
### Components/Axes
* **Title:** Data Example from DynaMath
* **Diagram:** A protractor showing angle measurements from 0 to 180 degrees in both directions. A blue angle is drawn on the protractor.
* **Question:** "Based on the measurement results shown in the diagram, this blue angle is ()°."
* **Solution:**
* "## Step-0 (Neutral)"
* "## Step-1 (Incorrect)"
* "1. Observe the protractor: The protractor measures angles from 0° to 180°."
* "2. Identify the measurement: The blue angle is indicated by a line pointing to a specific degree on the protractor."
* "3. Locate the angle: The blue line points to the number 40° on the protractor."
* "## Step-2 (Incorrect)"
* "Thus, the measure of the blue angle is 40°."
* "## Step-3 (Incorrect)"
* "Final answer: 40"
### Detailed Analysis
* **Protractor:** The protractor displays angle measurements from 0 to 180 degrees in both clockwise and counter-clockwise directions. The scale is marked in increments of 10 degrees, with finer gradations for individual degrees.
* **Blue Angle:** A blue line originates from the center of the protractor and extends to the 40-degree mark. Another blue line extends from the center of the protractor along the 0/180 degree line.
* **Question:** The question asks for the measurement of the blue angle shown in the diagram.
* **Solution Steps:**
* Step 0 is marked as "Neutral".
* Steps 1, 2, and 3 are marked as "Incorrect".
* The steps outline a process of observing the protractor, identifying the measurement, and locating the angle.
* The final answer provided is 40.
### Key Observations
* The transcribed solution reads the blue angle as 40 degrees.
* Every reasoning step is annotated as "Incorrect" even though the transcription arrives at that answer.
### Interpretation
The image presents a simple angle-measurement problem using a protractor. Because the benchmark's step labels are human-annotated, the "Incorrect" marks suggest the solution misreads the diagram (a dual-scale protractor is a classic source of 40° vs. 140° confusion) rather than a flaw in the evaluation. The "Neutral" tag on Step-0 marks an introductory step with no checkable claim.
</details>
(j)
<details>
<summary>x24.png Details</summary>

### Visual Description
## Data Table and Problem Solution: DynaMath Example
### Overview
The image presents a data table showing house sales numbers for different cities, along with a question about finding the range of these numbers and a step-by-step solution. The solution is marked as incorrect, except for the initial step.
### Components/Axes
* **Title:** Data Example from DynaMath
* **Table Headers:**
* City
* Sales number
* **Question:** A real estate agent looked into how many houses were sold in different cities. What is the range of the numbers?
* **Solution:** A step-by-step breakdown of how to calculate the range, with each step marked as either "Neutral" or "Incorrect".
### Detailed Analysis
**Data Table:**
| City | Sales number |
| ------------- | ------------ |
| Briggs Corner | 83 |
| New Humburg | 83 |
| Melville | 86 |
| Fremont | 81 |
| Liberty City | 92 |
| Charles Falls | 84 |
| Pleasant Town | 93 |
| Penny Town | 97 |
**Solution Steps:**
* **Step-0 (Neutral):** Let me help you find the range of the sales numbers.
* **Step-1 (Incorrect):** The range is calculated by subtracting the minimum value from the maximum value.
* **Step-2 (Incorrect):** Minimum sales number: 81 (Fremont)
Maximum sales number: 97 (Penny Town)
* **Step-3 (Incorrect):** Range = Maximum - Minimum
Range = 97-81 = 16
* **Step-4 (Incorrect):** Therefore, the range of house sales numbers across these cities is 16.
### Key Observations
* The sales numbers vary across different cities.
* The solution applies the standard range formula, yet its steps are annotated as incorrect.
* Step 0 is marked as "Neutral", while the rest are marked as "Incorrect".
### Interpretation
The image presents a problem-solving exercise whose goal is to find the range of the house sales numbers across the listed cities. Relative to the transcribed table, the solution's arithmetic is consistent:
1. Identify the minimum sales number (81, from Fremont).
2. Identify the maximum sales number (97, from Penny Town).
3. Subtract the minimum from the maximum: 97 - 81 = 16.
The steps are nonetheless annotated as "Incorrect", which suggests the transcribed numbers may not match the actual figure (DynaMath regenerates values for each problem variant) rather than an error in the method itself. The "Neutral" tag on Step-0 marks an introductory sentence with no checkable claim.
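A one-line check of the range computation, using the sales numbers transcribed from the table (illustrative; the actual figure may show different values):

```python
# Sales numbers as transcribed from the table above.
sales = {
    "Briggs Corner": 83, "New Humburg": 83, "Melville": 86,
    "Fremont": 81, "Liberty City": 92, "Charles Falls": 84,
    "Pleasant Town": 93, "Penny Town": 97,
}

value_range = max(sales.values()) - min(sales.values())
print(value_range)  # 16 (= 97 - 81)
```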
</details>
(k)
<details>
<summary>x25.png Details</summary>

### Visual Description
## Chart Type: Radar Chart
### Overview
The image presents a radar chart displaying five categories (A, B, C, D, and E) with values ranging from 0 to 100. The chart is accompanied by a question asking which category has the second-highest value, along with a solution.
### Components/Axes
* **Title:** Radar Chart
* **Categories:** A, B, C, D, E (arranged clockwise around the chart)
* **Scale:** 0 to 100 (represented by concentric circles, with markers at 20, 40, 60, 80, and 100)
* **Data Representation:** A blue shaded area connecting the values for each category.
### Detailed Analysis
* **Category A:** Approximately 30
* **Category B:** Approximately 40
* **Category C:** Approximately 85
* **Category D:** Approximately 70
* **Category E:** Approximately 20
### Key Observations
* Category C has the highest value.
* Category D has the second-highest value.
* Categories A, B, and E have significantly lower values compared to C and D.
### Interpretation
The radar chart visually represents the relative values of five different categories. The solution correctly identifies category D as having the second-highest value. The chart allows for a quick comparison of the categories and their respective magnitudes. The question and solution demonstrate how to interpret the data presented in the radar chart.
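The second-highest lookup can be sketched as follows; the category values are the approximate readings listed above, not exact data:

```python
# Approximate category values read off the radar chart.
values = {"A": 30, "B": 40, "C": 85, "D": 70, "E": 20}

# Categories ranked from highest to lowest value.
ranked = sorted(values, key=values.get, reverse=True)
print(ranked[0])  # C (highest)
print(ranked[1])  # D (second-highest)
```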
</details>
(l)
<details>
<summary>x26.png Details</summary>

### Visual Description
## Math Problem: Conical Water Container Radius Calculation
### Overview
The image presents a math problem involving the calculation of the radius of a conical water container's base, given its circumference. It includes a diagram of a cone with a height of 20cm, the problem statement, and a step-by-step solution.
### Components/Axes
* **Title:** Data Example from WeMath
* **Diagram:** A cone with a height labeled as "20cm".
* **Question:** Given that the circumference of the upper edge of a conical water container is 62.8 cm, what is the radius of its base circle in cm?
* **Answer Choices:** A. 8 B. 9 C. 10 D. Cannot be determined E. No correct answer
* **Solution:** A step-by-step solution is provided, marked with "Step-0" through "Step-6", each labeled as "(Correct)".
### Detailed Analysis
The solution breaks down as follows:
* **Step-0:** To determine the **radius** of the base of the cone.
* **Step-1:** The formula for the circumference of a circle is:
* `C = 2πr` where `C` is the circumference and `r` is the radius.
* **Step-2:** We are given `C = 62.8 cm`. Substitute this value into the formula:
* `62.8 = 2πr`
* **Step-3:** Solve for `r`:
* `r = 62.8 / (2π)`
* **Step-4:** Use the approximation `π ≈ 3.14`:
* `r = 62.8 / (2 * 3.14) = 62.8 / 6.28 = 10 cm`
* **Step-5:** Thus, the radius of the base circle is **10 cm**.
* **Step-6:** Final Answer: **C. 10**
### Key Observations
* The problem provides the circumference of a circle and asks for the radius.
* The solution uses the standard formula for the circumference of a circle.
* The value of pi is approximated as 3.14.
* The correct answer is C. 10.
* The height of the cone (20cm) is provided in the diagram but is not used in the calculation.
### Interpretation
The image presents a straightforward geometry problem and its solution. The step-by-step solution clearly demonstrates how to calculate the radius of a circle given its circumference. The inclusion of the cone diagram and height might be intended to add context, but the height is ultimately irrelevant to solving the problem as stated. The problem is well-structured and the solution is logically sound.
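The radius computation in Steps 1-4 reduces to r = C / (2π); a quick Python check, using both the figure's π ≈ 3.14 and full precision:

```python
import math

C = 62.8  # given circumference of the cone's upper edge, in cm

r_approx = C / (2 * 3.14)    # with the figure's approximation pi = 3.14
r_exact = C / (2 * math.pi)  # with full-precision pi

print(r_approx)           # 10.0 -> option C
print(round(r_exact, 3))  # 9.995 (62.8 is itself a rounded circumference)
```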
</details>
(m)
<details>
<summary>x27.png Details</summary>

### Visual Description
## Geometric Diagram with Solution
### Overview
The image presents a geometry problem involving a quadrilateral ABCD with an axis of symmetry EF. The problem asks for the perimeter of parallelogram ABFD, given that CD = 5 cm and AD = 3 cm. The image includes a diagram of the quadrilateral and a step-by-step solution to the problem.
### Components/Axes
* **Title:** Data Example from WeMath
* **Question:** As shown in the diagram, EF is the axis of symmetry for quadrilateral ABCD. Given that CD = 5 cm and AD = 3 cm, what is the perimeter of parallelogram ABFD in cm?
* Possible answers: A. 12 B. 10 C. 14 D. 16 E. No correct answer
* **Diagram:**
* Quadrilateral ABCD
* Axis of symmetry: EF (dashed line)
* Point E on line AD
* Point F on line BC
* AD = 3 cm
* CD = 5 cm
* **Solution:** Step-by-step solution with correctness indicators.
### Detailed Analysis
**Diagram Description:**
* A quadrilateral ABCD is shown. Side AD is parallel to side BC.
* A dashed line EF runs vertically, bisecting the quadrilateral. E lies on AD, and F lies on BC. EF is the axis of symmetry.
* The length of AD is labeled as 3 cm.
* The length of CD is labeled as 5 cm.
**Solution Steps:**
The solution is presented as a series of steps, each marked with "## Step-n" where n is the step number, and a correctness indicator "(Correct)" or "(Incorrect)".
* **Step-0 (Correct):** Implicitly states the goal: To find the perimeter of parallelogram ABFD, we need to use the given information about the quadrilateral ABCD and its symmetry with respect to EF.
* **Step-1 (Correct):** EF is the axis of symmetry for the quadrilateral ABCD. This means that EF divides ABCD into two congruent halves.
* **Step-2 (Incorrect):** Since EF is the axis of symmetry and AD = 3 cm, AD = DF.
* **Step-3 (Incorrect):** CD is given as 5 cm. Since EF is the axis of symmetry, CD = BF.
* **Step-4 (Correct):** Now, let's calculate the perimeter of the parallelogram ABFD: AD + DF + BF + AB
* **Step-5 (Incorrect):** Since AD = DF and CD = BF, we can substitute the measurements: AD = 3 cm
* **Step-6 (Incorrect):** BF = CD = 5 cm
* **Step-7 (Correct):** Therefore, the perimeter is: AD + DF + BF + AB = 3 cm + 3 cm + 5 cm + 5 cm = 16 cm. Final answer: D
### Key Observations
* The diagram shows a quadrilateral ABCD with an axis of symmetry EF.
* The solution attempts to find the perimeter of parallelogram ABFD.
* Several steps in the solution are marked as "Incorrect".
* The final answer provided is D, which corresponds to 16 cm.
### Interpretation
The image presents a geometry problem and a proposed solution. Several intermediate steps are annotated "(Incorrect)" even though the final perimeter of 16 cm matches option D, so the errors lie in the justification (most likely a misapplication of the symmetry relation when deducing DF and BF) rather than in the final arithmetic. The problem leverages symmetry to deduce the lengths of DF and BF, which are needed to compute the perimeter of parallelogram ABFD.
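The final perimeter arithmetic can be checked directly; the side lengths follow the substitutions made in the transcribed solution (DF = AD = 3 and AB = BF = 5, as used there):

```python
# Side lengths as substituted in the transcribed solution (cm).
AD, DF = 3, 3  # DF = AD by symmetry, per the solution
BF, AB = 5, 5  # BF = CD = 5, per the solution

perimeter = AD + DF + BF + AB
print(perimeter)  # 16 -> option D
```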
</details>
(n)
<details>
<summary>x28.png Details</summary>

### Visual Description
## Question and Solution Example: Book Color Identification
### Overview
The image presents a question and its solution related to identifying the color of the bottommost book in a stack depicted in a diagram. The diagram shows two desks with stacks of books, and a cartoon girl walking between them. The solution provides a step-by-step explanation to arrive at the correct answer.
### Components/Axes
* **Title:** Data Example from WeMath
* **Diagrams:** Two desks labeled (1) and (2). Desk (1) has a stack of books colored blue, red, and green (from bottom to top). Desk (2) has a stack of books colored yellow, green, and red (from bottom to top). A cartoon girl is walking between the desks.
* **Question:** "As shown in the diagram, what is the color of the bottommost book placed on Desk ①?"
* **Options:** A. Red B. Green C. Blue D. Yellow E. No correct answer
* **Solution:**
* Step-0 (Neutral)
* Step-1 (Correct)
* Step-2 (Correct)
* Step-3 (Correct)
* Explanation: "To solve this question, we need to visually analyze the stack of books on Desk ① depicted in the diagram. Starting from the top and moving downward, the books are arranged in the following order:"
1. Green (topmost book)
2. Red (second book)
3. Blue (third book, bottommost book)
* Conclusion: "Thus, the color of the bottommost book on Desk ① is Blue."
* Final Answer: C. Blue
### Detailed Analysis
The question asks for the color of the bottommost book on Desk ①. Desk ① corresponds to desk (1) in the diagram. The solution correctly identifies the order of the books from top to bottom as Green, Red, and Blue. Therefore, the bottommost book is Blue.
### Key Observations
* The question refers to Desk ①, which is visually represented as desk (1).
* The solution correctly identifies the order of the books and the color of the bottommost book.
* The steps are marked as "Correct", indicating a validated solution.
### Interpretation
The image provides a simple example of a question that requires visual analysis and deduction. The solution demonstrates a clear and logical approach to solving the problem. The example highlights the importance of careful observation and attention to detail in problem-solving.
</details>
(o)
Figure 7: More data examples from VisualProcessBench.
<details>
<summary>x29.png Details</summary>

### Visual Description
## Geometry Problem: Isosceles Triangle Angle Calculation
### Overview
The image presents a geometry problem involving an isosceles triangle ABC, where the measure of angle B is given as 25 degrees. The task is to find the measure of angle 1, which is an exterior angle at point C. The image includes a diagram of the triangle with labeled angles and a step-by-step solution process, indicating whether each step is correct or incorrect.
### Components/Axes
* **Title:** Data Example from VisualProcessBench
* **Question:** As shown in the figure, triangle ABC is an isosceles triangle, ∠B = 25°, then the angle of ∠1 is ()
* Options: A. 95° B. 145° C. 120° D. No correct answer
* **Diagram:**
* Triangle ABC with the following angles:
* ∠B = 25°
* ∠A = 130°
* ∠BCA = 25°
* Exterior angle at C (∠1)
* Adjacent angle to ∠1 on the straight line = 35°
* **Solution Steps:**
* Each step is labeled with "## Step-n (Status)", where n is the step number and Status is either "Neutral", "Correct", or "Incorrect".
* Each step includes a textual explanation of the reasoning.
### Detailed Analysis or ### Content Details
**Diagram Details:**
* Triangle ABC is depicted with vertex A at the top, B at the bottom-left, and C at the bottom-right.
* Angle B is marked as 25°.
* Angle A is marked as 130°.
* Angle BCA is marked as 25°.
* Angle 1 is the exterior angle at point C.
* The angle adjacent to angle 1 on the straight line is marked as 35°.
**Solution Steps Transcription:**
* **Step-0 (Neutral):** "Let's see. I have this problem here about an isosceles triangle. It says that in triangle ABC, angle B is 25 degrees, and it's an isosceles triangle. I need to find the measure of angle 1. The options are A. 95°, B. 145°, C. 120°, and D. no correct answer."
* **Step-1 (Correct):** "First, I should recall that in an isosceles triangle, two sides are equal in length, and the angles opposite those sides are also equal. So, in triangle ABC, if angle B is 25 degrees, then either angle A or angle C is also 25 degrees. But wait, looking at the figure, it seems like angle C is also 25 degrees because there's a mark indicating that angle C is 25 degrees."
* **Step-4 (Incorrect):** "So, if angle 1 is an exterior angle at point C, then its measure should be equal to the sum of the measures of the angles at points A and B. Given that angle A is 130 degrees and angle B is 25 degrees, then angle 1 should be 130°+25°=155°."
* **Step-5 (Correct):** "But that's not one of the options. Option D says "no correct answer," but maybe I'm missing something. Wait, perhaps the figure is not exactly as I described. Let me look again."
* **Step-12 (Correct):** "Looking back at the figure, if angle 1 is the exterior angle at C, which is 155 degrees, and there's a 35-degree angle adjacent to it on the straight line, then the angle between them would be 155° - 35° = 120°. Therefore, angle 1 is 120 degrees. Yes, that makes sense. So, the answer should be option C. 120°."
* **Final Answer:** \[\boxed{C} \]
### Key Observations
* The solution initially makes an incorrect assumption about how to calculate the exterior angle.
* The solution corrects itself by considering the adjacent angle on the straight line.
* The final answer is C. 120°.
### Interpretation
The problem demonstrates step-by-step geometric reasoning with self-correction. The solution first applies the exterior angle theorem to compute the full exterior angle at C (130° + 25° = 155°), then, after rereading the figure, subtracts the adjacent 35° angle to obtain ∠1 = 155° - 35° = 120°, matching option C. It tests understanding of isosceles triangles and exterior angles: the initial incorrect step shows a common mistake (treating ∠1 as the entire exterior angle), while the subsequent correction demonstrates reflection, with the model revisiting the figure and revising its answer.
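The angle arithmetic of the transcribed reasoning, in a short sketch (the degree values are as read off the figure in this description):

```python
# Angle values read off the figure (degrees).
angle_B = 25
angle_A = 130
adjacent = 35  # angle next to angle 1 on the straight line

# Exterior angle theorem: the exterior angle at C equals the sum of
# the two remote interior angles.
exterior_at_C = angle_A + angle_B  # 155

# Per Step-12, angle 1 is the exterior angle minus the adjacent 35 degrees.
angle_1 = exterior_at_C - adjacent
print(exterior_at_C, angle_1)  # 155 120 -> option C
```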
</details>
Figure 8: Data example with model reflection from VisualProcessBench. Red highlights the incorrect answer, orange highlights the reflection words, and green highlights the correct answer.