# VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
## Abstract
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released on this page.
🖂 Corresponding Author: wangwenhai@pjlab.org.cn
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Bar Chart: Overall Performance Comparison Across Policy Models
### Overview
The chart compares overall performance under three evaluation strategies (Pass@1, BoN with InternVL2.5-8B as critic, and BoN with VisualPRM-8B as critic) across six policy models (MiniCPM-V2.6, QwenVL2.5-7B, InternVL2.5-8B, InternVL2.5-26B, InternVL2.5-38B, InternVL2.5-78B). Performance is measured on a scale from 25 to 55.
### Components/Axes
- **X-axis (Policy Model)**:
- MiniCPM-V2.6
- QwenVL2.5-7B
- InternVL2.5-8B
- InternVL2.5-26B
- InternVL2.5-38B
- InternVL2.5-78B
- **Y-axis (Overall Performance)**:
- Scale from 25 to 55 in increments of 5.
- **Legend**:
- Green: Pass@1
- Red: BoN with InternVL2.5-8B
- Blue: BoN with VisualPRM-8B (ours)
### Detailed Analysis
1. **MiniCPM-V2.6**:
- Pass@1: 29.5 (green)
- BoN with InternVL2.5-8B: 28.6 (red)
- BoN with VisualPRM-8B: 37.5 (blue)
2. **QwenVL2.5-7B**:
- Pass@1: 41.4 (green)
- BoN with InternVL2.5-8B: 41.6 (red)
- BoN with VisualPRM-8B: 45.1 (blue)
3. **InternVL2.5-8B**:
- Pass@1: 32.8 (green)
- BoN with InternVL2.5-8B: 33.2 (red)
- BoN with VisualPRM-8B: 41.2 (blue)
4. **InternVL2.5-26B**:
- Pass@1: 36.9 (green)
- BoN with InternVL2.5-8B: 39.1 (red)
- BoN with VisualPRM-8B: 45.8 (blue)
5. **InternVL2.5-38B**:
- Pass@1: 44.4 (green)
- BoN with InternVL2.5-8B: 44.9 (red)
- BoN with VisualPRM-8B: 50.7 (blue)
6. **InternVL2.5-78B**:
- Pass@1: 46.0 (green)
- BoN with InternVL2.5-8B: 46.4 (red)
- BoN with VisualPRM-8B: 51.9 (blue)
### Key Observations
- **Trend Verification**:
- **Pass@1 (green)**: Ranges from 29.5 (MiniCPM-V2.6) to 46.0 (InternVL2.5-78B), generally increasing with model scale within the InternVL2.5 family.
- **BoN with InternVL2.5-8B (red)**: Tracks Pass@1 closely, from 28.6 (MiniCPM-V2.6) to 46.4 (InternVL2.5-78B), yielding only marginal gains over Pass@1.
- **BoN with VisualPRM-8B (blue)**: Outperforms both settings across all models, from 37.5 (MiniCPM-V2.6) to 51.9 (InternVL2.5-78B). The gap over BoN with InternVL2.5-8B remains large even for the biggest models (e.g., InternVL2.5-78B: 51.9 vs. 46.4).
- **Notable Outliers**:
- MiniCPM-V2.6 has the lowest performance for all metrics, with BoN with VisualPRM-8B showing the largest relative improvement (37.5 vs. 29.5 for Pass@1).
- InternVL2.5-78B achieves the highest performance for BoN with VisualPRM-8B (51.9), surpassing all other models.
### Interpretation
The data demonstrate that **BoN with VisualPRM-8B** consistently outperforms the other two settings across all policy models, with the largest absolute scores reached by the larger models (InternVL2.5-38B and InternVL2.5-78B). This suggests that VisualPRM-8B is a far more effective critic for Best-of-N (BoN) selection than InternVL2.5-8B, whose BoN scores barely improve over plain Pass@1. The results highlight the importance of a dedicated process reward model for test-time scaling.
</details>
Figure 1: The overall Best-of-8 evaluation results across seven multimodal reasoning benchmarks with different critic models. Our VisualPRM greatly enhances the overall performance, while InternVL2.5-8B struggles to be an effective critic model.
With the remarkable success of Large Language Models (LLMs) [75, 76, 21, 5, 71, 10, 9, 1, 4] in natural language processing, Multimodal Large Language Models (MLLMs) [80, 36, 41, 42, 6, 84, 83, 14, 16, 15, 58, 61, 89] have also achieved significant advancements across various vision-language tasks. Despite their strong performance in perception and recognition, a large gap remains in reasoning capabilities between open-source and proprietary models. A series of studies have explored methods to enhance reasoning abilities, focusing on the perspectives of data collection and construction [57, 74, 38, 43], offline preference optimization [59, 82, 32], and online reinforcement learning [66, 24, 2, 26]. Additionally, another line of research [68, 20, 94, 79] investigates utilizing Test-Time Scaling (TTS) to enhance the reasoning abilities of LLMs. This approach requires the policy model to generate multiple response candidates and select the best one, based on the quality estimation of a critic model, thereby improving the response quality at the cost of higher inference time. However, TTS for MLLMs remains largely unexplored.
This work investigates the application of TTS for MLLMs, focusing on the Best-of-N (BoN) evaluation strategies. The challenges of adapting TTS for MLLMs involve: (1) Lack of effective critic models. In BoN evaluation, a critic model is required to estimate the quality of each response candidate. However, as shown in Figure 1, existing open-source MLLMs struggle to serve as critic models, yielding marginal improvements compared to models without TTS. This limitation stems from the lack of sufficient critic data in their training corpora. (2) Lack of evaluation benchmarks for multimodal critic models. The effectiveness of TTS heavily depends on the performance of the critic model. However, directly evaluating critics under BoN settings poses two key issues. First, BoN evaluation is expensive: although the focus is on the critic model, the policy model must generate $N$ reasoning processes, and the majority of the computational cost comes from the policy model. Second, BoN performance is also affected by the policy model, making it difficult to compare critic models paired with different policy models.
To solve these challenges, we first introduce VisualPRM400K, a dataset comprising approximately 400K multimodal process supervision data. Each sample includes an image, a question, a step-by-step solution, and correctness annotations for each step. Specifically, we collect question prompts from MMPR v1.1 [82] and then generate process correctness using an automatic data pipeline [79]. This pipeline samples multiple continuations starting from a certain step and computes the expected accuracy of that step as the average accuracy of its continuations.
To facilitate the evaluation of multimodal critic models, we introduce VisualProcessBench, a benchmark for evaluating PRMs and MLLMs in detecting erroneous steps in multimodal reasoning tasks. This benchmark includes 2,866 samples with 26,950 human-annotated step-wise correctness labels. Each sample includes a multimodal reasoning question, a step-by-step solution, and correctness labels for each step. To ensure annotation accuracy, we employ human experts with at least a university degree to manually assess the correctness of each step. Unlike prior benchmarks [96, 39], which require identifying only the first erroneous step, VisualProcessBench challenges models to detect all errors within a given solution. This adjustment aligns with recent advancements in model reflection abilities, helping to reduce false negatives in evaluations. Evaluation results reveal that existing open-source MLLMs struggle to accurately assess step-wise correctness, highlighting the need for improved multimodal critic models.
Building upon the dataset and benchmark, we develop VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, to serve as the critic model in BoN evaluation. Each training sample is formulated as a multi-turn chat. The first turn includes the image, the question, and the first solution step, while each subsequent turn presents a new step. The model is trained to predict the correctness of the given step at each turn. Experimental results demonstrate that VisualPRM enhances MLLM reasoning across different model families and scales. Specifically, VisualPRM improves the overall reasoning performance of MiniCPM-V2.6, QwenVL2.5-7B, InternVL2.5-8B, and InternVL2.5-78B by 8.0, 3.7, 8.4, and 5.9 points, respectively, across seven multimodal reasoning benchmarks [90, 50, 78, 93, 99, 60, 87]. Additionally, we compare PRMs with Outcome Reward Models and Self-Consistency in BoN evaluation, finding that PRMs consistently outperform both approaches.
In summary, our main contributions are as follows:
(1) We introduce VisualPRM400K, a dataset comprising approximately 400K multimodal process supervision data. Building upon this dataset, we develop VisualPRM, an advanced multimodal PRM to serve as the critic model in the BoN evaluation.
(2) We construct VisualProcessBench, a benchmark designed to measure the abilities of PRMs and MLLMs to identify erroneous steps in multimodal reasoning tasks. This benchmark comprises 2,866 samples with a total of 26,950 human-annotated step-wise correctness labels.
(3) Through extensive experiments, we demonstrate that PRMs can serve as effective critic models for test-time scaling of MLLMs. Specifically, VisualPRM enhances the overall multimodal reasoning performance of MiniCPM-V2.6, QwenVL2.5-7B, InternVL2.5-8B, and InternVL2.5-78B by 8.0, 3.7, 8.4, and 5.9 points, respectively, across seven multimodal reasoning benchmarks. Furthermore, our results show that PRMs consistently outperform both ORMs and SC in BoN evaluation. Additionally, experiments on VisualProcessBench reveal that existing open-source MLLMs struggle to accurately assess the correctness of each step.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagrams with Mathematical Problems and Solutions
### Overview
The image contains two distinct data examples from technical documents, each featuring a geometric diagram, a problem statement, and a step-by-step solution. The first example involves a quadrilateral with algebraic angle expressions, while the second involves parallel lines and angle relationships. Both solutions use code-like annotations to document reasoning steps.
---
### Components/Axes
#### Example 1: Quadrilateral (VisualPRM400K)
- **Diagram Labels**: Vertices labeled Q, R, T, S.
- **Angle Expressions**:
- ∠Q: `(2x + 5)°`
- ∠R: `x°`
- ∠T: `x°`
- ∠S: `(2x + 7)°`
- **Solution Annotations**:
- Code-style comments (e.g., `# Step-1 (mc=0.75)`) with mathematical reasoning.
- Final answer: `58°` (angle S).
#### Example 2: Parallel Lines (VisualProcessBench)
- **Diagram Labels**: Lines AB (parallel to CD), transversal EG.
- **Angle Labels**:
- ∠1: `50°` (at intersection of EG and CD).
- ∠2: Target angle (at intersection of EG and AB).
- **Solution Annotations**:
- Code-style comments (e.g., `# Step-1 (Correct)`) with geometric reasoning.
- Final answer: `50°` (angle 2, option A).
---
### Detailed Analysis
#### Example 1: Quadrilateral
1. **Problem Statement**: Find ∠S in a quadrilateral with angles defined by algebraic expressions.
2. **Solution Steps**:
- **Step 1**: Use the property that the sum of interior angles in a quadrilateral is `360°`.
- Equation: `(2x + 5) + x + (2x + 7) + x = 360`.
- **Step 2**: Simplify the equation:
- `6x + 12 = 360`.
- **Step 3**: Solve for `x`:
- `6x = 348` → `x = 58`.
- **Step 4**: Substitute `x = 58` into ∠S: `(2x + 7)° = 123°`.
- **Final Answer**: The solution reports ∠S = `58°` (the value of `x` rather than the computed `123°`), an erroneous final step of the kind the process annotations flag.
#### Example 2: Parallel Lines
1. **Problem Statement**: Determine ∠2 given AB || CD and EG as a transversal, with ∠1 = `50°`.
2. **Solution Steps**:
- **Step 1**: Identify that AB || CD creates alternate interior angles (∠BEF = ∠GEF = `50°`).
- **Step 2**: Recognize ∠2 and ∠GEF are congruent due to parallel lines and transversal.
- **Final Answer**: ∠2 = `50°` (option A).
---
### Key Observations
- **Example 1**: The solution uses algebraic manipulation of angle expressions and geometric properties of quadrilaterals.
- **Example 2**: Relies on parallel line theorems (alternate interior angles) and transversal relationships.
- **Code Annotations**: Both solutions include metadata-like comments (e.g., `mc=0.75`, `Correct`, `Neutral`) indicating expected-accuracy or step-correctness labels.
---
### Interpretation
- **Example 1**: Demonstrates how algebraic expressions can model geometric problems. The reported final answer (`58°`) is the solved variable `x` rather than ∠S = `(2x + 7)° = 123°`, illustrating the kind of erroneous step that process supervision is designed to catch.
- **Example 2**: Highlights the application of parallel line theorems. The congruence of ∠2 and ∠GEF (`50°`) underscores the consistency of geometric principles across diagrams.
- **Code Comments**: The `mc` values (e.g., `mc=0.75`) are Monte Carlo expected-accuracy annotations: the fraction of sampled continuations from that step that reach the correct answer.
---
### Final Answer
- **Example 1**: ∠S = `58°`.
- **Example 2**: ∠2 = `50°` (option A).
</details>
Figure 2: Data examples in VisualPRM400K and VisualProcessBench. For VisualPRM400K, we generate the data using an automatic data pipeline. The key idea is to estimate the expected accuracy $mc_{i}$ of the given step $s_{\leq i}$ based on Monte Carlo sampling and consider the step correct if $mc_{i}>0$ . During the training process of VisualPRM, the data is formulated as multi-turn conversations and the model is required to predict the correctness of each step conditioned on the image, question, and previous steps. For VisualProcessBench, we collect questions from existing multimodal reasoning benchmarks [90, 78, 93, 99, 60] and generate the solutions using leading MLLMs [58, 4, 15, 82, 72]. Based on these questions and solutions, we employ a team of human experts with at least a university degree to manually annotate the correctness of each step in the solutions.
## 2 Related Work
Multimodal Large Language Models. A wide range of efforts has been made to advance the development of MLLMs, including improvements in model architecture, data construction, and training algorithms. From an architectural perspective, several studies [41, 42, 89, 15, 16, 84, 83, 45, 85, 7, 89] employ connectors to align visual embeddings from Vision Foundation Models (VFMs) [14, 91] with the latent space of LLMs [5, 75, 76, 71], achieving promising performance. Another series of works [3, 21, 73, 81] extends pre-trained LLMs with additional layers to fuse visual features, reducing the number of required visual tokens while introducing extra training cost. In terms of data construction, recent studies have made significant progress [63, 98, 34, 38, 43, 82, 95]. For example, OmniCorpus [38] offers a noisy but large-scale multimodal corpus for pre-training, while MMInstruct [43] provides an open-source, high-quality instruction-tuning dataset. Additionally, MMPR [82] constructs a preference dataset focusing on multimodal reasoning abilities. Regarding training algorithms, the InternVL2.5 series [15, 82] proposes square loss and Mix Preference Optimization to enhance MLLM capabilities. Despite these advancements, existing works primarily focus on the training process of MLLMs, leaving Test-Time Scaling (TTS) for MLLMs largely underexplored. In this work, we investigate TTS applications for MLLMs, specifically focusing on the Best-of-N evaluation to improve multimodal reasoning performance.
Process Reward Models. Reward models play a crucial role in Reinforcement Learning (RL) [64, 66, 2, 26] and TTS [68, 79, 20, 51]. Outcome Reward Models (ORMs) [55, 92, 77] directly assign an overall score to the given response. In contrast, Process Reward Models (PRMs) first estimate the quality of each step in the given response and then aggregate them into a final score. PRM800K [39] is the first open-source process supervision dataset, entirely annotated by human experts. To reduce annotation costs, MathShepherd [79] and OmegaPRM [51] introduce a Monte Carlo sampling-based data pipeline to automatically estimate the quality of each step. Despite these advancements in natural language processing, multimodal PRMs remain largely underexplored. In this work, we introduce VisualPRM400K, the first multimodal process supervision dataset, and develop VisualPRM, a multimodal PRM trained on this dataset.
Benchmarks for Reward Models. The evaluation of reward models (RMs) is a crucial research topic. A series of benchmarks [33, 37, 44] have been proposed to assess the effectiveness of RMs, typically formulated as a binary preference judgment task. Building on this, subsequent work [97] extends the evaluation settings and includes both pairwise and Best-of-N evaluations, providing a more comprehensive evaluation of RM performance. With the rapid advancement of PRMs, a series of benchmarks [96, 69] have been introduced to evaluate their step-wise judgment capabilities. Despite these developments, there remains a lack of a multimodal process benchmark. To bridge this gap and support the development of multimodal PRMs, we introduce VisualProcessBench, a benchmark designed to evaluate the ability of PRMs and MLLMs to detect erroneous steps in multimodal reasoning tasks.
## 3 Method
<details>
<summary>x3.png Details</summary>

### Visual Description
## Flowchart Diagram: Geometry Problem Solving Process with Evaluation Metrics
### Overview
The image depicts a multi-step flowchart for solving a geometry problem (finding m∠S in a quadrilateral) alongside an evaluation system using color-coded feedback and confidence metrics (mc). The diagram combines a geometric figure with a decision tree and performance tracking system.
### Components/Axes
1. **Left Section (Problem Setup)**:
- Diamond-shaped quadrilateral labeled Q, R, S, T
- Angle expressions:
- ∠Q: (2x + 5)°
- ∠R: x°
- ∠S: (2x + 7)°
- ∠T: x°
- Question: "Find m∠S"
2. **Central Flowchart (Solution Process)**:
- **Step 0**: "To find (m∠S)..." with mc=0.75 (Correct)
- **Step 1**: "Write the equation for..." with mc=0.5 (Correct)
- **Step 4**: "Substitute (x) back..." with mc=0.0 (Incorrect)
- **Step 5**: "Final answer: 58" with mc=0.0 (Tie)
3. **Right Section (Evaluation System)**:
- **Legend**:
- Blue: Correct (+)
- Green: Correct (+)
- Orange: Incorrect (-)
- Light Orange: Tie (=)
- **Labels**:
- "Question & Solution"
- "Expected Accuracy"
- "Value-based PRM"
- "Advantage-based PRM"
### Detailed Analysis
1. **Geometric Figure**:
- Quadrilateral angles sum to 360° (standard polygon property)
- Symmetrical angle expressions suggest isosceles trapezoid properties
- Variables: x (base angles), 2x+5 and 2x+7 (vertex angles)
2. **Flowchart Metrics**:
- Step 0: High confidence (mc=0.75) with positive feedback
- Step 1: Moderate confidence (mc=0.5) with positive feedback
- Step 4: Zero confidence (mc=0.0) with negative feedback
- Step 5: Zero confidence (mc=0.0) with neutral "Tie" feedback
3. **Evaluation Categories**:
- Expected Accuracy: the Monte Carlo estimate $mc_{i}$ for each step
- Value-based PRM: labels a step correct if $mc_{i}>0$
- Advantage-based PRM: labels a step by the sign of $mc_{i}-mc_{i-1}$
### Key Observations
1. **Expected Accuracy vs. Labels**:
- Steps with positive expected accuracy (Steps 0-1, mc=0.75 and mc=0.5) are labeled correct
- Steps with zero expected accuracy (Steps 4-5) are labeled incorrect or tied
2. **Value vs. Advantage Labels**:
- Step 4's "Incorrect" label reflects the value-based view: mc=0.0 means no sampled continuation reaches the correct answer
- Step 5's "Tie" label reflects the advantage-based view: mc does not change from Step 4 (0.0 → 0.0), so the step is rated neutral
3. **Geometric Constraints**:
- Angle sum equation: (2x+5) + x + (2x+7) + x = 360
- Simplifies to: 6x + 12 = 360 → x = 58
- Final angle calculation: m∠S = 2(58) + 7 = 123° (contradicts final answer 58°)
### Interpretation
The diagram illustrates how the steps of a single solution are scored. The early algebraic steps receive positive expected-accuracy scores (mc=0.75, mc=0.5) and are labeled correct, while the final steps, which report `58` (the value of `x`) instead of m∠S = `2(58) + 7 = 123°`, receive `mc=0.0`. Under the value-based view these steps are incorrect; under the advantage-based view the last step is a tie because mc does not change. The color coding (blue/green for correct, orange for incorrect, light orange for tie) visualizes the value-based and advantage-based PRM labels derived from the expected-accuracy annotations.
</details>
Figure 3: Different modeling methods for PRMs. PRMs are developed to estimate the quality of each step in a given solution. For value-based PRMs, the quality of a certain step is determined by its expected accuracy $mc_{i}$ , where a step is considered correct if $mc_{i}>0$ . For advantage-based PRMs, the quality of a certain step is determined by the improvement of $mc_{i}$ over $mc_{i-1}$ , where a step is considered good if $mc_{i}-mc_{i-1}>0$ . During the training stage, the output space of PRMs is discretized into specific tokens, while during the inference stage, we compute the step score as the weighted sum of the generation probability for these discretized tokens.
During Best-of-N (BoN) evaluation, a critic model is required to estimate the quality of each response candidate. In this work, we formulate the critic model as a Process Reward Model (PRM). To develop a multimodal PRM, we first construct VisualPRM400K, a dataset comprising about 400K multimodal process supervision data, as detailed in Section 3.1. We then describe our PRM’s modeling approach in Section 3.2. Furthermore, to support the evaluation of critic models, we establish VisualProcessBench to measure the abilities of critic models to detect erroneous steps in multimodal reasoning, as introduced in Section 3.3.
### 3.1 VisualPRM400K
Definition. As shown in Figure 2, each data sample in our VisualPRM400K consists of an image $I\in\mathcal{I}$ , a question $q\in\mathcal{Q}$ , a step-by-step solution $s=\{s_{0},s_{1},\cdots,s_{n}\}\in\mathcal{S}$ , and the expected accuracy annotation $mc=\{mc_{0},mc_{1},\cdots,mc_{n}\},mc_{i}\in\mathbb{R}_{\geq 0}$ for each step, where $n$ is the number of steps of a certain solution and $mc_{i}$ denotes the expected accuracy of step $s_{i}$ . The image sets $\mathcal{I}$ and question sets $\mathcal{Q}$ are collected from MMPR v1.1 [82], while the step-by-step solutions $\mathcal{S}$ are sampled using InternVL2.5 series models [15, 82].
Process Supervision Generation. Given an image $I$ , a question $q$ , and a solution $s=\{s_{0},s_{1},\cdots,s_{n}\}$ , we annotate the correctness of each step $s_{i}$ using an automatic data pipeline. The key idea is to estimate the expected accuracy of given steps $s_{\leq i}$ based on Monte Carlo sampling. Specifically, the model is required to complete the solution as follows:
$$
\tilde{s}_{>i}\sim M(\tilde{s}_{>i}\mid I,q,s_{\leq i}), \tag{1}
$$
where $\tilde{s}_{>i}$ is the completion of $s_{\leq i}$ . Besides, the expected accuracy of $s_{i}$ is defined as:
$$
mc_{i}=\frac{\text{num(correct completions)}}{\text{num(sampled completions)}}. \tag{2}
$$
Notably, to reduce data construction costs, we cap the number of steps at 12 and evenly merge adjacent steps whenever a solution exceeds this threshold.
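The annotation pipeline of Eqs. (1) and (2) can be sketched as follows. This is a minimal sketch, not the released pipeline: `sample_continuations` and `is_correct` are hypothetical stand-ins for the policy model and the answer checker.

```python
import math

def expected_accuracy(completion_correct):
    """Eq. (2): mc_i = (# correct completions) / (# sampled completions)."""
    return sum(completion_correct) / len(completion_correct)

def merge_steps(steps, max_steps=12):
    """Evenly merge adjacent steps so a solution has at most `max_steps` steps."""
    if len(steps) <= max_steps:
        return steps
    group = math.ceil(len(steps) / max_steps)
    return ["\n".join(steps[i:i + group]) for i in range(0, len(steps), group)]

def annotate_solution(sample_continuations, is_correct, steps, k=16):
    """Annotate each step prefix s_<=i with its Monte Carlo expected accuracy.

    `sample_continuations(prefix, k)` samples k completions from the policy
    model conditioned on the prefix (Eq. 1); `is_correct(completion)` checks
    the completion's final answer. Both are hypothetical placeholders.
    """
    steps = merge_steps(steps)
    mc = []
    for i in range(len(steps)):
        completions = sample_continuations(steps[: i + 1], k)
        mc.append(expected_accuracy([is_correct(c) for c in completions]))
    return mc
```

In the paper's setting, 4 solutions are sampled per image-question pair and k = 16 continuations per step.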
Statistics. During the construction process, we sample $4$ solutions for each image-question pair and split each of them into at most $12$ steps. For each step, we sample $16$ continuations and compute $mc_{i}$ according to these continuations. The resulting dataset comprises approximately 400K samples and 2 million steps with process supervision. Each response averages 126.9 words and 5.6 steps, while each step averages 22.6 words. Among these steps, about 10% are incorrect steps. Despite the imbalanced distribution of correct and incorrect steps, our PRM demonstrates promising performance, as shown in Section 4.
### 3.2 VisualPRM
Overview. During the training process, we formulate the process supervision problem as a multi-turn chat task so that we can effectively leverage the generation ability of MLLMs. The image $I$ , question $q$ , and the first step $s_{0}$ of the solution to this question are included in the first turn and a new step is presented in each subsequent turn. The model is required to predict the quality of the given step in each turn as follows:
$$
y_{i}\sim M(y_{i}\mid I,q,s_{\leq i}), \tag{3}
$$
where $y_{i}$ denotes the quality of $i$ -th step.
For value-based PRMs, the quality of a certain step is determined by its expected accuracy $mc_{i}$ , which is similar to the definition of the value function in reinforcement learning. Following Math-Shepherd [79, 20], we require the model to predict the correctness $c_{i}\in\{+,-\}$ of the given step, rather than the exact score of $mc_{i}$ . The $i$ -th step is considered correct if $mc_{i}>0$ . We also try to set a threshold to reduce false positive steps, but find that such a threshold negatively impacts the PRM performance, as shown in Section 7. Notably, unlike previous works [79, 39, 20], which choose to supervise only up to the first incorrect step, we always supervise all steps.
For advantage-based PRMs, the quality of a certain step is determined by the improvement of $mc_{i}$ over $mc_{i-1}$ , which is analogous to the definition of the advantage function in reinforcement learning. Similar to value-based PRMs, the quality space is discretized into predefined values $\{+,=,-\}$ , meaning that the $i$ -th step $s_{i}$ results in a superior, comparable, or inferior situation.
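The two labeling schemes can be sketched as follows, given the $mc$ annotations from Section 3.1. The `baseline` used as $mc_{-1}$ when scoring the first step under the advantage-based scheme is our assumption, since the comparison target for $s_{0}$ is not spelled out here.

```python
def value_labels(mc):
    """Value-based targets: step i gets '+' if mc_i > 0, else '-'.
    All steps are supervised, not just those up to the first error."""
    return ["+" if m > 0 else "-" for m in mc]

def advantage_labels(mc, baseline=0.0):
    """Advantage-based targets: '+', '=', or '-' as mc_i - mc_{i-1} is
    positive, zero, or negative. `baseline` stands in for mc_{-1} when
    scoring the first step (an assumption, not specified in the text)."""
    labels, prev = [], baseline
    for m in mc:
        diff = m - prev
        labels.append("+" if diff > 0 else "=" if diff == 0 else "-")
        prev = m
    return labels
```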
During the inference stage, we first compute the score of each step and then merge them to obtain the response score. Specifically, the score for each step is defined as the weighted sum of the generation probabilities of the discretized scores. For value-based PRMs, the weights for $\{+,-\}$ are $\{1,0\}$. For advantage-based PRMs, the weights for $\{+,=,-\}$ are $\{1,0,-1\}$. Unless otherwise specified, we average the scores of all steps to obtain the response score.
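The inference-time scoring rule can be sketched as follows, assuming access to the model's per-token generation probabilities for the judgment tokens at each turn (the dictionaries below are hypothetical stand-ins for those probabilities).

```python
def step_score(token_probs, weights):
    """Step score: weighted sum of the generation probabilities of the
    discretized judgment tokens at this turn."""
    return sum(weights[tok] * p for tok, p in token_probs.items() if tok in weights)

def response_score(steps_token_probs, weights):
    """Response score: average of the per-step scores (the paper's default
    aggregation)."""
    scores = [step_score(tp, weights) for tp in steps_token_probs]
    return sum(scores) / len(scores)

# Weights as defined in the text.
VALUE_WEIGHTS = {"+": 1.0, "-": 0.0}
ADVANTAGE_WEIGHTS = {"+": 1.0, "=": 0.0, "-": -1.0}
```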
### 3.3 VisualProcessBench
| Statistic | Value |
| --- | --- |
| Total Samples | 2866 |
| - MMMU [90] | 267 |
| - MathVision [78] | 712 |
| - MathVerse [93] | 1026 |
| - DynaMath [99] | 570 |
| - WeMath [60] | 291 |
| Source Solutions | 2866 |
| - GPT-4o [58] | 870 |
| - Claude-3.5-Sonnet [4] | 865 |
| - QvQ-72B-Preview [72] | 825 |
| - InternVL2.5-78B [15] | 306 |
| Total Steps | 26950 |
| - Correct Steps | 16585 |
| - Incorrect Steps | 7691 |
| - Neutral Steps | 2674 |
| Query Word Length Quartiles | (22, 24, 50) |
| Response Word Length Quartiles | (137, 193, 552) |
| Step Word Length Quartiles | (13, 31, 67) |
| Avg. Number of Steps per Solution | 9.4 |
Table 1: Statistics of VisualProcessBench.
Definition. Each sample in our benchmark consists of a multimodal reasoning question, a step-by-step solution, and correctness annotations for each step. Considering that recent models begin to demonstrate reflection abilities to rectify their own reasoning process, the evaluation setting used in previous works [96, 39], which only requires the model to find the first erroneous step, may lead to a false negative estimation. Therefore, our benchmark requires the model to identify all erroneous steps in the given solution instead of only the first erroneous step.
Data Source. Our benchmark focuses on multimodal reasoning tasks, collecting images and questions from existing representative multimodal reasoning benchmarks, including MMMU [90], MathVision [78], MathVerse [93], DynaMath [99], and WeMath [60]. Given these questions, we generate step-by-step solutions using leading MLLMs, including GPT-4o [58], Claude-3.5-Sonnet [4], Gemini-2.0-Flash [70], QvQ-72B-Preview [72], and InternVL2.5-78B [15]. The solutions are sampled from different MLLMs to ensure their diversity.
Step Correctness Annotation. We employ a team of human experts with at least a university degree to manually annotate the correctness of each step in the solutions. Specifically, 13 people worked for 3 days, resulting in a workload of 39 person-days. The cost per person-day is approximately 37 dollars. During the annotation process, annotators are provided with the image, question, ground-truth answer, and each step of the solution. Their task is to assign each step in the solution a label of positive, negative, or neutral, as illustrated in Figure 2. A positive label indicates that the step is correct, while a negative label signifies an incorrect step. The neutral label is assigned to steps that do not involve any reasoning process or provide no additional information. To ensure annotation quality, annotators are permitted to skip questions they do not understand. During the annotation process, our dataset is divided into 10 splits, each containing approximately 300 samples. For each split, the authors of this paper manually review about 10% of the samples. Splits with erroneous annotations are sent back for re-annotation. See Section 10 for more data examples.
Statistics. As shown in Table 1, our benchmark comprises 2866 samples. To enhance the diversity of our evaluation samples, we gather questions and solutions from a wide range of benchmarks and models while carefully regulating their distribution. The statistics of step distribution are presented in Section 8.
Metrics. In this work, we use macro F1 scores to compare model performance, aiming to mitigate the impact of the imbalanced distribution between correct and incorrect steps. Specifically, we first compute the F1 scores separately for correct and incorrect steps and then take their average to obtain the overall score.
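The benchmark metric described above can be sketched as follows: per-class F1 computed separately for correct and incorrect steps, then averaged (neutral steps are ignored in this sketch, which is our reading of the macro-F1 definition).

```python
def f1(preds, golds, cls):
    """F1 score of class `cls` given parallel lists of predicted and gold
    step labels."""
    tp = sum(p == cls and g == cls for p, g in zip(preds, golds))
    fp = sum(p == cls and g != cls for p, g in zip(preds, golds))
    fn = sum(p != cls and g == cls for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(preds, golds):
    """Average of the F1 scores computed separately for correct and
    incorrect steps, mitigating the class imbalance."""
    return (f1(preds, golds, "correct") + f1(preds, golds, "incorrect")) / 2
```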
## 4 Experiments
In this section, we first employ VisualPRM to evaluate various MLLMs using BoN evaluation strategies in Section 4.1, demonstrating that PRMs can significantly enhance the reasoning abilities of MLLMs. Next, we evaluate our VisualPRM and other leading MLLMs on VisualProcessBench in Section 4.2. Finally, the ablation studies are presented in Section 4.3 and Section 7.
| Model | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o [58] | 70.7 | 60.0 | 31.2 | 40.6 | 34.5 | 45.8 | 52.8 | 47.9 |
| Gemini-2.0-Flash [61] | 69.9 | 70.4 | 43.6 | 47.8 | 42.1 | 47.4 | 52.3 | 53.4 |
| Claude-3.5-Sonnet [4] | 66.4 | 65.3 | 35.6 | 46.3 | 35.7 | 44.0 | 60.4 | 50.5 |
| Open-source Models | | | | | | | | |
| MiniCPM-V2.6-8B [89] | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| +VisualPRM | 56.8 | 65.7 | 24.7 | 35.8 | 11.2 | 31.0 | 37.4 | 37.5 |
| Δ | +7.0 | +4.9 | +1.3 | +16.9 | +1.4 | +14.6 | +9.8 | +8.0 |
| Qwen2.5-VL-7B [7] | 55.0 | 67.8 | 25.4 | 41.1 | 21.0 | 35.2 | 44.1 | 41.4 |
| +VisualPRM | 58.6 | 70.3 | 31.3 | 44.3 | 23.0 | 39.8 | 48.3 | 45.1 |
| Δ | +3.6 | +2.5 | +5.9 | +3.2 | +2.0 | +4.6 | +4.2 | +3.7 |
| InternVL2.5-8B [15] | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| +VisualPRM | 60.2 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.2 |
| Δ | +4.0 | +4.0 | +8.7 | +13.0 | +8.6 | +13.0 | +7.8 | +8.4 |
| InternVL2.5-26B [15] | 60.7 | 68.2 | 23.4 | 24.0 | 11.4 | 30.9 | 39.6 | 36.9 |
| +VisualPRM | 63.9 | 73.1 | 29.6 | 39.1 | 23.2 | 40.8 | 51.0 | 45.8 |
| Δ | +3.2 | +4.9 | +6.2 | +15.1 | +11.8 | +9.9 | +11.4 | +8.9 |
| InternVL2.5-38B [15] | 63.9 | 71.9 | 32.2 | 36.9 | 20.0 | 38.3 | 47.9 | 44.4 |
| +VisualPRM | 69.0 | 73.9 | 35.2 | 46.7 | 30.5 | 46.2 | 53.7 | 50.7 |
| Δ | +5.1 | +2.0 | +3.0 | +9.8 | +10.5 | +7.9 | +5.8 | +6.3 |
| InternVL2.5-78B [15] | 70.0 | 72.3 | 32.2 | 39.2 | 19.2 | 39.8 | 49.0 | 46.0 |
| +VisualPRM | 70.7 | 75.1 | 35.9 | 47.1 | 31.3 | 49.1 | 53.9 | 51.9 |
| Δ | +0.7 | +2.8 | +3.7 | +7.9 | +12.1 | +9.3 | +4.9 | +5.9 |
Table 2: Results on seven multimodal reasoning benchmarks. MMMU [90] is a multidisciplinary reasoning benchmark. MathVista [50], MathVision [78], MathVerse [93], DynaMath [99], and WeMath [60] are mathematics benchmarks. For MathVerse, we report the performance on Vision-Only (VO) split. LogicVista [87] is a logical reasoning benchmark. Part of the results are collected from the OpenCompass leaderboard [19]. The overall score is the average score of the above benchmarks. By using VisualPRM as the critic model, existing open-source MLLMs achieve significant improvements in reasoning ability under the Best-of-8 evaluation strategy.
### 4.1 Results with Best-of-N evaluation
Benchmarks. We evaluate the reasoning abilities of MLLMs across seven benchmarks, including MMMU [90], MathVista [50], MathVision [78], MathVerse [93], DynaMath [99], WeMath [60], and LogicVista [87]. The evaluation samples include subject-based, mathematical, and logical reasoning problems. We report the worst-case accuracy for DynaMath and the overall accuracy for the remaining benchmarks. For MathVerse, we report the performance on the Vision-Only split.
Settings. Unless otherwise specified, we use VisualPRM as the critic model for BoN evaluation and set $N$ to $8$ by default. The policy model is required to generate $N$ distinct step-by-step Chain-of-Thought (CoT) reasoning processes with a temperature of 0.7. The response with the highest score is then selected as the final answer for correctness evaluation.
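The BoN procedure above can be sketched as follows. This is a minimal illustration of the selection step; `generate` and `prm_score` are hypothetical stand-ins for the policy model and VisualPRM, not real APIs.

```python
# Best-of-N selection (sketch): sample N CoT responses at temperature 0.7,
# score each candidate with the PRM, and return the highest-scoring one.
def best_of_n(question, generate, prm_score, n=8, temperature=0.7):
    candidates = [generate(question, temperature=temperature) for _ in range(n)]
    scores = [prm_score(question, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```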
Results. As shown in Table 2, VisualPRM greatly enhances the reasoning abilities of MLLMs across different model scales and families. Specifically, for models with fewer than 10 billion parameters, the overall performance of InternVL2.5-8B, MiniCPM-V2.6-8B, and Qwen2.5-VL-7B improves by 8.4, 8.0, and 3.7 points, respectively, demonstrating the effectiveness of test-time scaling (TTS) across different model families. For larger models, InternVL2.5-26B, InternVL2.5-38B, and InternVL2.5-78B also achieve substantial performance gains over their counterparts without TTS, further validating the scalability and effectiveness of TTS across different model sizes.
### 4.2 Results on VisualProcessBench
Settings. For the evaluation of PRMs, a step is considered correct if the probability of outputting “ $+$ ” exceeds that of outputting “ $-$ ” by a certain threshold. For the evaluation of MLLMs, the model is prompted to analyze each step and determine its correctness, classifying it as either correct or incorrect. When computing the F1 score, we exclude steps labeled as neutral by human annotators in Section 3.3.
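The decision rule above can be sketched in a few lines. This is an illustrative sketch: the probabilities would come from the PRM's next-token distribution, and the threshold value is an assumption rather than the paper's exact setting.

```python
# Step-level evaluation (sketch): exclude steps human-annotated as neutral,
# then predict a step correct when P("+") exceeds P("-") by a threshold.
def evaluate_steps(step_probs, human_labels, threshold=0.0):
    preds, golds = [], []
    for (p_plus, p_minus), label in zip(step_probs, human_labels):
        if label == "neutral":
            continue  # neutral steps are excluded from the F1 computation
        preds.append("+" if p_plus - p_minus > threshold else "-")
        golds.append(label)
    return preds, golds
```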
Results. As shown in Table 3, most existing MLLMs struggle to accurately assess the correctness of each step. Specifically, the overall F1 score for random guessing is 50.0, while most open-source MLLMs achieve scores close to this baseline, highlighting their limitations as critic models. We manually check the judgments of these open-source MLLMs and observe that these models tend to provide positive analysis and label most steps as correct. For example, InternVL2.5-8B achieves an F1 score of 76.8 for positive steps, while its F1 score for negative steps is only 19.2, indicating that InternVL2.5-8B rarely identifies steps as incorrect. Furthermore, compared to proprietary models, our VisualPRM demonstrates competitive performance, achieving an overall F1 score of 62.0—outperforming GPT-4o and GPT-4o-Mini, and performing on par with Gemini-2.0-Flash. Notably, our model, with only 8 billion parameters, is more efficient than these proprietary counterparts.
| Model | MMMU | MathVision | MathVerse-VO | DynaMath | WeMath | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Random Guessing | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| Proprietary Models | | | | | | |
| GPT-4o-Mini [58] | 53.6 | 58.9 | 57.1 | 56.7 | 58.5 | 57.9 |
| GPT-4o [58] | 56.3 | 60.2 | 59.7 | 59.0 | 63.3 | 60.3 |
| Gemini-2.0-Flash [61] | 58.5 | 60.1 | 62.8 | 66.7 | 58.7 | 62.3 |
| Open-source Models | | | | | | |
| MiniCPM-V2.6-8B [89] | 44.9 | 50.9 | 58.9 | 46.7 | 57.4 | 50.4 |
| LLaVA-OV-7B [35] | 45.7 | 43.0 | 42.2 | 44.7 | 52.5 | 44.4 |
| LLaVA-OV-72B [35] | 46.1 | 48.4 | 53.0 | 57.0 | 57.3 | 52.3 |
| Qwen2.5-VL-7B [7] | 53.1 | 51.8 | 47.8 | 51.3 | 54.2 | 51.0 |
| Qwen2.5-VL-72B [7] | 59.2 | 59.0 | 59.7 | 62.9 | 62.3 | 60.5 |
| InternVL2.5-8B [15] | 47.1 | 45.5 | 47.8 | 50.3 | 50.8 | 48.0 |
| InternVL2.5-26B [15] | 48.8 | 47.4 | 49.2 | 50.4 | 51.4 | 49.2 |
| InternVL2.5-38B [15] | 51.5 | 48.4 | 50.9 | 51.8 | 52.5 | 50.8 |
| InternVL2.5-78B [15] | 52.0 | 51.7 | 53.7 | 50.8 | 52.5 | 52.6 |
| VisualPRM (ours) | 58.5 | 62.1 | 61.0 | 62.7 | 61.8 | 62.0 |
Table 3: Results on VisualProcessBench. We report the macro F1 of the correct and incorrect steps. The overall score is the micro average of the score from different data sources. Our VisualPRM exhibits state-of-the-art performance among open-source models.
### 4.3 Ablation Studies
Effects of BoN. Here, we increase the number of response candidates sampled from InternVL2.5-8B and select the final response using Self-Consistency (SC) [86], Outcome Reward Model (ORM), and PRM. The training data for ORM are nearly identical to those used for PRM, except that all steps are concatenated into a single step and step-wise correctness annotations are converted into a single correctness label for the outcome. As shown in Figure 4, increasing the number of response candidates $N$ improves the reasoning performance of InternVL2.5-8B and MiniCPM-V2.6-8B when using SC, ORM, or PRM, with PRM yielding the most significant improvements. Specifically, when using InternVL2.5-8B as the policy model, PRM outperforms SC and ORM by 2.4 and 1.5 points, respectively, under the Best-of-8 evaluation setting. Moreover, this performance gap widens as $N$ increases, reaching 3.1 and 4.3 points when $N$ is set to $128$ . Notably, when using ORM as the critic model, although performance improves during Best-of-8 evaluation, further increasing $N$ does not lead to consistent gains for InternVL2.5-8B. For example, the Best-of-128 performance is inferior to the Best-of-64 performance. These results highlight the effectiveness of PRM in TTS.
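The Self-Consistency baseline compared above can be sketched as a simple majority vote. This is a minimal illustration under the usual SC formulation: sample N responses, extract each final answer, and return the most frequent one, with no reward model involved.

```python
from collections import Counter

# Self-Consistency (sketch): majority vote over the final answers extracted
# from N sampled responses; ties resolve to the first answer encountered.
def self_consistency(final_answers):
    return Counter(final_answers).most_common(1)[0][0]
```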
Effects of PRM modeling methods. Here, we compare the value-based PRM and the advantage-based PRM introduced in Section 3.2, along with different methods for aggregating step scores into a final score: averaging, selecting the maximum value, or selecting the minimum value. The results are presented in Table 4. We find that value-based PRMs outperform advantage-based PRMs in both BoN evaluation settings and VisualProcessBench. We attribute this to the inherent noise in our training data, which is generated through an automatic data pipeline, making it challenging to accurately determine whether a given step contributes to higher or lower expected accuracy. We also compare two training strategies: supervising all steps (i.e., w/o early stop) versus supervising only up to the first incorrect step (i.e., w. early stop) during training. Experimental results show that the former yields better performance. Regarding score aggregation, selecting the maximum value results in poorer performance than averaging or taking the minimum value. Analyzing the generated scores reveals that most responses contain a high-scoring step, with a score close to $1$, at the beginning of the solution; this phenomenon likely arises because most erroneous steps appear in the middle of the solution. The statistics of VisualProcessBench presented in Section 8 further support this conclusion. Furthermore, averaging performs better than selecting the minimum value, likely because the latter relies on a single step’s score, while averaging accounts for multiple steps and can be considered an ensemble approach, which benefits step quality estimation.
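The three aggregation variants compared in Table 4 can be sketched directly. This is a minimal illustration: a response's final score is the mean, maximum, or minimum of its per-step PRM scores.

```python
# Score aggregation (sketch): collapse a response's per-step PRM scores into
# a single response-level score via one of the three variants in Table 4.
def aggregate(step_scores, method="average"):
    if method == "average":
        return sum(step_scores) / len(step_scores)
    if method == "max":
        return max(step_scores)
    if method == "min":
        return min(step_scores)
    raise ValueError(f"unknown aggregation method: {method}")
```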
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Chart: InternVL2.5-8B Overall Performance vs Best of N
### Overview
The chart compares the performance of three methods (Self-Consistency, VisualORM-8B, VisualPRM-8B) across increasing "Best of N" values (1, 8, 16, 32, 64, 128). Performance is measured on the y-axis (32–44), with distinct markers for each method: green squares, red triangles, and blue diamonds.
### Components/Axes
- **X-axis**: "Best of N" (logarithmic scale: 1, 8, 16, 32, 64, 128)
- **Y-axis**: "InternVL2.5-8B Overall Performance" (linear scale: 32–44)
- **Legend**: Located in the bottom-right corner, mapping colors to methods:
- Green squares: Self-Consistency
- Red triangles: VisualORM-8B
- Blue diamonds: VisualPRM-8B
### Detailed Analysis
1. **Self-Consistency (Green Squares)**:
- Starts at ~32.5 (N=1)
- Rises sharply to ~39 (N=8)
- Gradually increases to ~40.5 (N=16–128), plateauing slightly after N=32.
2. **VisualORM-8B (Red Triangles)**:
- Begins at ~32.5 (N=1)
- Peaks at ~39.5 (N=8)
- Fluctuates between ~39–40.5 (N=16–128), with a minor dip at N=16.
3. **VisualPRM-8B (Blue Diamonds)**:
- Starts at ~32.5 (N=1)
- Jumps to ~41 (N=8)
- Steadily increases to ~44 (N=128), showing the steepest upward trend.
### Key Observations
- **VisualPRM-8B** consistently outperforms other methods, especially at higher N values (e.g., +2.5 over Self-Consistency at N=128).
- **Self-Consistency** and **VisualORM-8B** exhibit similar performance trajectories but lag behind VisualPRM-8B.
- All methods show diminishing returns after N=32, with performance gains slowing or plateauing.
### Interpretation
The data suggests that **VisualPRM-8B** scales more effectively with increased "Best of N" values, likely due to architectural advantages or optimization strategies. Self-Consistency and VisualORM-8B demonstrate comparable but suboptimal scalability, with performance gains tapering off after moderate N values. The plateau in Self-Consistency’s performance (N≥32) may indicate saturation or inefficiency in leveraging additional data. This chart highlights the importance of method selection in performance-critical applications, where VisualPRM-8B’s scalability could provide a decisive advantage.
</details>
(a)
<details>
<summary>x5.png Details</summary>

### Visual Description
## Line Chart: MiniCPM-V-2.6 Overall Performance vs. Best of N
### Overview
The chart compares the performance of three methods (Self-Consistency, VisualORM-8B, VisualPRM-8B) across increasing "Best of N" values (1, 8, 16, 32, 64, 128). Performance is measured on the MiniCPM-V-2.6 Overall Performance scale (29–41). All three methods show upward trends, with VisualPRM-8B achieving the highest performance at larger N values.
### Components/Axes
- **X-axis**: "Best of N" (logarithmic scale: 1, 8, 16, 32, 64, 128)
- **Y-axis**: "MiniCPM-V-2.6 Overall Performance" (linear scale: 29–41)
- **Legend**: Located on the right, with:
- Green squares: Self-Consistency
- Red triangles: VisualORM-8B
- Blue diamonds: VisualPRM-8B
### Detailed Analysis
1. **Self-Consistency (Green Squares)**:
- Starts at ~29.1 (N=1) and increases to ~35.6 (N=128).
- Gradual, linear growth with minimal curvature.
- Uncertainty: ±0.2 at all points.
2. **VisualORM-8B (Red Triangles)**:
- Begins at ~29.0 (N=1) and rises to ~38.0 (N=128).
- Steeper initial slope (N=1→8: +7.0) followed by slower growth.
- Uncertainty: ±0.3 at all points.
3. **VisualPRM-8B (Blue Diamonds)**:
- Starts at ~29.0 (N=1) and peaks at ~40.0 (N=128).
- Sharp initial increase (N=1→8: +8.0), then plateaus.
- Uncertainty: ±0.4 at all points.
### Key Observations
- **Performance Correlation**: All methods improve with larger N, but VisualPRM-8B dominates at N≥32.
- **Diminishing Returns**: VisualPRM-8B’s performance plateaus after N=32, suggesting saturation.
- **Self-Consistency Lag**: The green line shows the slowest growth, indicating lower sensitivity to N.
### Interpretation
The data demonstrates that increasing the number of samples (Best of N) enhances performance across all methods. VisualPRM-8B achieves the highest gains, particularly at larger N values, but its performance stabilizes after N=32. Self-Consistency’s slower growth suggests it may rely less on iterative refinement compared to the other methods. The steep initial rise for VisualPRM-8B implies significant early-stage improvements, while its plateau indicates diminishing returns at scale. This could inform resource allocation for optimization tasks, favoring methods with higher scalability (e.g., VisualPRM-8B) for large N scenarios.
</details>
(b)
Figure 4: Overall Best-of-N results across seven multimodal reasoning benchmarks with different policy and critic models. VisualPRM consistently enhances the reasoning performance of both InternVL2.5-8B and MiniCPM-V2.6 as $N$ increases and yields larger improvements than Self-Consistency and ORM, highlighting the effectiveness of PRM in Test-Time Scaling.
| Critic Model | BoN | VL-ProcessBench |
| --- | --- | --- |
| Pass@1 | 32.8 | - |
| Random Guessing | 33.0 | 50.0 |
| InternVL2.5-8B | 33.2 | 48.0 |
| InternVL2.5-78B | 34.2 | 52.6 |
| Advantage-based PRM | | |
| +Min | 36.8 | 55.0 |
| +Max | 36.9 | 55.0 |
| +Average | 37.4 | 55.0 |
| Value (w. early stop) | | |
| +Min | 40.3 | 61.6 |
| +Max | 37.0 | 61.6 |
| +Average | 40.6 | 61.6 |
| Value (w/o early stop) | | |
| +Min | 40.4 | 62.0 |
| +Max | 35.9 | 62.0 |
| +Average | 41.1 | 62.0 |
Table 4: Comparison of different critic models and score aggregation methods. Note that score aggregation methods do not affect performance on VisualProcessBench, which focuses on step correctness judgment. We find that supervising all steps (w/o early stop) during training performs slightly better than supervising only up to the first incorrect step (w. early stop).
MLLM-as-a-Judge. Existing MLLMs can be prompted to serve as critic models. However, as shown in Table 4, the InternVL2.5 series struggles to improve BoN performance, yielding only marginal gains. Upon analyzing the generated scores, we find that these models assign similar scores to most solutions. Consistent with our observations in Section 4.2, the InternVL2.5 series tends to generate positive judgments for most steps, which hinders its ability to distinguish and select the truly superior response. Beyond their limited effectiveness as critic models, their inference latency also limits efficiency. Specifically, MLLMs generate judgments for each step in an autoregressive manner, which is time-consuming. In contrast, our VisualPRM computes scores for all steps in a single forward pass by using a “+” as a placeholder for model responses and interpreting its generation probability as the step score.
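The single-pass scoring trick described above can be sketched as follows. This is a conceptual illustration, not the released VisualPRM interface: the logit vectors and token ids stand in for whatever the model produces at each "+" placeholder position.

```python
import math

# Single-forward-pass PRM scoring (sketch): a "+" placeholder follows each
# step; the model's logits for "+" vs. "-" at each placeholder position are
# normalized over the two tokens to give that step's score.
def score_steps(logits_at_placeholders, plus_id, minus_id):
    scores = []
    for logits in logits_at_placeholders:
        p_plus = math.exp(logits[plus_id])
        p_minus = math.exp(logits[minus_id])
        scores.append(p_plus / (p_plus + p_minus))  # P("+") over {"+", "-"}
    return scores
```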
Results on text-only performance. To assess the effectiveness of VisualPRM on text-only inputs, we evaluate the Qwen2.5 series [88] and InternVL2.5 series [15] on three text reasoning benchmarks under BoN evaluation settings: GSM8K [18], MATH-500 [25], and GPQA-Diamond [62]. We report accuracy as the evaluation metric for these benchmarks. As shown in Table 5, our model enhances the text reasoning abilities of both the Qwen2.5 series and the InternVL2.5 series. Specifically, Qwen2.5-7B achieves improvements of 6.1 and 5.0 points on MATH-500 and GPQA-Diamond, respectively. Similarly, Qwen2.5-72B demonstrates gains of 2.1 and 6.6 points on these benchmarks. For the InternVL2.5 series, InternVL2.5-8B, InternVL2.5-38B, and InternVL2.5-78B achieve improvements of 9.4 and 5.0, 4.6 and 8.1, and 7.4 and 3.5 points, respectively, on MATH-500 and GPQA-Diamond. These results demonstrate the effectiveness of our VisualPRM in text-only scenarios.
| Model | GSM8K | MATH-500 | GPQA |
| --- | --- | --- | --- |
| Large Language Models | | | |
| Qwen2.5-7B [88] | 91.6 | 75.5 | 36.4 |
| +VisualPRM | 94.5 | 81.6 | 41.4 |
| Δ | +2.9 | +6.1 | +5.0 |
| Qwen2.5-32B [88] | 95.9 | 83.1 | 49.5 |
| +VisualPRM | 96.1 | 85.4 | 53.5 |
| Δ | +0.2 | +2.3 | +4.0 |
| Qwen2.5-72B [88] | 95.8 | 83.1 | 49.0 |
| +VisualPRM | 96.5 | 85.2 | 55.6 |
| Δ | +0.7 | +2.1 | +6.6 |
| Multimodal Large Language Models | | | |
| InternVL2.5-8B [15] | 81.9 | 56.8 | 29.3 |
| +VisualPRM | 82.9 | 66.2 | 34.3 |
| Δ | +1.1 | +9.4 | +5.0 |
| InternVL2.5-38B [15] | 94.6 | 75.4 | 44.4 |
| +VisualPRM | 95.6 | 80.0 | 52.5 |
| Δ | +1.0 | +4.6 | +8.1 |
| InternVL2.5-78B [15] | 93.6 | 70.4 | 47.5 |
| +VisualPRM | 94.5 | 77.8 | 51.0 |
| Δ | +0.9 | +7.4 | +3.5 |
Table 5: Results on text reasoning benchmarks. GSM8K and MATH500 are mathematics benchmarks, while GPQA is a multidisciplinary reasoning benchmark. For GPQA, we report the performance on the Diamond split. Our VisualPRM enhances the text reasoning abilities of both LLMs and MLLMs under the Best-of-8 evaluation settings.
## 5 Conclusion
In this work, we construct VisualPRM400K, a dataset comprising about 400K multimodal process supervision data. Building upon this dataset, we develop VisualPRM, an advanced multimodal Process Reward Model (PRM) capable of estimating the value score of each step during the reasoning process. With the Best-of-N (BoN) evaluation strategies, our model improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation, highlighting the effectiveness of PRMs in Test-Time Scaling. To further facilitate the development of multimodal critic models, we construct VisualProcessBench, a benchmark designed to measure the abilities of PRMs and MLLMs to detect incorrect steps in multimodal reasoning tasks. Evaluation results show that existing open-source MLLMs struggle to effectively judge the correctness of each step. We hope that our work can inspire more future research and contribute to the development of MLLMs.
## References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ahmadian et al. [2024] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.
- Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NIPS, 35:23716–23736, 2022.
- Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com, 2024.
- Bai et al. [2023a] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
- Bai et al. [2023b] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.
- Bai et al. [2025] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
- Biten et al. [2019] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NIPS, 2020.
- Cai et al. [2024] Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024.
- Cao and Xiao [2022] Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In COLING, pages 1511–1520, 2022.
- Chang et al. [2022] Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps. arXiv preprint arXiv:2211.08545, 2022.
- Chen et al. [2024a] Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. arXiv preprint arXiv:2405.16473, 2024a.
- Chen et al. [2023] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
- Chen et al. [2024b] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024b.
- Chen et al. [2024c] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024c.
- Clark and Gardner [2018] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In ACL, pages 845–855, 2018.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Contributors [2023] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- Dong et al. [2024] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Gao et al. [2023] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023.
- Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Hu [2025] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025.
- Huang et al. [2019] Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019.
- Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019.
- Kafle et al. [2018] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, pages 5648–5656, 2018.
- Kazemi et al. [2023] Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023.
- Kembhavi et al. [2016] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016.
- Lai et al. [2024] Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629, 2024.
- Lambert et al. [2024] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024.
- Laurençon et al. [2024] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. NIPS, 36, 2024.
- Li et al. [2024a] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a.
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.
- Li et al. [2024b] Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. Vlrewardbench: A challenging benchmark for vision-language generative reward models. arXiv preprint arXiv:2411.17451, 2024b.
- Li et al. [2024c] Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, et al. Omnicorpus: An unified multimodal corpus of 10 billion-level images interleaved with text. arXiv preprint arXiv:2406.08418, 2024c.
- Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- Lindström and Abraham [2022] Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022.
- Liu et al. [2023a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NIPS, 36, 2023a.
- Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024a.
- Liu et al. [2024b] Yangzhou Liu, Yue Cao, Zhangwei Gao, Weiyun Wang, Zhe Chen, Wenhai Wang, Hao Tian, Lewei Lu, Xizhou Zhu, Tong Lu, et al. Mminstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity. arXiv preprint arXiv:2407.15838, 2024b.
- Liu et al. [2024c] Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Rm-bench: Benchmarking reward models of language models with subtlety and style. arXiv preprint arXiv:2410.16184, 2024c.
- Liu et al. [2023b] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 2023b.
- Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Lu et al. [2021a] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021a.
- Lu et al. [2021b] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021b.
- Lu et al. [2022] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NIPS, 35:2507–2521, 2022.
- Lu et al. [2023] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- Luo et al. [2024] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, et al. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2, 2024.
- Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019.
- Masry et al. [2022] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL, pages 2263–2279, 2022.
- Mathew et al. [2022] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022.
- McAleese et al. [2024] Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike. Llm critics help catch llm bugs. arXiv preprint arXiv:2407.00215, 2024.
- Mishra et al. [2019] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019.
- Muennighoff et al. [2025] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- OpenAI [2024] OpenAI. Gpt-4o system card. https://openai.com/index/gpt-4o-system-card/, 2024.
- Pang et al. [2024] Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733, 2024.
- Qiao et al. [2024] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024.
- Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Rein et al. [2024] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024.
- Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NIPS, 35:25278–25294, 2022.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Seo et al. [2015] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 1466–1476, 2015.
- Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, pages 8317–8326, 2019.
- Snell et al. [2024] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
- Song et al. [2025] Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. Prmbench: A fine-grained and challenging benchmark for process-level reward models. arXiv preprint arXiv:2501.03124, 2025.
- Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Team [2023] InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
- Team [2024] Qwen Team. Qvq: To see the world with wisdom, 2024.
- Tian et al. [2024] Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, et al. Mm-interleaved: Interleaved image-text generative modeling via multi-modal feature synchronizer. arXiv preprint arXiv:2401.10208, 2024.
- Toshniwal et al. [2025] Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, and Igor Gitman. Openmathinstruct-1: A 1.8 million math instruction tuning dataset. Advances in Neural Information Processing Systems, 37:34737–34774, 2025.
- Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Wang et al. [2024a] Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, 2024a.
- Wang et al. [2024b] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804, 2024b.
- Wang et al. [2023a] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2023a.
- Wang et al. [2023b] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. NIPS, 36, 2023b.
- Wang et al. [2023c] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023c.
- Wang et al. [2024c] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024c.
- Wang et al. [2024d] Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. arXiv preprint arXiv:2402.19474, 2024d.
- Wang et al. [2024e] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. In ICLR, 2024e.
- Wang et al. [2024f] Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, et al. Needle in a multimodal haystack. arXiv preprint arXiv:2406.07230, 2024f.
- Wang et al. [2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Xiao et al. [2024] Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reasoning benchmark in visual contexts. arXiv preprint arXiv:2407.04973, 2024.
- Yang et al. [2024] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- Yao et al. [2024] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
- Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024.
- Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
- Zhang et al. [2024a] Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240, 2024a.
- Zhang et al. [2024b] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024b.
- Zhang et al. [2025] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025.
- Zhao et al. [2025] Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, et al. Omnialign-v: Towards enhanced alignment of mllms with human preference. arXiv preprint arXiv:2502.18411, 2025.
- Zheng et al. [2024] Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559, 2024.
- Zhou et al. [2024] Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, et al. Rmb: Comprehensively benchmarking reward models in llm alignment. arXiv preprint arXiv:2410.09893, 2024.
- Zhu et al. [2024] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. NIPS, 36, 2024.
- Zou et al. [2024] Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. arXiv preprint arXiv:2411.00836, 2024.
Supplementary Material
## 6 Training Hyper-parameters
During the training of VisualPRM, we enable the data-packing strategy [15]. We employ the AdamW optimizer [46] with $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $0.05$. The learning rate is initialized to $1e\text{-}5$ and follows a linear warmup over the first 5% of training steps, followed by cosine decay to a minimum learning rate of 0. We train for a single epoch.
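The learning-rate schedule described above can be sketched as follows. `lr_at_step` is an illustrative helper written for this supplement, not code from the paper's training framework; the step-indexed formulation is an assumption.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup over the first 5% of steps, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linearly ramp from ~0 up to peak_lr during warmup.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With the paper's settings, the rate peaks at $1e\text{-}5$ when warmup ends and decays toward 0 by the final step.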
## 7 More Ablation Studies
### 7.1 Effects of Training Hyper-parameters
When training our value-based Process Reward Model (PRM) on VisualPRM400K, we define a step as correct if its expected accuracy exceeds $0$. In this section, we analyze the impact of varying the expected-accuracy threshold used to determine step correctness. As shown in Table 8, increasing the threshold degrades both Best-of-8 evaluation performance and VisualProcessBench scores. These results are consistent with the observations for Qwen2.5-Math-PRM [94]. We therefore suggest setting the threshold to $0$ during training.
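The labeling rule can be sketched as below. `step_labels` is a hypothetical helper for illustration; the `mc` scores stand for the per-step expected accuracies in VisualPRM400K.

```python
def step_labels(mc_scores, threshold=0.0):
    """Mark a step correct ('+') when its expected accuracy (mc) exceeds
    the threshold, else incorrect ('-'). The recommended threshold is 0."""
    return ["+" if mc > threshold else "-" for mc in mc_scores]
```

Raising the threshold flips borderline steps (e.g., `mc=0.0625`) from correct to incorrect, which is the change ablated in Table 8.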
### 7.2 Effects of Generation Hyper-parameters
In this section, we analyze the impact of generation temperature on the Best-of-8 evaluation. As shown in Table 8, as the temperature increases from 0.3 to 1.3, the overall performance of InternVL2.5-8B first improves and then declines. We attribute this phenomenon to the trade-off between response diversity and accuracy. When the temperature is low (e.g., set to 0.3), the generated responses lack diversity, limiting the model’s performance upper bound. Conversely, when the temperature is high (e.g., set to 1.3), the responses become more random, reducing the accuracy of individual responses and lowering the model’s overall performance ceiling.
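The diversity-accuracy trade-off arises from temperature scaling of the token distribution. A minimal sketch (not the paper's decoding code; the logit values are illustrative):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution (less diverse samples);
    higher temperature flattens it (more random samples)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At temperature 0.3 the top token dominates, limiting response diversity; at 1.3 the distribution flattens, reducing per-response accuracy, mirroring the trend in Table 8.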
### 7.3 Effects of Best-of-N evaluation
In this section, we present the Best-of-N evaluation results as $N$ increases, as shown in Table 6 and Table 7. Our results indicate that as $N$ increases, VisualPRM consistently enhances the reasoning abilities of InternVL2.5-8B [15] and MiniCPM-V2.6 [89]. Specifically, as $N$ increases from $8$ to $128$, the overall performance of InternVL2.5-8B improves from 41.2 to 44.0, while that of MiniCPM-V2.6 improves from 37.5 to 40.4, demonstrating the scalability of test-time compute for MLLMs.
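Best-of-N selection with a PRM critic can be sketched as follows. The representation of responses as parallel lists of per-step scores, and the mean-over-steps aggregation, are assumptions for illustration rather than the paper's exact scoring code.

```python
def best_of_n(candidates, step_scores):
    """Select the candidate response whose aggregated PRM score is highest.

    candidates:  list of N sampled responses.
    step_scores: parallel list of per-step PRM scores for each response;
                 here aggregated by taking the mean over steps.
    """
    def mean(xs):
        return sum(xs) / len(xs)
    best_idx = max(range(len(candidates)), key=lambda i: mean(step_scores[i]))
    return candidates[best_idx]
```

Increasing $N$ widens the candidate pool, which is why performance keeps improving from $N=8$ to $N=128$ in Tables 6 and 7.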
| Method | N | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-Consistency | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 58.0 | 65.9 | 23.4 | 30.5 | 18.4 | 32.7 | 43.0 | 38.8 |
| | 16 | 58.6 | 65.8 | 26.3 | 32.1 | 19.4 | 33.0 | 43.4 | 39.8 |
| | 32 | 60.4 | 66.7 | 28.0 | 32.6 | 20.8 | 34.1 | 44.7 | 41.0 |
| | 64 | 59.7 | 66.7 | 26.6 | 33.2 | 20.6 | 35.8 | 43.4 | 40.9 |
| | 128 | 60.6 | 67.4 | 25.7 | 32.0 | 22.6 | 34.7 | 43.2 | 40.9 |
| VisualORM | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 60.2 | 67.0 | 25.3 | 32.5 | 16.4 | 35.0 | 41.8 | 39.7 |
| | 16 | 58.3 | 67.7 | 27.0 | 33.6 | 16.6 | 33.1 | 39.1 | 39.3 |
| | 32 | 58.6 | 67.9 | 26.3 | 33.6 | 17.4 | 34.4 | 42.1 | 40.0 |
| | 64 | 59.4 | 66.8 | 28.6 | 33.9 | 17.8 | 34.1 | 42.3 | 40.4 |
| | 128 | 59.4 | 66.6 | 28.3 | 33.5 | 16.8 | 32.3 | 40.9 | 39.7 |
| VisualPRM | 1 | 56.2 | 64.5 | 17.0 | 22.8 | 9.4 | 23.5 | 36.0 | 32.8 |
| | 8 | 60.2 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.2 |
| | 16 | 60.2 | 69.9 | 27.3 | 36.4 | 19.0 | 38.8 | 42.5 | 42.0 |
| | 32 | 60.3 | 70.4 | 29.6 | 37.8 | 17.2 | 40.3 | 43.4 | 42.7 |
| | 64 | 61.4 | 69.6 | 30.6 | 38.2 | 18.8 | 40.2 | 45.4 | 43.5 |
| | 128 | 61.7 | 70.8 | 30.3 | 39.3 | 19.4 | 40.9 | 45.4 | 44.0 |
Table 6: Overall Best-of-N results of InternVL2.5-8B across seven multimodal reasoning benchmarks with different critic models.
| Method | N | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-Consistency | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 51.8 | 58.9 | 21.7 | 31.5 | 10.0 | 22.6 | 35.6 | 33.2 |
| | 16 | 51.7 | 60.2 | 21.7 | 31.5 | 11.6 | 25.7 | 35.3 | 34.0 |
| | 32 | 52.2 | 60.1 | 24.3 | 33.1 | 11.4 | 24.3 | 36.0 | 34.5 |
| | 64 | 51.7 | 61.0 | 23.4 | 34.8 | 12.8 | 25.8 | 35.3 | 35.0 |
| | 128 | 53.2 | 61.7 | 25.7 | 33.5 | 13.0 | 25.6 | 35.6 | 35.5 |
| VisualORM | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 55.7 | 66.0 | 22.0 | 33.5 | 10.2 | 24.1 | 38.9 | 35.8 |
| | 16 | 56.4 | 65.3 | 24.0 | 32.1 | 10.4 | 27.3 | 36.5 | 36.0 |
| | 32 | 58.8 | 64.8 | 19.7 | 35.7 | 12.0 | 29.4 | 38.5 | 37.0 |
| | 64 | 58.2 | 67.3 | 22.7 | 35.5 | 11.0 | 30.1 | 37.6 | 37.5 |
| | 128 | 58.2 | 66.5 | 25.3 | 35.4 | 11.6 | 30.0 | 40.7 | 38.2 |
| VisualPRM | 1 | 49.8 | 60.8 | 23.4 | 18.9 | 9.8 | 16.4 | 27.5 | 29.5 |
| | 8 | 56.8 | 65.7 | 24.7 | 35.8 | 11.2 | 31.0 | 37.4 | 37.5 |
| | 16 | 58.8 | 68.6 | 24.0 | 37.3 | 12.4 | 32.7 | 39.8 | 39.1 |
| | 32 | 57.8 | 68.4 | 26.6 | 38.5 | 13.4 | 35.3 | 39.1 | 39.9 |
| | 64 | 58.6 | 69.4 | 25.3 | 39.7 | 12.2 | 38.2 | 36.9 | 40.0 |
| | 128 | 59.3 | 69.4 | 25.3 | 39.1 | 14.4 | 37.0 | 38.3 | 40.4 |
Table 7: Overall Best-of-N results of MiniCPM-V2.6 across seven multimodal reasoning benchmarks with different critic models.
| Setting | MMMU | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Overall | VisualProcessBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Threshold=0.00 | 59.3 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.1 | 62.0 |
| Threshold=0.0625 | 59.7 | 66.8 | 24.7 | 36.7 | 18.4 | 35.0 | 41.8 | 40.4 | 61.0 |
| Threshold=0.125 | 58.0 | 67.9 | 27.6 | 35.4 | 17.4 | 35.3 | 41.6 | 40.5 | 60.7 |
| Threshold=0.25 | 58.6 | 67.6 | 25.7 | 33.6 | 16.8 | 36.0 | 41.4 | 40.0 | 60.2 |
| Temperature | | | | | | | | | |
| Temperature=0.3 | 59.7 | 69.4 | 26.0 | 32.6 | 17.6 | 35.5 | 42.7 | 40.5 | - |
| Temperature=0.7 | 59.3 | 68.5 | 25.7 | 35.8 | 18.0 | 36.5 | 43.8 | 41.1 | - |
| Temperature=1.0 | 61.7 | 67.2 | 27.3 | 35.8 | 16.6 | 34.2 | 43.2 | 40.9 | - |
| Temperature=1.3 | 57.9 | 66.1 | 25.0 | 32.1 | 16.8 | 31.9 | 40.5 | 38.6 | - |
Table 8: Ablation studies on the effects of the expected accuracy threshold and the generation temperature.
## 8 More Statistics for VisualProcessBench
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Chart: Step Index vs. Steps (%)
### Overview
The chart displays five data series representing the percentage of steps taken across a step index range (0–30). Each series is color-coded and labeled in the legend (top-right). The y-axis shows "Steps (%)" with a range of 0–5, while the x-axis shows "Step Index" with a range of 0–30. Dotted grid lines aid in reading values.
### Components/Axes
- **X-axis (Step Index)**: Labeled "Step Index" with ticks at 0, 5, 10, 15, 20, 25, 30.
- **Y-axis (Steps %)**: Labeled "Steps (%)" with ticks at 0, 1, 2, 3, 4, 5.
- **Legend**: Located in the top-right corner, with five entries:
- **MathVision** (gray line)
- **MathVerse** (red line)
- **MMMU** (blue line)
- **DynaMath** (green line)
- **WeMath** (purple line)
### Detailed Analysis
1. **MathVision (gray)**:
- Starts at ~1.5% at Step Index 5.
- Peaks at ~3% around Step Index 10.
- Drops to ~0.5% by Step Index 15, then remains near 0.
2. **MathVerse (red)**:
- Starts at ~2% at Step Index 5.
- Peaks sharply at ~5% around Step Index 10.
- Drops sharply to ~0.5% by Step Index 15, then remains near 0.
3. **MMMU (blue)**:
- Starts at ~0.8% at Step Index 5.
- Peaks at ~1.2% around Step Index 10.
- Drops to ~0.2% by Step Index 15, then remains near 0.
4. **DynaMath (green)**:
- Starts at ~2.5% at Step Index 5.
- Peaks at ~3% around Step Index 10.
- Drops to ~1% by Step Index 15, then remains near 0.
5. **WeMath (purple)**:
- Starts at ~1% at Step Index 5.
- Peaks at ~2% around Step Index 10.
- Drops to ~0.5% by Step Index 15, then remains near 0.
### Key Observations
- **Peak Activity**: All lines peak around Step Index 10, suggesting a critical threshold or event at this point.
- **MathVerse (red)** exhibits the highest peak (~5%) and sharpest decline, indicating a brief but intense activity.
- **DynaMath (green)** and **MathVision (gray)** show moderate peaks (~3% and ~3%, respectively) with gradual declines.
- **MMMU (blue)** and **WeMath (purple)** have lower peaks (~1.2% and ~2%, respectively) but similar decline patterns.
- All lines drop to near 0% after Step Index 15, suggesting a cessation of activity or a reset.
### Interpretation
The chart shows, for each of the five benchmark subsets of VisualProcessBench (MathVision, MathVerse, MMMU, DynaMath, WeMath), the proportion of steps falling at each step index. All subsets peak around step index 10 and taper off after index 15, indicating that most solutions consist of fewer than 15 steps; MathVerse shows the sharpest concentration, while MMMU and WeMath have flatter distributions.
</details>
(a)
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Graph: Incorrect Steps (%) Across Models
### Overview
The image is a line graph comparing the percentage of incorrect steps across five models (MathVision, MathVerse, MMMU, DynaMath, WeMath) over a sequence of steps (Step Index 0–30). The y-axis represents "Incorrect Steps (%)" (0–100%), and the x-axis represents "Step Index" (0–30). Each model is represented by a distinct colored line with shaded regions indicating variability or confidence intervals.
### Components/Axes
- **X-axis (Step Index)**: Labeled "Step Index" with ticks at 0, 5, 10, 15, 20, 25, 30.
- **Y-axis (Incorrect Steps %)**: Labeled "Incorrect Steps (%)" with ticks at 0, 20, 40, 60, 80, 100.
- **Legend**: Located in the top-left corner, mapping colors to models:
- Gray: MathVision
- Red: MathVerse
- Blue: MMMU
- Green: DynaMath
- Purple: WeMath
- **Lines**: Each model has a line with circular markers (filled for MathVision, outlined for others) and shaded regions below the line.
### Detailed Analysis
1. **MathVision (Gray)**:
- Starts at ~10% (Step 0), rises to a peak of ~50% at Step 12, then declines sharply to ~10% by Step 30.
- Shaded region widens significantly after Step 12, indicating high variability.
2. **MathVerse (Red)**:
- Begins at ~15% (Step 0), peaks at ~60% at Step 23, then drops to ~20% by Step 30.
- Shaded region is narrower than MathVision’s, suggesting lower variability.
3. **MMMU (Blue)**:
- Starts at ~10% (Step 0), remains stable until Step 25, then spikes to 100% at Step 25 and remains there.
- Shaded region is minimal before Step 25 but becomes a vertical band at Step 25.
4. **DynaMath (Green)**:
- Begins at ~5% (Step 0), rises to ~70% at Step 20, then drops to ~30% by Step 30.
- Shaded region is moderate, with a sharp decline after Step 20.
5. **WeMath (Purple)**:
- Starts at ~10% (Step 0), declines gradually to ~5% by Step 15, then stabilizes at ~2% by Step 30.
- Shaded region is the narrowest, indicating consistent performance.
### Key Observations
- **MMMU’s Outlier**: The 100% spike at Step 25 is the highest value across all models and steps.
- **DynaMath’s Peak**: The second-highest peak (~70%) occurs at Step 20.
- **MathVision vs. MathVerse**: Both models show similar early trends but diverge after Step 12, with MathVerse peaking later.
- **WeMath’s Consistency**: The only model with a steady decline and minimal variability.
### Interpretation
The lines show the per-index step error rate for each benchmark subset of VisualProcessBench. Error rates are lower in the early steps and generally rise with the step index, consistent with the observation that problems requiring longer solutions tend to be harder. The abrupt values at large indices (e.g., MMMU reaching 100% at step 25) reflect the small number of solutions that extend that far, so the statistics in that range are noisy; the widening shaded regions indicate this growing uncertainty.
</details>
(b)
Figure 5: Step Distribution of VisualProcessBench. The X-axis represents the step index. (a) The Y-axis indicates the proportion of steps at each index relative to the total number of steps, reflecting the distribution of step positions in solutions. (b) The Y-axis represents the error rate of steps at each index, showing the likelihood of errors occurring at different step positions.
The statistics for the step distribution of VisualProcessBench are presented in Figure 5. We observe that most solutions consist of fewer than 15 steps, and among these, most contain about 7 or 13 steps. Regarding step correctness, the error rate is lower in the first three steps and increases as the step index grows. We attribute this to the fact that problems requiring more reasoning steps tend to be more challenging, leading to a gradual rise in step error rates. Notably, the error rate drops sharply from step 15 onward; the number of steps in this range is relatively small, resulting in significant statistical fluctuations.
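The per-index error rates in Figure 5(b) can be computed with a simple tally like the following. This is an illustrative sketch, assuming solutions are represented as lists of binary step labels (1 = erroneous step).

```python
from collections import defaultdict

def error_rate_by_index(solutions):
    """Fraction of erroneous steps at each step index across all solutions."""
    errors, totals = defaultdict(int), defaultdict(int)
    for solution in solutions:
        for index, is_error in enumerate(solution):
            totals[index] += 1
            errors[index] += is_error
    return {index: errors[index] / totals[index] for index in totals}
```

Because few solutions reach large step indices, `totals[index]` becomes small there, which explains the sharp statistical fluctuations after step 15.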
## 9 More Data Examples in VisualPRM400K
In this section, we provide more data examples of VisualPRM400K in Figure 6 from different domains, including general visual question answering (VQA) [23, 28, 52, 48], science [31, 13, 49], chart [53, 29, 12], mathematics [40, 65, 11, 47, 30, 22], OCR [54, 67, 8, 27, 56], and document [17].
## 10 More Data Examples in VisualProcessBench
In this section, we provide more data examples in VisualProcessBench from different data sources. Specifically, we randomly choose three examples from our benchmark and visualize them in Figure 7. Additionally, in Figure 8, we provide an example where the model initially generates an incorrect answer and then autonomously corrects it.
<details>
<summary>x8.png Details</summary>

### Visual Description
## Receipt and Calculation Breakdown: Geoventure Transaction
### Overview
The image contains a simplified tax invoice from "Geoventure" (Taste Of The World Sdn Bhd) and a step-by-step solution document explaining how to calculate the total amount on the receipt. The receipt details a transaction for a "Tiger Bucket" with associated charges, GST, and payment details. The solution document methodically breaks down the calculation process.
---
### Components/Axes
#### Receipt Elements:
- **Header**:
- Restaurant: Geoventure (Taste Of The World Sdn Bhd)
- Address: Lot 38, 39, 101 Boulevard, Jalan Kenari 5, Bandar Puchong Jaya, 47170 Puchong, Selangor
- Contact: Tel: 03-8079 0255, GST REG NO: 000136577024
- Invoice No: 581355, Date: 5/3/2018 22:52:33, Table No: C38
- **Items**:
- **Tiger Bucket** (GTG02): Quantity 1, Unit Price RM 55.00
- **Service Chrg (Incl GST 6%)**: RM 5.83
- **Totals**:
- **Total Sales (Incl GST 6%)**: RM 64.15
- **Cash Paid**: RM 100.00
- **Change**: RM 35.85
- **GST Summary**:
- **6% Tax**: RM 3.63 (on RM 60.50)
- **0% Tax**: RM 0.00
#### Solution Document Elements:
- **Steps**:
1. Identify item price: "Tiger Bucket" = RM 55.00
2. Total cost of items: RM 55.00
3. No additional charges
4. Include service charge (RM 5.83) → Total: RM 60.83
5. Total Sales (Incl GST 6%): RM 64.15
6. Amount paid: RM 100.00
7. Change: RM 100.00 - RM 64.15 = RM 35.85
8. Final total amount: RM 100.00 (customer's cash payment)
---
### Detailed Analysis
#### Receipt Breakdown:
1. **Item Costs**:
- Tiger Bucket: RM 55.00
- Service Charge (Incl GST 6%): RM 5.83
2. **Tax Calculation**:
- GST (6%) applied to RM 60.50 (item + service charge base): RM 3.63
- Total Sales (Incl GST): RM 64.15
3. **Payment**:
- Cash Paid: RM 100.00
- Change: RM 35.85
#### Solution Document Logic:
- Steps 1–3 confirm the base item cost and absence of discounts.
- Step 4 adds the service charge (RM 5.83) to the item cost, resulting in RM 60.83.
- Step 5 calculates GST (6%) on RM 60.83, yielding RM 3.65 (rounded to RM 3.63 in the receipt).
- Step 6–7 compute the change from RM 100.00 payment.
- Step 8 concludes the total amount is the cash payment (RM 100.00).
---
### Key Observations
1. **GST Discrepancy**:
- The receipt lists GST as RM 3.63 (6% of RM 60.50), but the service charge (RM 5.83) already includes GST. This suggests the GST summary may be miscalculated or rounded differently.
2. **Rounding Adjustment**:
- A "Rounding Adjustment: RM 0.02" is noted but not explicitly applied in the totals.
3. **Final Amount**:
- The solution document incorrectly states the "total amount" as RM 100.00 (customer's payment), whereas the receipt's total is RM 64.15. This is likely a misinterpretation in the solution.
---
### Interpretation
- The receipt accurately reflects a transaction where a customer paid RM 100.00 for a RM 64.15 order, receiving RM 35.85 in change.
- The solution document contains an error in Step 8, conflating the customer's payment (RM 100.00) with the total amount due (RM 64.15). The "total amount" should refer to the latter, not the former.
- The GST calculation appears inconsistent: the service charge (RM 5.83) already includes GST, yet the GST summary applies 6% to RM 60.50, suggesting a potential double-counting or mislabeling of taxable bases.
---
### Spatial Grounding
- **Receipt**:
- Header (top-left), Items (center), Totals (bottom-left), GST Summary (bottom-right).
- **Solution Document**:
- Steps listed sequentially (left-aligned), with calculations in bold (e.g., "RM 55.00 + RM 5.83 = RM 60.83").
---
### Trend Verification
- **Receipt Totals**:
- Item + Service Charge: RM 55.00 + RM 5.83 = RM 60.83 (linear addition).
- GST: RM 3.63 (6% of RM 60.50, non-linear due to tax application).
- Final Total: RM 64.15 (sum of RM 60.83 + RM 3.63, rounded).
- **Solution Steps**:
- Linear progression from item cost to final change calculation.
---
### Component Isolation
1. **Header**: Identifies the business and transaction metadata.
2. **Items/Charges**: Lists individual costs and taxes.
3. **Payment Section**: Shows cash tendered and change.
4. **Solution Steps**: Logical breakdown of calculations, though Step 8 misrepresents the total amount.
---
### Final Answer
The total amount on the receipt is **RM 64.15** (total sales including GST). The solution document incorrectly states the final answer as RM 100.00, which is the customer's payment, not the total due.
</details>
(a)
<details>
<summary>x9.png Details</summary>

### Visual Description
## Screenshot: Technical Document with Letter and Solution Analysis
### Overview
The image shows a screenshot of a technical document containing a typed letter and a structured solution section. The letter is addressed to "Mr. Place" and includes a reference to a new address provided by Dudley's sister. The solution section outlines a step-by-step process to identify a date mentioned in the document.
### Components/Axes
- **Header**: "Document" (top-left corner).
- **Letter Section**:
- Recipient: "Mr. Place"
- Address: "P.W. Company B — USNTS, Ward Island, Corpus Christi, Texas."
- Date Reference: "11/20" (likely November 20).
- Content: Mentions the sender's activities in NYC and Washington, and a request to "supplements his book learning."
- **Solution Section**:
- Steps labeled with timestamps (e.g., `# Step-0 (mc=0.0625)`).
- Final answer: "11/20" (November 20).
- **Footer**: Source URL: `https://www.industrydocuments.ucsf.edu/docs/ffhl0226`.
### Detailed Analysis
- **Letter Content**:
- The letter is dated "11/20" (Step 1 in the solution).
- The address is explicitly stated as "P.W. Company B — USNTS, Ward Island, Corpus Christi, Texas."
- The sender mentions being "busy in NYC today, Washington (11/20)" and references "mn" (likely "Monday").
- **Solution Section**:
- Step 0: Identifies the task as determining the date.
- Step 1: Locates the date "11/20" in the text.
- Step 2: Breaks down the steps for clarity.
- Step 3: Confirms "11/20" as an abbreviation for November 20.
- Step 4: Finalizes the answer as "11/20."
### Key Observations
- The date "11/20" is explicitly mentioned in both the letter and the solution.
- The address "P.W. Company B — USNTS, Ward Island, Corpus Christi, Texas" is critical for contextualizing the letter's purpose.
- The solution section uses a structured, step-by-step approach to validate the date, emphasizing precision.
### Interpretation
The document appears to be a formal communication (letter) with a date reference ("11/20") that is later analyzed in the solution section. The solution's methodical breakdown suggests the document is part of a technical or investigative process, possibly related to archival research or data extraction. The mention of "supplements his book learning" implies the letter may involve educational or informational content. The source URL indicates the document is part of a larger collection, likely from an academic or institutional repository.
No non-English text is present. All labels, dates, and addresses are transcribed with high confidence. The structure of the solution section confirms the date "11/20" as the primary data point.
</details>
(b)
<details>
<summary>x10.png Details</summary>

### Visual Description
## Screenshot: Mathematics Problem with Solution
### Overview
The image displays a mathematics problem involving geometric shapes and a step-by-step solution. The problem asks to subtract all red shiny spheres and determine the remaining number of spheres. The solution is presented with labeled steps, code-like comments, and numerical values.
### Components/Axes
- **Title**: "Mathematics" (top-left corner).
- **Image**: A photograph of geometric shapes:
- Blue sphere (left side).
- Red shiny sphere (center-right).
- Blue cylinder (bottom-right).
- Gray cylinder (bottom-center).
- Yellow cube (top-center).
- Purple cube (top-right).
- **Question**: "Subtract all red shiny spheres. How many spheres are left?"
- **Solution**: Structured in steps with code-like comments:
- `## Step-0 (mc=0.6875)`: Initial analysis.
- `## Step-1 (mc=0.0)`: Step 1 details.
- `## Step-2 (mc=0.0625)`: Step 2 details.
- `## Step-3 (mc=0.0)`: Final answer.
### Detailed Analysis
1. **Step-0 (mc=0.6875)**:
- Purpose: Determine remaining spheres after removing the red shiny sphere.
- Analysis: Break down objects in the image step by step.
2. **Step-1 (mc=0.0)**:
- **1.** Two spheres are visible in the image.
- **2.** One sphere is red and shiny.
- **3.** The other sphere is blue and shiny.
3. **Step-2 (mc=0.0625)**:
- Since the question asks for spheres after removing the red one, subtract the red sphere from the total count of spheres.
4. **Step-3 (mc=0.0)**:
- Final answer: **1** (one sphere remains after subtraction).
### Key Observations
- The red shiny sphere is explicitly identified as the target for subtraction.
- The solution uses a stepwise logical process to isolate the red sphere and calculate the result.
- Code-like comments (`mc=...`) record a per-step score in [0, 1], consistent with the Monte Carlo estimated expected accuracy that VisualPRM400K attaches to each solution step.
### Interpretation
The problem demonstrates basic arithmetic and object classification. The solution emphasizes:
1. **Counting**: Identifying all spheres (two total).
2. **Filtering**: Isolating the red shiny sphere (1 out of 2).
3. **Subtraction**: Removing the red sphere leaves 1 sphere (the blue one).
The `mc` values in the steps may indicate model confidence or error rates, but the core logic relies on visual analysis and arithmetic. The final answer aligns with the problem’s requirements, confirming the blue sphere remains after subtraction.
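The `mc` annotations can be read as Monte Carlo scores: the fraction of sampled continuations from a given step that reach the correct final answer. A minimal sketch, assuming a hypothetical `rollout` sampler that continues the solution once and reports success:

```python
import random

def estimate_mc(rollout, n_samples=16, seed=0):
    """Estimate a step's mc score as the success rate of sampled continuations.

    rollout(rng) -> bool is a hypothetical sampler: it continues the solution
    from the current step and reports whether the final answer is correct.
    """
    rng = random.Random(seed)
    hits = sum(rollout(rng) for _ in range(n_samples))
    return hits / n_samples

# A step whose continuations always fail (like Step-1 above) gets mc = 0.0.
always_wrong = lambda rng: False
print(estimate_mc(always_wrong))  # 0.0
```

Under this reading, Step-0's mc=0.6875 would mean 11 of 16 sampled continuations succeeded, while mc=0.0 marks a step from which no sampled continuation recovered.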
</details>
(c)
<details>
<summary>x11.png Details</summary>

### Visual Description
## Diagram: Footprint Analysis for Foot Type Identification
### Overview
The image presents a science-related diagram comparing three footprints labeled **B**, **A**, and **D**, alongside a question asking which foot type corresponds to Diagram **D**. The solution provides a step-by-step analysis of pressure distribution and arch visibility to determine the foot type.
### Components/Axes
- **Footprints**: Three black silhouettes labeled **B**, **A**, and **D** (left to right).
- **Question Text**: "Diagram D corresponds with what foot type? Normal Flat Rigid Big."
- **Solution Text**: A detailed explanation analyzing pressure distribution and arch visibility.
### Detailed Analysis
1. **Footprint Labels**:
- **B**: Full footprint with no visible arch (indicative of a flat foot).
- **A**: Partial footprint with a pronounced arch (indicative of a rigid foot).
- **D**: Full footprint with a visible arch (indicative of a normal foot).
2. **Solution Steps**:
- **Step 0**: Initial analysis of Diagram D’s footprint.
- **Step 1**: Definition of a normal foot (even pressure distribution, well-defined arches).
- **Step 2**: Definition of a flat foot (entire footprint presses against the ground, no arch).
- **Step 3**: Definition of a rigid foot (hard to identify, little flexibility).
- **Step 4**: Definition of a "big foot" (size-based, not pressure-based).
- **Step 5**: Observation that Diagram D shows a visible arch with heel and forefoot touching the ground.
- **Step 6**: Conclusion: Diagram D matches a **normal foot**.
### Key Observations
- Diagram **D** is the only footprint with a visible arch, distinguishing it from **B** (flat) and **A** (rigid).
- The solution explicitly rules out "big foot" as a pressure-distribution-based category.
### Interpretation
The diagram demonstrates that foot type classification relies on arch visibility and pressure distribution. Diagram **D**’s clear arch and balanced contact points align with the anatomical characteristics of a **normal foot**, where the arch acts as a shock absorber. This contrasts with flat feet (no arch) and rigid feet (exaggerated arch with reduced flexibility). The analysis emphasizes that foot size ("big") is irrelevant to this classification, focusing instead on structural and functional traits.
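The classification logic in Steps 1–6 reduces to two features. A minimal rule-based sketch (the feature names `arch_visible` and `contact` are illustrative, not taken from the figure):

```python
def classify_footprint(arch_visible: bool, contact: str) -> str:
    """Classify a footprint from arch visibility and ground contact.

    Rules follow the definitions above: flat = full print with no arch;
    rigid = partial print with a pronounced arch; normal = full print
    (heel and forefoot) with a visible arch. Size-based labels like
    "big" are ignored, as the solution rules them out.
    """
    if not arch_visible and contact == "full":
        return "flat"      # matches footprint B
    if arch_visible and contact == "partial":
        return "rigid"     # matches footprint A
    if arch_visible and contact == "full":
        return "normal"    # matches footprint D
    return "unknown"

print(classify_footprint(arch_visible=True, contact="full"))  # normal
```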
</details>
(d)
<details>
<summary>x12.png Details</summary>

### Visual Description
## Screenshot: Visual Question Answering (VQA) Task
### Overview
The image depicts a Visual Question Answering (VQA) interface. It includes a stylized blue shell graphic with a dashed diagonal line, a question about symmetry, multiple-choice options, and a step-by-step solution with confidence metrics.
### Components/Axes
- **Header**:
- Text: "General VQA" (top-left corner, blue border).
- **Main Content**:
- **Image**: Stylized blue shell with a dashed diagonal line (top-left quadrant).
- **Question**: "Does this picture have symmetry?"
- **Options**:
- A. yes
- B. no
- **Solution Section**:
- Step-0: `(mc=0.625)` - Instruction to determine symmetry via steps.
- Step-1: `(mc=0.0)` - Identify object (stylized flying object) and check bilateral symmetry.
- Step-2: `(mc=0.0)` - Confirm symmetry via mirror image along central vertical line.
- Step-3: `(mc=0.0)` - Final answer: A (yes).
### Detailed Analysis
- **Image**:
- The shell is positioned in the top-left quadrant, partially overlapping the dashed diagonal line. The line runs from the top-left to bottom-right corner of the image.
- **Question and Options**:
- The question asks about symmetry, with two binary choices (A: yes, B: no).
- **Solution Steps**:
- **Step-0**: High confidence (`mc=0.625`) in the initial instruction to follow symmetry-checking steps.
- **Step-1**: Low confidence (`mc=0.0`) in identifying the object as a stylized flying object and assessing bilateral symmetry.
- **Step-2**: Low confidence (`mc=0.0`) in confirming symmetry via mirroring along the central vertical line.
- **Step-3**: Low confidence (`mc=0.0`) in the final answer (A: yes).
### Key Observations
- The dashed diagonal line may represent a proposed symmetry axis, but the solution emphasizes vertical symmetry.
- Confidence metrics (`mc`) are inconsistent: Step-0 has moderate confidence, while subsequent steps have zero confidence.
- The final answer (A: yes) contradicts the low confidence in later steps, suggesting potential model uncertainty.
### Interpretation
The VQA task demonstrates how an AI model evaluates symmetry in an image. Despite the model’s low confidence in later steps, it concludes symmetry exists (A: yes). This discrepancy highlights potential limitations in the model’s reasoning process, such as over-reliance on initial instructions (Step-0) or misalignment between visual cues (dashed line) and textual reasoning. The dashed line’s diagonal orientation conflicts with the described vertical symmetry axis, indicating possible ambiguity in the task design or model interpretation.
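Step-2's mirror test, reflecting the image across a central vertical axis and comparing, can be sketched on a binary grid (a toy stand-in for the actual image):

```python
def is_vertically_symmetric(grid):
    # An image is bilaterally symmetric about its central vertical axis
    # iff every row equals its own reversal.
    return all(row == row[::-1] for row in grid)

symmetric = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
asymmetric = [
    [0, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
]
print(is_vertically_symmetric(symmetric), is_vertically_symmetric(asymmetric))  # True False
```

Checking the diagonal axis suggested by the dashed line would instead compare the grid to its transpose, which is a different test; the two can disagree, matching the ambiguity noted above.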
</details>
(e)
<details>
<summary>x13.png Details</summary>

### Visual Description
## Bar Chart: Democrats Hold Advantages on Environment, Abortion, Education; Neither Party Has Edge on the Economy
### Overview
The image presents a comparative analysis of public perception regarding which political party (Democratic or Republican) is better equipped to handle various societal issues. The chart includes a table of percentages, a question about foreign policy ratios, and a step-by-step solution. Key themes include partisan advantages in social issues, neutrality in economic matters, and a slight Democratic edge in foreign policy.
### Components/Axes
- **Title**: "Democrats Hold Advantages on Environment, Abortion, Education; Neither Party Has Edge on the Economy"
- **Table Headers**:
- Issue categories (e.g., Gun control, Terrorist threat, Budget deficit)
- Columns:
- Democratic Party (%)
- Republican Party (%)
- Diff (percentage point difference)
- **Question**: "What is the ratio between Democratic and Republican Party in Foreign policy?"
- **Solution Steps**:
- Step 0: Identify foreign policy percentages (Democrats: 41%, Republicans: 38%)
- Step 1: Calculate ratio (41:38)
- Final Answer: 41:38
### Detailed Analysis
#### Table Data
| Issue | Democratic Party (%) | Republican Party (%) | Diff |
|------------------------|----------------------|----------------------|------|
| Gun control | 36 | 48 | R+12 |
| Terrorist threat | 34 | 44 | R+10 |
| Budget deficit | 37 | 41 | R+4 |
| Taxes | 41 | 42 | R+1 |
| Immigration | 42 | 40 | D+2 |
| Economy | 44 | 41 | D+3 |
| Foreign policy | 41 | 38 | D+3 |
| Health care | 46 | 36 | D+10 |
| Education | 46 | 34 | D+12 |
| Abortion & contraception | 50 | 31 | D+19 |
| Environment | 53 | 27 | D+26 |
#### Key Observations
1. **Democratic Advantages**:
- Environment (53% vs. 27%)
- Abortion & contraception (50% vs. 31%)
- Education (46% vs. 34%)
- Health care (46% vs. 36%)
2. **Republican Advantages**:
- Gun control (48% vs. 36%)
- Terrorist threat (44% vs. 34%)
- Budget deficit (41% vs. 37%)
3. **Economic Neutrality**:
- Economy (44% vs. 41%) and taxes (41% vs. 42%) show minimal partisan divides.
4. **Foreign Policy**:
- Democrats lead slightly (41% vs. 38%), with a calculated ratio of 41:38.
### Interpretation
The data suggests Democrats are perceived as more competent in social and environmental issues, while Republicans hold an edge in security-related domains. The economy is viewed as a nonpartisan concern. The foreign policy ratio (41:38) indicates a marginal Democratic advantage, though the difference is small (3 percentage points). This analysis could inform strategic messaging or policy prioritization for both parties. The stark contrast in abortion/environmental perceptions highlights cultural divides, while the economic neutrality underscores shared voter priorities.
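The ratio computation in the solution steps amounts to dividing both percentages by their greatest common divisor; for 41 and 38 (coprime) the ratio is already in lowest terms:

```python
from math import gcd

def simplify_ratio(a: int, b: int) -> str:
    # Reduce a:b to lowest terms by dividing out the gcd.
    g = gcd(a, b)
    return f"{a // g}:{b // g}"

print(simplify_ratio(41, 38))  # 41:38  (gcd is 1, so nothing cancels)
print(simplify_ratio(44, 41))  # 44:41  (economy column, also coprime)
```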
</details>
(f)
Figure 6: More data examples from VisualPRM400K.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Diagram: Elements of Art and Question Analysis
### Overview
The image presents an educational diagram titled "Data Example from MMMU," focusing on the **Elements of Art**. It includes a question about geometric concepts and a step-by-step solution. The left side visually represents seven art elements (Line, Shape, Form, Space, Texture, Value, Color) with abstract examples, while the right side contains a multiple-choice question and its detailed solution.
---
### Components/Axes
- **Left Panel**:
- **Title**: "Elements of Art"
- **Visual Examples**:
1. **Line**: Abstract wavy and straight lines.
2. **Shape**: Geometric figures (triangle, square, circle).
3. **Form**: 3D objects (cube, sphere, cone).
4. **Space**: Negative/positive space interaction.
5. **Texture**: Patterns mimicking surfaces (e.g., fabric, stone).
6. **Value**: Gradient from light to dark.
7. **Color**: Primary colors (red, blue, yellow) with secondary hues.
- **Right Panel**:
- **Question**:
*"Based on <image 1>. When lines meet to form an enclosed area, this is formed."*
Options:
A. Space
B. Shape
C. Form
D. Line
- **Solution**:
Step-by-step breakdown of art element definitions:
1. **Space**: Area between/around elements (positive/negative).
2. **Shape**: Enclosed area created by lines (geometric/organic).
3. **Form**: 3D aspect of an object (volume/depth).
4. **Line**: Mark made by a moving point.
5. **Conclusion**: Enclosed area = "Shape" → Final answer: **B**.
---
### Detailed Analysis
- **Art Elements**:
- Each element is paired with a visual example (e.g., "Line" shows abstract strokes; "Color" uses primary/secondary hues).
- Definitions in the solution clarify distinctions (e.g., "Form" vs. "Shape" emphasizes 3D vs. 2D).
- **Question Logic**:
- The question tests recognition of geometric concepts.
- The solution methodically eliminates options by defining terms:
- **Space** (A) refers to spatial relationships, not enclosed areas.
- **Form** (C) implies 3D volume, not 2D boundaries.
- **Line** (D) is the tool, not the result.
---
### Key Observations
- **Correct Answer**: B (Shape) is explicitly justified by the definition of an enclosed area.
- **Visual-Textual Alignment**: The left panel’s "Shape" example (geometric figures) directly supports the textual explanation.
- **Step-by-Step Rigor**: The solution avoids ambiguity by defining all options before concluding.
---
### Interpretation
The diagram emphasizes foundational art theory, linking visual examples to abstract definitions. The question tests the ability to distinguish between spatial concepts (Space) and geometric constructs (Shape). By defining each term, the solution demonstrates critical reasoning:
- **Shape** is the only term describing a 2D enclosed area formed by lines.
- The diagram reinforces that art elements are interdependent (e.g., lines create shapes, which occupy space).
This structure aligns with pedagogical goals of clarifying abstract concepts through examples and definitions.
</details>
(a)
<details>
<summary>x15.png Details</summary>

### Visual Description
## Musical Notation Analysis: Key Identification
### Overview
The image presents a musical staff with a melody labeled (a), accompanied by a multiple-choice question asking to identify the key of the melody. The solution provides a step-by-step analysis of the key signature, accidentals, and tonal center to determine the correct answer.
### Components/Axes
- **Musical Staff**:
- Contains a treble clef and a key signature with **one sharp** (F#).
- Notes include a **D# accidental**.
- **Question Text**:
- "Choose the correct key of each of the melody."
- Options:
- A. B major
- B. E major
- C. A minor
- D. E minor
- **Solution Steps**:
1. **Key Signature**: One sharp (F#) → corresponds to **G major** or **E minor**.
2. **Accidentals**: Presence of **D#** (leading tone in E minor).
3. **Tonal Center**: Melody revolves around **E** as the tonic.
4. **Final Answer**: **D. E minor**.
### Detailed Analysis
- **Key Signature**:
- One sharp (F#) eliminates B major (which has five sharps) and E major (four sharps).
- Remaining candidates: G major or E minor.
- **Accidentals**:
- D# is the leading tone in E minor (scale degree 7).
- This accidental strongly suggests the key is **E minor**.
- **Tonal Center**:
- The melody emphasizes E as the tonic, resolving to E minor.
### Key Observations
- The combination of **F#** in the key signature and **D#** as an accidental confirms **E minor** as the key.
- No other options (B major, E major, A minor) align with the key signature and accidental usage.
### Interpretation
The analysis demonstrates how key signatures, accidentals, and tonal center collectively determine a melody’s key. The presence of F# (key signature) and D# (leading tone) eliminates major keys and non-E minor candidates. The tonal focus on E solidifies **E minor** as the correct answer. This reflects foundational music theory principles, where key signatures and accidentals work in tandem to establish harmonic identity.
**Correct Answer: D. E minor**
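The elimination step follows directly from the circle of fifths. A small lookup sketch (the table maps sharp counts to the standard major key and its relative minor):

```python
# Major keys by number of sharps, with relative minors (standard music theory).
KEYS_BY_SHARPS = {
    0: ("C major", "A minor"),
    1: ("G major", "E minor"),
    2: ("D major", "B minor"),
    3: ("A major", "F# minor"),
    4: ("E major", "C# minor"),
    5: ("B major", "G# minor"),
}

def candidate_keys(n_sharps):
    return KEYS_BY_SHARPS[n_sharps]

# One sharp leaves G major / E minor; a D# accidental (the leading tone
# of E minor) and a tonal centre on E then select the relative minor.
major, minor = candidate_keys(1)
print(major, "/", minor)  # G major / E minor
```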
</details>
(b)
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Graph: Phillips Curve Illustration with Economic Movement Analysis
### Overview
The image depicts a Phillips Curve graph showing the inverse relationship between inflation and unemployment. It includes two curves (Curve 1 and Curve 2), five labeled points (A, B, C, D, E), and a textual solution explaining economic movement along the curve due to changes in money supply growth. The graph demonstrates short-run vs. long-run economic behavior according to Phillips Curve theory.
### Components/Axes
- **Y-axis**: Inflation Rate (%)
- Scale: 0 to 10 in increments of 1
- Labels: "Inflation Rate (%)" at top
- **X-axis**: Unemployment Rate (%)
- Scale: 0 to 10 in increments of 1
- Labels: "Unemployment Rate (%)" at bottom
- **Curves**:
- **Curve 1**: Red label at top-left, descending line from (0,10) to (10,0)
- **Curve 2**: Black label at top-left, descending line from (0,10) to (10,0)
- **Points**:
- **A**: (4%, 5%) on Curve 1
- **B**: (6%, 3%) on Curve 1
- **C**: (6%, 1%) on Curve 2
- **D**: (5%, 6%) on Curve 1
- **E**: (3%, 2%) on Curve 1
### Detailed Analysis
1. **Curve 1** (Red):
- Represents the short-run Phillips Curve
- Slope: -1 (linear decline)
- Key points: A (4%,5%), B (6%,3%), D (5%,6%), E (3%,2%)
2. **Curve 2** (Black):
- Represents the long-run Phillips Curve (vertical at NAIRU)
- Slope: Vertical line at 6% unemployment (NAIRU)
- Key point: C (6%,1%)
3. **Solution Steps**:
- **Step 0**: Correct (Question setup)
- **Step 1**: Correct (Initial point B at 6% unemployment, 3% inflation)
- **Step 2**: Correct (Increased money supply → higher inflation, same unemployment)
- **Step 3**: Correct (Movement along Curve 1 to higher inflation)
- **Step 4**: Incorrect (Unemployment remains at NAIRU in long run)
- **Step 5**: Neutral (Correct conclusion: Movement to Curve 2 at point C)
- **Step 6**: Incorrect (Unemployment does not change in long run)
- **Step 7**: Incorrect (Final answer should be C, not A)
### Key Observations
- **NAIRU Confirmation**: Curve 2's vertical position at 6% unemployment matches the Non-Accelerating Inflation Rate of Unemployment (NAIRU).
- **Inflation-Unemployment Tradeoff**: Short-run movement along Curve 1 shows inverse relationship (e.g., B→C movement increases inflation while maintaining unemployment).
- **Long-Run Neutrality**: Curve 2's vertical orientation reflects long-run unemployment neutrality, where inflation changes without affecting unemployment.
- **Solution Errors**: Steps 4, 6, and 7 incorrectly suggest unemployment changes in the long run, contradicting Phillips Curve theory.
### Interpretation
The graph illustrates the **Phillips Curve framework** and its implications for monetary policy:
1. **Short-Run Dynamics**: Increased money supply growth (Step 2) shifts the economy along Curve 1 to higher inflation (e.g., B→C movement) without changing unemployment.
2. **Long-Run Adjustment**: The economy eventually returns to NAIRU (Curve 2) with higher inflation but unchanged unemployment (Step 5).
3. **Policy Misconceptions**: The incorrect steps (4,6,7) reflect common errors in assuming unemployment adjusts in the long run, highlighting the importance of distinguishing short-run vs. long-run effects.
This analysis aligns with **monetarist theory**, emphasizing that money supply changes primarily affect inflation in the long run while unemployment reverts to its natural rate. The graph serves as a visual aid for understanding inflation-unemployment tradeoffs and the limitations of short-term policy interventions.
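The short-run versus long-run behavior described above is conventionally written as an expectations-augmented Phillips curve (a standard textbook form, added here for context; it is not printed in the figure):

```latex
% Short run: inflation responds to the unemployment gap
\pi = \pi^{e} - \beta\,(u - u_{n}), \qquad \beta > 0
% Long run: expectations catch up, \pi = \pi^{e}, which forces
u = u_{n} \quad \text{(unemployment returns to NAIRU, here } u_{n} = 6\%)
```

Setting π = π^e recovers the vertical long-run curve at u_n, which is exactly why steps claiming long-run changes in unemployment are marked incorrect.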
</details>
(c)
<details>
<summary>x17.png Details</summary>

### Visual Description
## Screenshot: MathVision Data Example
### Overview
The image displays a math problem from MathVision, focusing on digit arrangement to maximize the sum of two numbers. The problem involves placing digits 2, 3, 4, and 5 into squares, forming two numbers, and determining the largest possible sum. The solution includes step-by-step reasoning with correctness annotations.
### Components/Axes
- **Title**: "Data Example from MathVision" (top-left, bold).
- **Question Section**:
- Text: "Question: Hint: Please answer the question and provide the final answer at the end. Question: Each of the digits 2, 3, 4 and 5 will be placed in a square. Then there will be two numbers, which will be added together. What is the biggest number that they could make?"
- **Solution Section**:
- Steps labeled with `# Step-X (Correct/Incorrect)`:
- `# Step-0 (Correct)`: "To maximize the sum of the two numbers, we should form the largest possible numbers by arranging the digits in descending order."
- `# Step-1 (Incorrect)`: "1. Arrange the digits (2, 3, 4, 5) in descending order: (5, 4, 3, 2)."
- `# Step-2 (Incorrect)`: "2. Split them into two numbers such that the largest digits are in the tens place. This ensures the largest sum."
- Subtext: "- First number: 54"
- Subtext: "- Second number: 32"
- `# Step-3 (Correct)`: "3. Add the two numbers: 54 + 32 = 86"
- `# Step-4 (Incorrect)`: "Final Answer: **86**"
- **Visual Elements**:
- Four empty squares (two on the left, two on the right) with a "+" symbol between them.
- Formatting: Bold text for the final answer, red/green annotations for correctness.
### Content Details
- **Digits**: 2, 3, 4, 5 (explicitly listed in the question).
- **Steps**:
- Step 0: Correct reasoning about descending order.
- Step 1: Incorrectly formatted backslashes (likely a typo or formatting error).
- Step 2: Incorrect logic about splitting digits: pairing 5 and 4 in the same number (54 and 32) is suboptimal, since placing them in separate tens places (e.g., 52 + 43 = 95) yields a larger sum than 86.
- Step 3: Correct addition of 54 + 32 = 86.
- Step 4: Incorrect final answer formatting (extra `#` symbols).
- **Final Answer**: **86** (highlighted with asterisks).
### Key Observations
- The solution's headline strategy (arrange the digits in descending order) is necessary but not sufficient; how the digits are split into two numbers also matters.
- Step 1 and Step 2 contain formatting errors (backslashes) and logical inaccuracies, respectively.
- The split into 54 and 32 yields 86, which is not the maximum: placing 5 and 4 in separate tens places (e.g., 52 + 43) gives 95, consistent with the Incorrect annotation on the final answer.
- Step 4’s formatting error suggests a possible oversight in the solution’s presentation.
### Interpretation
The problem demonstrates a strategy for maximizing the sum of two numbers by prioritizing larger digits in higher place values. The final answer (86), however, is not maximal: placing 5 and 4 in separate tens places (e.g., 52 + 43 = 95) does better, which is consistent with the Incorrect annotations on Steps 2 and 4:
1. **Formatting Issues**: Backslashes in Step 1 and extra `#` symbols in Step 4 indicate errors in the solution's markup.
2. **Logical Flaw**: Step 2 claims the largest digits are in the tens places, but the split 54/32 puts 3, not 4, in the second tens place; distributing 5 and 4 across the two tens places is what actually maximizes the sum.
3. **Outcome**: The flawed split propagates to the final answer, which the step-level annotations correctly flag as incorrect.
This example highlights the value of step-level checking in digit-placement problems: an answer can look plausible while a single misstep makes it suboptimal.
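The annotation marking the final answer as incorrect can be confirmed by brute force over all digit placements (a quick sketch; exhaustive search over the 24 permutations):

```python
from itertools import permutations

def max_two_number_sum(digits):
    # Place the four digits into two 2-digit numbers and take the best sum.
    return max(10 * a + b + 10 * c + d for a, b, c, d in permutations(digits))

print(max_two_number_sum([2, 3, 4, 5]))  # 95, e.g. 52 + 43, not 86
```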
</details>
(d)
<details>
<summary>x18.png Details</summary>

### Visual Description
## Diagram: MathVision Quadratic Function Analysis
### Overview
The image presents a mathematical problem from MathVision, asking which of five graphs (A-E) does not belong to the same quadratic function. The solution explains that quadratic functions form parabolas, while one graph is a linear function (a straight line).
### Components/Axes
- **Title**: "Data Example from MathVision" (top-left corner).
- **Question Text**:
- "Four of the following five pictures show pieces of the graph of the same quadratic function. Which piece does not belong?"
- Choices: (A) (B) (C) (D) (E).
- **Graphs (A-E)**:
- All graphs are simple line plots with x and y axes.
- No explicit axis labels, scales, or legends are visible in the graphs themselves.
- **Solution Section**:
- Step 0: "To solve this, we analyze the graphs."
- Step 1: "A quadratic function has a parabolic shape, either opening upwards or downwards."
- Step 2: Analysis of each option:
- **A**: Curve increasing left to right (upward-opening parabola).
- **B**: Curve decreasing left to right (downward-opening parabola).
- **C**: Straight line (not a parabola).
- **D**: Curve increasing left to right (upward-opening parabola).
- **E**: Curve decreasing left to right (downward-opening parabola).
- Step 3: Conclusion: "The graph **C** does not belong because it is not a parabola."
- Step 4: Final answer: **C**.
### Detailed Analysis
- **Graph Shapes**:
- **A, D**: Parabolic curves opening upwards (consistent with quadratic functions).
- **B, E**: Parabolic curves opening downwards (consistent with quadratic functions).
- **C**: Linear function (straight line), inconsistent with quadratic functions.
- **Textual Clues**:
- The solution explicitly states that quadratic functions must be parabolic, ruling out linear functions.
- No numerical data points or axis markers are provided in the graphs.
### Key Observations
1. **Graph C** is the only non-parabolic shape, making it the outlier.
2. All other graphs (A, B, D, E) align with quadratic function properties (parabolic curves).
3. The solution relies on geometric properties rather than numerical data.
### Interpretation
The problem tests recognition of quadratic function characteristics. Quadratic functions are defined by their parabolic shape, which can open upward or downward depending on the coefficient of the squared term. The linear function (Graph C) lacks this curvature, confirming it does not belong to the same family. This highlights the importance of visual pattern recognition in algebraic problem-solving.
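The visual test "parabola versus straight line" has a simple numerical counterpart: on evenly spaced x-values, a quadratic has a constant nonzero second difference while a line's second differences are all zero. A small sketch:

```python
def second_differences(ys):
    d1 = [b - a for a, b in zip(ys, ys[1:])]
    return [b - a for a, b in zip(d1, d1[1:])]

def classify(ys, eps=1e-9):
    sd = second_differences(ys)
    if all(abs(v) < eps for v in sd):
        return "linear"
    if max(sd) - min(sd) < eps and abs(sd[0]) > eps:
        return "quadratic"
    return "other"

quad = [x * x for x in range(-3, 4)]      # samples of y = x^2
line = [2 * x + 1 for x in range(-3, 4)]  # samples of y = 2x + 1
print(classify(quad), classify(line))  # quadratic linear
```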
</details>
(e)
<details>
<summary>x19.png Details</summary>

### Visual Description
## Diagram: Fish Alignment Puzzle from MathVision
### Overview
The image depicts a tangled line with colored fish (yellow, blue, green, pink) positioned along it, and a ring at one end. The question asks how many fish will have their heads pointing toward the ring when the line is straightened. The solution steps are provided but contain inconsistencies.
### Components/Axes
- **Visual Elements**:
- **Line**: A tangled, looping path with no clear direction.
- **Fish**: Colored fish (yellow, blue, green, pink) attached to the line, oriented in various directions.
- **Ring**: A circular object at the end of the line, serving as the target for fish alignment.
- **Textual Elements**:
- **Question**: "How many fish will have their heads pointing towards the ring when we straighten the line?"
- **Hint**: "Please answer the question and provide the final answer at the end."
- **Solution Steps**: Seven steps (Step-0 to Step-6), annotated as Neutral (Steps 0 and 2), Correct (Step 1), or Incorrect (Steps 3–6).
### Detailed Analysis
1. **Step-0 (Neutral)**: Correctly identifies the goal: determining fish alignment after straightening the line.
2. **Step-1 (Correct)**: Accurately states the need to identify fish already pointing toward the ring in the current configuration.
3. **Step-2 (Neutral)**: Correctly advises visualizing the straightened line to predict alignment.
4. **Step-3 (Incorrect)**: Claims only the green fish on the left points toward the ring in the current configuration. This is **partially incorrect**—other fish (e.g., pink fish near the ring) may also align.
5. **Step-4 (Incorrect)**: Incorrectly assumes only the green fish aligns in the straightened line. The reasoning fails to account for other fish whose heads might realign.
6. **Step-5 (Incorrect)**: Misinterprets the image, claiming only the far-left green fish aligns. This ignores fish near the ring.
7. **Step-6 (Incorrect)**: The final answer of "3" is inconsistent even with the solution's own (flawed) reasoning, which identified only the green fish on the left as aligned.
### Key Observations
- **Inconsistent Reasoning**: Steps 3–6 contain logical errors, leading to an incorrect final answer.
- **Visual vs. Textual Conflict**: The diagram shows fish positions, but the textual solution misinterprets them.
- **Alignment Logic**: When the line is straightened, only fish whose heads are already aligned with the ring’s position in the tangled line will remain aligned. The green fish on the left is the only one that satisfies this condition.
### Interpretation
The data suggests **only one fish** (the green fish on the left) will point toward the ring when the line is straightened. However, the provided solution incorrectly concludes "3," likely due to flawed reasoning in Steps 3–6. The discrepancy highlights a need to re-evaluate the alignment criteria: fish must have their heads **directly aligned** with the ring’s position in the straightened line, not just pointing generally toward it. The final answer should be **1**, not 3.
</details>
(f)
<details>
<summary>x20.png Details</summary>

### Visual Description
## Diagram: Data Example from MathVerse
### Overview
The image presents a coordinate plane with a shaded region and a solid line. The task is to state the inequality describing the shaded region. The solution steps provided contain errors, but the final answer is correct.
### Components/Axes
- **Axes**:
- Horizontal axis labeled **x** (ranges from -10 to 10).
- Vertical axis labeled **y** (ranges from -10 to 10).
- **Line**:
- Solid line passing through points **(-5, -5)** and **(0, 5)**.
- Equation derived as **y = 2x + 5** (slope = 2, y-intercept = 5).
- **Shaded Region**:
- Area above the line, including the line itself (solid boundary).
### Detailed Analysis
1. **Line Equation**:
- Slope calculated as **m = (5 - (-5)) / (0 - (-5)) = 10 / 5 = 2**.
- Using point-slope form with **(0, 5)**: **y = 2x + 5**.
2. **Inequality**:
- Shaded region is **above** the line, so the inequality is **y ≥ 2x + 5**.
- Solid line indicates **inclusion** (≤ or ≥).
### Key Observations
- The solution steps contain errors:
- **Step 1**: Incorrectly states the shaded region is "above the line" but fails to justify the inequality direction.
- **Step 2**: Correctly calculates the slope and equation but mislabels the process as "incorrect."
- **Step 3**: Incorrectly claims the inequality is **y ≤ 2x + 5** (contradicts the shaded region).
- The final answer **y ≥ 2x + 5** is correct but not properly justified in the steps.
### Interpretation
The diagram demonstrates how to derive an inequality from a graph. The shaded region above the line **y = 2x + 5** corresponds to **y ≥ 2x + 5**. However, the provided solution steps contain logical inconsistencies:
- The slope calculation and line equation are correct, but the inequality direction is misrepresented in Step 3.
- The final answer is accurate but lacks a clear explanation of why the inequality is **≥** (shaded region above the line).
This example highlights the importance of aligning graphical shading with inequality notation and verifying each step for consistency.
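The slope and intercept computation, and the shading check, can be reproduced numerically (a sketch using the two points read from the graph):

```python
def line_through(p, q):
    # Slope-intercept form of the line through two points.
    (x1, y1), (x2, y2) = p, q
    m = (y2 - y1) / (x2 - x1)
    b = y1 - m * x1
    return m, b

m, b = line_through((-5, -5), (0, 5))
print(m, b)  # 2.0 5.0  ->  y = 2x + 5

# The shaded side contains points with y >= 2x + 5; a solid boundary
# means points on the line itself satisfy the inequality with equality.
satisfies = lambda x, y: y >= m * x + b
print(satisfies(0, 10), satisfies(0, 0), satisfies(0, 5))  # True False True
```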
</details>
(g)
<details>
<summary>x21.png Details</summary>

### Visual Description
## Diagram: Geometry Problem with Solution Steps
### Overview
The image depicts a geometry problem involving a circle with center O and points A, B, C, D on its circumference. A tangent CD is drawn at point C, and the task is to determine the degree of angle A. The solution includes a step-by-step explanation with annotations indicating correctness (✅) or incorrectness (❌) of each step.
### Components/Axes
- **Geometry Figure**:
- Circle with center **O**.
- Points labeled **A, B, C, D** on the circumference.
- Tangent **CD** at point **C**.
- Angles marked:
- ∠A = ? (choices: 20°, 25°, 40°, 50°).
- ∠CDB = 50° (inscribed angle subtended by arc BC).
- **Solution Steps**:
- **Step 0**: Correct. States the goal to find ∠A using the property that the central angle is twice the inscribed angle.
- **Step 1**: Correct. Identifies arcs AC and BC as key components.
- **Step 2**: Correct. Notes CD is a tangent at C, and ∠CDB = 50° (tangent-chord angle theorem).
- **Step 3**: ❌ Incorrect. Misapplies the inscribed angle theorem: arc BC fixes ∠CDB, not ∠A.
- **Step 4**: ❌ Incorrect. Concludes ∠A = 50°, carrying Step 3's misinterpretation of the inscribed angle theorem forward.
- **Step 5**: Correct. Final answer: **A: 20°** (correct application of the theorem).
### Detailed Analysis
- **Geometry Principles**:
- The inscribed angle theorem states that an angle subtended by an arc at the circumference is half the central angle subtended by the same arc.
- The tangent-chord angle theorem states that the angle between a tangent and a chord is equal to the inscribed angle subtended by the chord in the alternate segment.
- **Step-by-Step Breakdown**:
1. **Step 0**: Correctly identifies the relationship between central and inscribed angles.
2. **Step 1**: Correctly labels arcs AC and BC as relevant to the problem.
3. **Step 2**: Correctly applies the tangent-chord angle theorem to establish ∠CDB = 50°.
4. **Step 3**: ❌ Conflates the roles of ∠A and ∠CDB, associating ∠A with arc BC even though the two angles are subtended by different arcs.
5. **Step 4**: ❌ Misinterprets the relationship, leading to an incorrect conclusion of ∠A = 50°.
6. **Step 5**: Correctly resolves the problem by recognizing that ∠A is half the central angle subtended by arc BC (20°).
### Key Observations
- **Incorrect Step 3**: The error arises from misapplying the inscribed angle theorem, treating ∠A and ∠CDB as if they were subtended by the same arc.
- **Correct Final Answer**: ∠A = 20° (Step 5) aligns with the tangent-chord angle theorem and inscribed angle properties.
### Interpretation
The problem demonstrates the application of two key circle theorems:
1. **Tangent-Chord Angle Theorem**: ∠CDB = 50° (equal to the inscribed angle subtended by chord BC).
2. **Inscribed Angle Theorem**: ∠A (inscribed angle) is half the central angle subtended by its arc. Reading ∠A off the 100° central angle associated with ∠CDB gives 50° (the incorrect conclusion in Step 4); the correct value is 20° (Step 5).
The error in Step 3 highlights a common misconception: confusing the roles of central and inscribed angles. The correct solution requires recognizing which arc ∠A actually subtends: ∠CDB is an inscribed angle of arc BC, while ∠A belongs to a different arc. The final answer (20°) is obtained by halving the central angle of ∠A's own arc (40°), not the 100° arc associated with ∠CDB.
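The theorem arithmetic above can be sanity-checked numerically; a minimal sketch in plain Python (the helper name is ours):

```python
# Inscribed angle theorem: an inscribed angle is half the central
# angle subtending the same arc.

def inscribed_from_central(central_deg: float) -> float:
    """Inscribed angle (degrees) for a given central angle."""
    return central_deg / 2.0

# The tangent-chord angle (50°) equals the inscribed angle in the
# alternate segment, so the corresponding central angle is 100°.
assert inscribed_from_central(100) == 50
# The correct final answer (∠A = 20°) implies a 40° central angle
# for the arc that ∠A actually subtends.
assert inscribed_from_central(40) == 20
```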
</details>
(h)
<details>
<summary>x22.png Details</summary>

### Visual Description
## Chart/Diagram Type: MathVerse Data Example
### Overview
The image presents a mathematical problem from MathVerse involving quadratic functions. It includes a graph of `f(x) = x² + x - 6`, a question about shared roots between `f(x)` and `g(x)`, and a step-by-step solution with annotations indicating correctness.
### Components/Axes
- **Graph Labels**:
- X-axis: Labeled `x` with grid lines from -7 to 7.
- Y-axis: Labeled `y` with grid lines from -7 to 7.
- Function `f(x) = x² + x - 6` is plotted as a blue parabola opening upward.
- **Legend**:
- `y = g(x)` is indicated in blue (matches the graph's blue line).
- **Text Elements**:
- **Question Section**:
- Text: "f(x)=x² + x - 6. How many roots do the functions have in common, f(x) and g(x)?"
- Choices:
- A: Share all roots.
- B: Share one root but each has another unique root.
- C: Share no roots.
- **Solution Section**:
- Step 0: `Neutral` (reasoning phase).
- Step 1: `Correct` (factoring `f(x)` into `(x+3)(x-2)`, roots at `x = -3` and `x = 2`).
- Step 2: `Incorrect` (misidentifies `g(x)` roots as `x = -3` and `x = 4`).
- Step 3: `Incorrect` (comparison of roots: `f(x)` has `-3` and `2`; `g(x)` has `-3` and `4`).
- Step 4: `Incorrect` (conclusion: shared root `-3`, unique roots `2` and `4`).
- Final Answer: `B` (shared root `-3`, unique roots `2` and `4`).
### Detailed Analysis
- **Graph of `f(x)`**:
- Vertex at `(-0.5, -6.25)` (calculated from `x = -b/(2a)`).
- Roots at `x = -3` and `x = 2` (confirmed via factoring).
- **Solution Steps**:
- Step 1 correctly factors `f(x)` and identifies roots.
- Step 2 incorrectly assumes `g(x)` has roots at `-3` and `4` (no graph provided for `g(x)`).
- Step 3 (labeled Incorrect) compares the roots: `f(x)` has `-3` and `2`, `g(x)` has `-3` and `4`.
- Step 4 (labeled Incorrect) concludes the shared root is `-3`, with unique roots `2` (for `f(x)`) and `4` (for `g(x)`).
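The factoring in Step 1 and the root comparison can be verified with a short sketch; the roots of `g(x)` (-3 and 4) are taken from the solution text, since no graph of `g(x)` is shown:

```python
import math

def quadratic_roots(a: float, b: float, c: float) -> tuple:
    """Real roots of a*x^2 + b*x + c via the quadratic formula."""
    disc = math.sqrt(b * b - 4 * a * c)
    return ((-b - disc) / (2 * a), (-b + disc) / (2 * a))

# f(x) = x^2 + x - 6 factors as (x + 3)(x - 2).
f_roots = set(quadratic_roots(1, 1, -6))
assert f_roots == {-3.0, 2.0}
# Vertex at x = -b/(2a) = -0.5, y = f(-0.5) = -6.25.
assert (-0.5) ** 2 + (-0.5) - 6 == -6.25

# Roots of g(x) as asserted by the solution text.
g_roots = {-3.0, 4.0}
assert f_roots & g_roots == {-3.0}  # exactly one shared root -> option B
```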
### Key Observations
1. The graph of `f(x)` is a standard upward-opening parabola with clear roots at `-3` and `2`.
2. The solution process contains intentional errors (Steps 2–4) to test understanding of root comparison.
3. The final answer (`B`) aligns with the correct interpretation of shared and unique roots.
### Interpretation
The problem demonstrates how to compare roots of quadratic functions. While `f(x)` and `g(x)` share one root (`x = -3`), their other roots differ (`x = 2` for `f(x)` and `x = 4` for `g(x)`), making **Option B** correct. The annotated solution steps likely serve as a pedagogical tool to highlight common misconceptions (e.g., misidentifying roots of `g(x)`). The absence of `g(x)`'s graph emphasizes reliance on algebraic reasoning over visual analysis.
</details>
(i)
<details>
<summary>x23.png Details</summary>

### Visual Description
## Chart/Diagram Type: Protractor Diagram with Angle Measurement
### Overview
The image depicts a technical example from DynaMath, illustrating a protractor diagram with a blue angle measurement. The diagram includes a semicircular protractor, a blue line indicating an angle, and textual steps to solve the problem. The question asks for the measure of the blue angle, and the solution includes steps with annotations (e.g., "Neural," "Incorrect") and a final answer.
### Components/Axes
- **Protractor**:
- Semicircular scale marked from **0° to 180°** in 10° increments.
- Blue line originates from the center (0°) and points to the **40° mark**.
- **Textual Elements**:
- **Question**: "Based on the measurement results shown in the diagram, this blue angle is ()°."
- **Solution Steps**:
- **Step 0 (Neutral)**: "To determine the measure of the blue angle in the diagram, follow these steps:"
- **Step 1 (Incorrect)**:
1. Observe the protractor: The protractor measures angles from 0° to 180°.
2. Identify the measurement: The blue angle is indicated by a line pointing to a specific degree on the protractor.
3. Locate the angle: The blue line points to the number 40° on the protractor.
- **Step 2 (Incorrect)**: "Thus, the measure of the blue angle is 40°."
- **Step 3 (Incorrect)**: "Final answer: 40."
- **Final Answer**: "40" (highlighted in bold).
### Detailed Analysis
- **Protractor Scale**:
- The protractor is labeled with degree markings from **0° to 180°**, with tick marks every 10°. The blue line aligns with the **40° mark**, confirming the angle's measure.
- **Solution Steps**:
- **Step 0** (Neutral): A directive to follow steps, but no explicit calculation.
- **Step 1** (Incorrect): Identifies the protractor's range (0°–180°) and locates the blue line at the 40° mark; despite this, the step carries an "Incorrect" label.
- **Step 2** (Incorrect): Correctly concludes the angle is 40°, but the reasoning is labeled as "Incorrect," suggesting a flaw in the logic (e.g., misinterpretation of the protractor's scale).
- **Step 3** (Incorrect): Repeats the final answer but is also labeled as "Incorrect," indicating a possible inconsistency in the solution process.
- **Color Coding**:
- "Neural" is highlighted in **yellow**, while "Incorrect" is in **red**, likely indicating the model's confidence or errors in the steps.
### Key Observations
1. The blue line in the protractor diagram directly points to **40°**, matching the final answer.
2. The solution steps contain **contradictions**: Steps 1–3 are labeled "Incorrect," yet the stated answer matches the described reading. This suggests a mislabeling of steps, an error in the reasoning process, or that the line should be read on the protractor's other scale (a 40° reading on one scale corresponds to a 140° angle on the other).
3. The protractor's scale is standard, with no anomalies in the markings.
### Interpretation
The diagram demonstrates a basic angle measurement task using a protractor. The blue line's position at **40°** is unambiguous, confirming the angle's measure. However, the solution steps reveal **logical inconsistencies**:
- Step 1 identifies the protractor's range and locates the line at the 40° mark, yet is labeled "Incorrect."
- Step 2 correctly concludes the angle is 40°, but the "Incorrect" label implies a flaw (e.g., misreading the protractor).
- Step 3 repeats the answer but is also labeled "Incorrect," suggesting a possible error in the model's reasoning or formatting.
The example highlights the importance of **accurate step-by-step reasoning** in problem-solving, even when the final answer is correct. The use of color coding ("Neutral" vs. "Incorrect") may indicate the model's confidence or errors in the solution process.
</details>
(j)
<details>
<summary>x24.png Details</summary>

### Visual Description
## Text-Based Data Table and Solution Analysis
### Overview
The image presents a data table listing sales numbers for eight cities, followed by a question about calculating the range of these numbers. A solution attempt is provided, containing labeled steps with some marked as "Incorrect."
### Components/Axes
- **Table Structure**:
- **Headers**: `city`, `Sales number`
- **Rows**: Eight cities with corresponding sales numbers (all values are integers).
- **Question**: "A real estate agent looked into how many houses were sold in different cities. What is the range of the numbers?"
- **Solution Section**:
- **Steps**: Labeled `#Step-0` to `#Step-4`, with `#Step-0` marked as "Neutral" and `#Step-1` to `#Step-4` marked as "Incorrect."
- **Text Content**: Step-by-step explanation of calculating the range, including intermediate values and a final answer.
### Detailed Analysis
#### Table Data
| City | Sales Number |
|-----------------|--------------|
| Briggs Corner | 83 |
| New Humburg | 83 |
| Melville | 86 |
| Fremont | 81 |
| Liberty City | 92 |
| Charles Falls | 84 |
| Pleasant Town | 93 |
| Penny Town | 97 |
#### Solution Steps
1. **Step 0 (Neutral)**: "Let me help you find the range of the sales numbers."
2. **Step 1 (Incorrect)**: "The range is calculated by subtracting the minimum value from the maximum value."
3. **Step 2 (Incorrect)**: "Minimum sales number: 81 (Fremont)\nMaximum sales number: 97 (Penny Town)"
4. **Step 3 (Incorrect)**: "Range = Maximum - Minimum\nRange = 97 - 81 = 16"
5. **Step 4 (Incorrect)**: "Therefore, the range of house sales numbers across these cities is 16."
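The arithmetic in Steps 1–4 can be reproduced directly from the table; a minimal sketch:

```python
# Sales numbers transcribed from the table above.
sales = {
    "Briggs Corner": 83, "New Humburg": 83, "Melville": 86,
    "Fremont": 81, "Liberty City": 92, "Charles Falls": 84,
    "Pleasant Town": 93, "Penny Town": 97,
}

lo, hi = min(sales.values()), max(sales.values())
assert (lo, hi) == (81, 97)      # Fremont, Penny Town
value_range = hi - lo
assert value_range == 16         # matches the final answer in Step 4
```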
### Key Observations
- The table lists sales numbers for eight cities, with values ranging from 81 (Fremont) to 97 (Penny Town).
- The solution correctly identifies the minimum (81) and maximum (97) sales numbers and calculates the range as 16. However, all steps after Step 0 are labeled as "Incorrect," despite the final answer being accurate. This suggests a potential inconsistency in the labeling of steps.
### Interpretation
- **Data Trends**: The sales numbers are clustered between 81 and 97, with no extreme outliers. The range (16) indicates moderate variability in sales across cities.
- **Solution Analysis**: While the mathematical steps (Steps 1–4) are factually correct, their labeling as "Incorrect" is contradictory. This could imply an error in the problem's annotation or a misalignment between the steps and their labels. The final answer (range = 16) is valid based on the provided data.
- **Notable Anomaly**: The labeling of Steps 1–4 as "Incorrect" conflicts with the accuracy of their content. This discrepancy warrants further investigation into the source of the error (e.g., mislabeling, contextual misunderstanding of "range").
</details>
(k)
<details>
<summary>x25.png Details</summary>

### Visual Description
## Radar Chart: Data Example from DynaMath
### Overview
The image displays a radar chart titled "Data Example from DynaMath," which visualizes performance across five categories (A, B, C, D, E) on a scale from 0 to 100. The chart includes a shaded triangular area representing data values, with a legend on the right associating colors with solution steps. The solution text is embedded on the right side of the chart.
---
### Components/Axes
- **Categories (Axes):**
- A, B, C, D, E (labeled clockwise around the radar chart).
- Scale: 0 to 100, with increments of 20 (20, 40, 60, 80, 100).
- **Legend:**
- Step 0: Yellow (`# Step-0 (Neutral)`).
- Steps 1–3: Red (`# Step-1 (Incorrect)`, `# Step-2 (Incorrect)`, `# Step-3 (Incorrect)`).
- **Solution Text:**
- Embedded on the right side, explaining the analysis process.
---
### Detailed Analysis
- **Category Values (Approximate):**
- **C:** Highest value (likely 80–100, based on the shaded area's peak).
- **D:** Second-highest value (likely 60–80).
- **A, B, E:** Lower values (likely below 60, with minimal differentiation).
- **Solution Steps:**
- **Step 0 (Neutral):** Identifies categories and scale.
- **Steps 1–3 (Incorrect):** First identifies C as the second-highest, then revises to D, and finally concludes D is correct; all three steps are labeled Incorrect.
---
### Key Observations
1. **Category C** dominates the radar chart, indicating the highest performance.
2. **Category D** is the second-highest, with values significantly lower than C but higher than A, B, and E.
3. **A, B, and E** cluster near the lower end of the scale, showing minimal performance.
4. The solution text contains logical inconsistencies (e.g., mislabeling categories during analysis).
---
### Interpretation
- The radar chart demonstrates a clear hierarchy: **C > D > A/B/E**.
- The solution process highlights common errors in interpreting radar charts, such as misjudging relative values.
- The color-coded steps (yellow for initial analysis, red for errors) emphasize the importance of systematic evaluation.
- While exact numerical values are not provided, the shaded area’s geometry suggests approximate ranges for each category.
---
### Spatial Grounding
- **Radar Chart:** Left side of the image.
- **Legend:** Right side, aligned vertically.
- **Solution Text:** Right side, below the legend.
- **Color Consistency:** Step 0 (yellow) and Steps 1–3 (red) match the legend labels.
---
### Content Details
- **Text Transcription:**
- **Question:** "The radar chart below represents different categories. Which category is second highest?"
- **Choices:** (A) A, (B) B, (C) C, (D) D, (E) E.
- **Solution:**
- Step 0: Identifies categories and scale.
- Step 1 (labeled Incorrect): Identifies C as the highest category.
- Step 2 (labeled Incorrect): Identifies D as the second-highest.
- Step 3 (labeled Incorrect): Concludes D is the answer.
---
### Final Answer
The second-highest category is **(D) D**, as confirmed by the solution text despite initial missteps in the analysis.
</details>
(l)
<details>
<summary>x26.png Details</summary>

### Visual Description
## Text-Based Problem: Cone Radius Calculation
### Overview
The image presents a mathematical problem from WeMath involving a conical water container. It includes a question, multiple-choice answers, a step-by-step solution, and a labeled diagram of a cone.
### Components/Axes
- **Diagram Labels**:
- Height of the cone: **20 cm** (vertical dashed line).
- Circumference of the base: **62.8 cm** (dashed circle).
- **Text Elements**:
- Question: "Given that the circumference of the upper edge of a conical water container is 62.8 cm, what is the radius of its base circle in cm?"
- Multiple-choice options: A. 8, B. 9, C. 10, D. 10, E. Cannot be determined.
- Solution steps: Numbered 0–6, each with brief explanations.
### Detailed Analysis
1. **Question & Options**:
- The problem asks for the radius of the cone’s base given its circumference (62.8 cm).
- Options include numerical values and an "E. Cannot be determined" choice.
2. **Solution Steps**:
- **Step 0**: Objective to find the radius.
- **Step 1**: Formula for circumference: \( C = 2\pi r \).
- **Step 2**: Substitute \( C = 62.8 \) into the formula: \( 62.8 = 2\pi r \).
- **Step 3**: Solve for \( r \): \( r = \frac{62.8}{2\pi} \).
- **Step 4**: Approximate \( \pi \approx 3.14 \): \( r = \frac{62.8}{6.28} \).
- **Step 5**: Final calculation: \( r = 10 \).
- **Step 6**: Answer: **C. 10**.
3. **Diagram**:
- A side view of a cone with a dashed base circle and labeled height (20 cm).
- The circumference (62.8 cm) is annotated near the base.
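The computation in Steps 1–5 is straightforward to check; a minimal sketch (π ≈ 3.14 reproduces the solution's rounding):

```python
import math

C = 62.8                           # circumference of the base circle, cm

r_approx = C / (2 * 3.14)          # Step 4: with pi approximated as 3.14
assert math.isclose(r_approx, 10.0)

r_exact = C / (2 * math.pi)        # exact formula r = C / (2*pi)
assert abs(r_exact - 10.0) < 0.01  # ~9.995 cm, rounding to 10 cm (option C)
```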
### Key Observations
- The solution correctly applies the circumference formula \( C = 2\pi r \).
- The approximation \( \pi \approx 3.14 \) simplifies the calculation.
- The final answer matches option **C. 10**, confirming the radius is 10 cm.
### Interpretation
The problem demonstrates the application of geometric formulas to real-world scenarios. By isolating the relationship between circumference and radius, the solution logically derives the radius using algebraic manipulation and numerical approximation. The diagram reinforces the problem’s context, ensuring clarity in visualizing the cone’s dimensions. The absence of conflicting data or outliers underscores the straightforward nature of the calculation.
</details>
(m)
<details>
<summary>x27.png Details</summary>

### Visual Description
## Diagram: Quadrilateral ABCD with Axis of Symmetry EF and Parallelogram ABFD Perimeter Calculation
### Overview
The image presents a geometry problem involving a quadrilateral ABCD with an axis of symmetry EF. The task is to calculate the perimeter of parallelogram ABFD given specific side lengths (CD = 5 cm, AD = 3 cm). The diagram includes labeled measurements, a solution process with step-by-step annotations, and multiple-choice answer options.
### Components/Axes
1. **Diagram Elements**:
- Quadrilateral ABCD with axis of symmetry EF (dashed line).
- Labeled side lengths:
- AE = 3 cm, ED = 5 cm (left side of EF).
- BF = 3 cm, FC = 5 cm (right side of EF).
- Parallelogram ABFD highlighted in the solution.
2. **Question Text**:
- **Question**: "As shown in the diagram, EF is the axis of symmetry for quadrilateral ABCD. Given that CD = 5 cm and AD = 3 cm, what is the perimeter of parallelogram ABFD in cm?"
- **Answer Choices**:
- A. 12 cm
- B. 10 cm
- C. 14 cm
- D. 16 cm
- E. No correct answer
3. **Solution Steps**:
- **Step 0**: Correct. Explains the need to use symmetry to find ABFD's perimeter.
- **Step 1**: Correct. States EF divides ABCD into congruent halves.
- **Step 2**: Incorrect. Incorrectly assumes AD = DF = 3 cm without justification.
- **Step 3**: Correct. Uses symmetry to conclude CD = BF = 5 cm.
- **Step 4**: Correct. Lists perimeter formula: AD + DF + BF + AB.
- **Step 5**: Incorrect. Incorrectly substitutes AD = 3 cm, BF = 5 cm, but omits AB.
- **Step 6**: Incorrect. Repeats incorrect perimeter calculation (3+3+5+5=16 cm).
- **Step 7**: Correct. Final answer D (16 cm).
### Detailed Analysis
- **Diagram Labels**:
- AE = 3 cm, ED = 5 cm (left side of EF).
- BF = 3 cm, FC = 5 cm (right side of EF).
- CD = 5 cm (given), AD = 3 cm (given).
- **Solution Logic**:
- Symmetry implies AD = DF = 3 cm and CD = BF = 5 cm.
- Perimeter of ABFD = AD + DF + BF + AB = 3 + 3 + 5 + 5 = 16 cm.
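The perimeter sum can be reproduced in a short sketch; note that AB = 5 cm is our assumption, read off the stated sum 3 + 3 + 5 + 5, since the solution never derives AB explicitly:

```python
# Side lengths (cm) implied by the symmetry argument.
AD = DF = 3      # DF = AD by reflection across EF
CD = BF = 5      # BF = CD by reflection across EF
AB = 5           # assumption: inferred from the stated sum 3 + 3 + 5 + 5

perimeter = AD + DF + BF + AB
assert perimeter == 16   # matches answer choice D
```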
### Key Observations
1. Steps 0, 1, 3, and 7 are marked correct, aligning with the final answer D (16 cm).
2. Steps 2, 4, 5, and 6 contain errors:
- Step 2 incorrectly assumes AD = DF without proof.
- Step 5 omits AB in the perimeter calculation.
- Step 6 is labeled incorrect even though its sum (3 + 3 + 5 + 5 = 16 cm) matches the final answer.
### Interpretation
The problem hinges on understanding symmetry in quadrilaterals. EF as the axis of symmetry ensures congruent halves, making AD = DF and CD = BF. The perimeter calculation requires summing all sides of ABFD, which are derived from the symmetric properties. The correct answer (D: 16 cm) is validated by Steps 0, 1, 3, and 7, while errors in other steps highlight common misconceptions about symmetry and perimeter formulas.
</details>
(n)
<details>
<summary>x28.png Details</summary>

### Visual Description
## Diagram: Educational Question with Solution Flowchart
### Overview
The image depicts a cartoon-style educational diagram featuring two desks with stacked books, a girl walking between them, and a textual question/solution pair. The diagram is labeled "Data Example from WeMath" and includes a question about identifying the color of the bottommost book on "Desk 2" (the right-hand desk in the diagram).
### Components/Axes
- **Desks**:
- Desk 1 (left): Stacked books in order (top to bottom): Green, Red, Blue.
- Desk 2 (right): Stacked books in order (top to bottom): Green, Red, Blue.
- **Girl**: Standing between desks, wearing an orange apron and yellow skirt.
- **Question Text**:
- "As shown in the diagram, what is the color of the bottommost book placed on Desk 2?"
- Options: A. Red, B. Green, C. Blue, D. Yellow, E. No correct answer.
- **Solution Text**:
- Step 0 (Neutral): Visual analysis of Desk 2’s book stack.
- Step 1 (Correct): Topmost book = Green.
- Step 2 (Correct): Second book = Red.
- Step 3 (Correct): Bottommost book = Blue.
- Final Answer: **C. Blue**.
### Detailed Analysis
- **Book Stack Order**:
- Desk 1 and Desk 2 have identical book stacks: Green (top), Red (middle), Blue (bottom).
- Colors are explicitly labeled in the solution steps with **bold formatting**.
- **Question Structure**:
- The question references "Desk 2," the right-hand desk in the diagram.
- Options include a distractor (E: No correct answer), but the solution confirms a valid answer exists.
### Key Observations
1. The bottommost book on Desk 2 is **Blue**, as stated in Step 3 of the solution.
2. The solution uses a stepwise "Neutral" → "Correct" progression to validate the answer.
3. The diagram’s visual layout mirrors the textual solution, reinforcing the answer through spatial and textual alignment.
### Interpretation
The diagram demonstrates a pedagogical approach to problem-solving, where visual and textual elements are cross-referenced to arrive at a conclusion. The consistent book stack order across desks simplifies the task, emphasizing logical deduction over complex pattern recognition. The use of **bold formatting** in the solution steps highlights critical data points, guiding the learner’s focus. The final answer (**C. Blue**) is unambiguously supported by both the diagram and the textual explanation, leaving no room for ambiguity.
</details>
(o)
Figure 7: More data examples from VisualProcessBench.
<details>
<summary>x29.png Details</summary>

### Visual Description
## Screenshot: Geometry Problem-Solving Process
### Overview
This image shows a step-by-step solution to a geometry problem involving an isosceles triangle (triangle ABC) with labeled angles. The problem asks for the measure of an exterior angle (angle 1), with multiple-choice options provided. The solution process includes reasoning, errors, and corrections.
### Components/Axes
- **Diagram**: Triangle ABC with labeled angles:
- ∠B = 25°
- ∠A = 130°
- ∠C = 25°
- Exterior angle 1 = 35° (labeled near point C)
- **Text**:
- **Question**: "As shown in the figure, triangle ABC is an isosceles triangle, ∠B = 25°, then the angle of ∠1 is ()"
- **Options**: A. 95°, B. 145°, C. 120°, D. No correct answer
- **Solution Steps**: Numbered from Step 0 to Step 12, with annotations like `(Neutral)`, `(Correct)`, and `(Incorrect)` in parentheses.
### Detailed Analysis
#### Diagram Labels
- Triangle ABC is isosceles with ∠B = 25° and ∠C = 25° (marked with identical arcs).
- ∠A = 130° is labeled at vertex A.
- Exterior angle 1 is labeled as 35° near point C, adjacent to a straight line.
#### Solution Steps
1. **Step 0 (Neutral)**: Problem restatement.
2. **Step 1 (Correct)**: Recalls isosceles triangle properties (equal sides ↔ equal opposite angles).
3. **Step 4 (Incorrect)**: Misapplies the exterior angle theorem, calculating ∠1 = 155° (130° + 25°) without accounting for the adjacent 35° angle.
4. **Step 5 (Correct)**: Questions the validity of options, noting discrepancies.
5. **Step 12 (Correct)**: Final answer: ∠1 = 120° (155° - 35°), matching option C.
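The angle arithmetic in the steps above can be checked directly; a minimal sketch using the labeled values:

```python
# Angles in degrees, as labeled in the diagram.
angle_B = angle_C = 25                 # base angles of the isosceles triangle
angle_A = 180 - angle_B - angle_C      # angle sum of a triangle
assert angle_A == 130

exterior = angle_A + angle_B           # exterior angle theorem: 155°
assert exterior == 155
adjacent = 35                          # adjacent angle on the straight line
assert exterior - adjacent == 120      # final answer, option C
```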
### Key Observations
- The diagram’s angles (25°, 130°, 25°) sum to 180°, confirming triangle validity.
- Step 4’s error arises from misidentifying the exterior angle’s relationship to adjacent angles.
- The final answer (120°) follows from the exterior angle theorem plus a correction: ∠A + ∠B = 130° + 25° = 155°, and subtracting the adjacent 35° angle yields 120°.
### Interpretation
The problem tests understanding of:
1. **Isosceles triangle properties**: Equal angles opposite equal sides.
2. **Exterior angle theorem**: An exterior angle equals the sum of the two non-adjacent interior angles.
3. **Angle subtraction**: Adjusting for adjacent angles on a straight line.
The error in Step 4 highlights a common misconception about exterior angles. The correct solution (Step 12) resolves this by subtracting the adjacent 35° angle from the total exterior angle (155°), yielding 120°. The figure’s accuracy is critical, as mislabeling angles (e.g., ∠C as 25°) directly impacts the calculation.
**Final Answer**: Option C (120°).
</details>
Figure 8: Data example with model reflection from VisualProcessBench. Red highlights the incorrect answer, orange highlights the reflection words, and green highlights the correct answer.