arXiv:2505.20241
# DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
**Authors**:
- Qi Cao (University of California, San Diego)
- Ruiyi Wang (University of California, San Diego)
- Ruiyi Zhang (University of California, San Diego)
- Sai Ashish Somayajula (University of California, San Diego)
- Pengtao Xie (University of California, San Diego)
## Abstract
Reasoning has substantially improved the performance of large language models (LLMs) on complicated tasks. Central to current reasoning studies, Process Reward Models (PRMs) offer a fine-grained evaluation of intermediate reasoning steps and guide the reasoning process. However, extending PRMs to multimodal large language models (MLLMs) introduces challenges. Since multimodal reasoning covers a wider range of tasks than text-only scenarios, the resulting distribution shift from the training to testing sets is more severe, leading to greater generalization difficulty. Training a reliable multimodal PRM therefore demands large and diverse datasets to ensure sufficient coverage. However, current multimodal reasoning datasets suffer from a marked quality imbalance, which degrades PRM performance and highlights the need for an effective data selection strategy. To address these issues, we introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs that employs bi-level optimization. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights, allowing the PRM to prioritize high-quality reasoning signals and alleviating the impact of dataset quality imbalance. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset; this feedback updates the domain weights through an aggregation loss function, thereby improving the generalization capability of the trained PRM. Extensive experiments on multiple multimodal reasoning benchmarks covering both mathematical and general reasoning show that test-time scaling with DreamPRM consistently improves the performance of state-of-the-art MLLMs. Further comparisons reveal that DreamPRM's domain-reweighting strategy surpasses other data selection methods and yields higher accuracy gains than existing test-time scaling approaches.
Notably, DreamPRM achieves a top-1 accuracy of 85.2% on the MathVista leaderboard using the o4-mini model, demonstrating its strong generalization in complex multimodal reasoning tasks.
Project Page: https://github.com/coder-qicao/DreamPRM
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Composite Technical Figure: DreamPRM Performance and Dataset Examples
### Overview
The image is a composite technical figure divided into two main sections. The left section contains a bar chart quantifying the performance improvement of a method called "DreamPRM" compared to a baseline ("PRM w/o data selection") across five different datasets. The right section consists of two vertically stacked panels, each presenting an example question from a specific dataset (AI2D and M3CoT) along with associated metadata and analysis determined by DreamPRM.
### Components/Axes
**Left Chart:**
* **Chart Type:** Grouped bar chart.
* **Y-Axis:** Label: "Accuracy Improvement (%)". Scale: 0 to 7, with major ticks at 0, 1, 2, 3, 4, 5, 6, 7.
* **X-Axis:** Lists five datasets: "WeMath", "MMVet", "MathVista", "MMStar", "MathVision".
* **Legend:** Located in the top-left corner.
* Blue bar: "DreamPRM"
* Yellow bar: "PRM w/o data selection"
* **Additional Annotation:** A horizontal dashed line at y=4.0, labeled "avg. = +4.0".
**Right Panels (Top and Bottom):**
* Each panel is a self-contained box with a white background and black border.
* **Top Panel (AI2D Example):**
* **Image (Top-Left):** A black-and-white diagram of a simple aquatic food chain: Sun → Phytoplankton → Zooplankton → Small Fish → Large Fish → Bird (Eagle/Hawk).
* **Text Block (Right of Image):**
* **Question:** "What does the bird feed on?"
* **Choices:** "A. zooplankton", "B. grass", "C. predator fish", "D. none of the above"
* **Answer:** "C"
* **Dataset:** "AIZD (2016)"
* **Metadata Block (Below Image and Question):**
* "Dataset difficulty: easy (InternVL-2.5-MPO-8B's accuracy 84.6%)"
* "Unnecessary modality: can answer without image"
* "Requirements for reasoning: do not require complicated reasoning"
* "Domain weight: 0.55 (Determined by DreamPRM)" (This line is in blue text).
* **Bottom Panel (M3CoT Example):**
* **Image (Top-Left):** A color photograph of a white and grey bird (likely a gull) in flight against a blue sky. Below it are four smaller thumbnail images of other animals.
* **Text Block (Right of Image):**
* **Question:** "Determine the scientific nomenclature of the organism shown in the primary image."
* **Choices:** "A. Hemidactylus turcicus", "B. Felis silvestris", "C. Macropus agilis", "D. None of the above"
* **Answer:** "D"
* **Dataset:** "M3CoT (2024)"
* **Metadata Block (Below Image and Question):**
* "Dataset difficulty: hard (InternVL-2.5-MPO-8B's accuracy 62.1%)"
* "Unnecessary modality: cannot answer without image"
* "Requirements for reasoning: require complicated reasoning"
* "Domain weight: 1.49 (Determined by DreamPRM)" (This line is in blue text).
### Detailed Analysis
**Left Chart - Data Points:**
The chart shows the percentage improvement in accuracy for DreamPRM (blue) versus the baseline without data selection (yellow) for each dataset.
1. **WeMath:**
* DreamPRM (Blue): +5.7%
* PRM w/o data selection (Yellow): +2.5%
2. **MMVet:**
* DreamPRM (Blue): +5.5%
* PRM w/o data selection (Yellow): +3.0%
3. **MathVista:**
* DreamPRM (Blue): +3.5%
* PRM w/o data selection (Yellow): +1.8%
4. **MMStar:**
* DreamPRM (Blue): +3.4%
* PRM w/o data selection (Yellow): +1.9%
5. **MathVision:**
* DreamPRM (Blue): +1.7%
* PRM w/o data selection (Yellow): +0.2%
* **Average Line:** The dashed line indicates the average improvement across all datasets for DreamPRM is +4.0%.
**Right Panels - Content Details:**
* **AI2D (2016) Example:** The question tests basic understanding of a food chain diagram. The correct answer (C. predator fish) can be deduced from the diagram's arrows without needing complex reasoning. DreamPRM assigns it a low "Domain weight" of 0.55, correlating with its "easy" difficulty rating.
* **M3CoT (2024) Example:** The question requires identifying the scientific name of a specific bird from a photograph, a task requiring specialized knowledge and visual analysis. The correct answer is "D. None of the above" (as the bird is a gull, not a gecko, cat, or kangaroo). DreamPRM assigns it a high "Domain weight" of 1.49, correlating with its "hard" difficulty rating and the stated requirement for "complicated reasoning."
### Key Observations
1. **Consistent Superiority:** DreamPRM (blue bars) shows a higher accuracy improvement than the baseline (yellow bars) across all five datasets.
2. **Magnitude of Improvement:** The performance gap is largest on the "WeMath" (+3.2% difference) and "MMVet" (+2.5% difference) datasets. The gap narrows for the other three datasets.
3. **Dataset Difficulty Spectrum:** The two example panels illustrate a clear contrast. The AI2D task is labeled "easy" with high model accuracy (84.6%) and low domain weight (0.55). The M3CoT task is labeled "hard" with lower model accuracy (62.1%) and high domain weight (1.49).
4. **Modality Relevance:** The metadata explicitly states when the image is unnecessary ("can answer without image") versus essential ("cannot answer without image") for solving the problem.
5. **Domain Weight as a Metric:** The "Domain weight" value, determined by DreamPRM, appears to be a quantitative measure that aligns with the qualitative difficulty and reasoning requirements of a task.
### Interpretation
This composite figure serves a dual purpose: demonstrating the efficacy of the DreamPRM method and illustrating its analytical capabilities on diverse multimodal reasoning tasks.
* **Performance Validation:** The bar chart provides empirical evidence that DreamPRM enhances model accuracy more effectively than a baseline approach that lacks its data selection mechanism. The consistent outperformance suggests the method is robust across different types of mathematical and visual reasoning benchmarks (WeMath, MMVet, etc.).
* **Analytical Insight:** The right-hand panels showcase how DreamPRM can be used to *characterize* datasets. It doesn't just solve problems; it analyzes them to assign metadata like difficulty, modality necessity, reasoning requirements, and a novel "Domain weight." This weight seems to function as a proxy for task complexity or the degree of specialized knowledge required.
* **Underlying Principle:** The contrast between the two examples suggests DreamPRM's core function may involve intelligently selecting or weighting training data based on these characterized properties. By assigning higher "domain weight" to hard, image-dependent, reasoning-intensive tasks (like M3CoT), the system likely prioritizes learning from such challenging examples, leading to the overall accuracy improvements seen in the chart. The figure argues that effective data selection (the difference between the blue and yellow bars) is key to improving performance on complex multimodal reasoning.
</details>
Figure 1: DreamPRM improves multimodal reasoning by mitigating the dataset quality imbalance problem. Left: On five benchmarks, DreamPRM outperforms the base model (InternVL-2.5-8B-MPO [67]) by an average of +4.0%. DreamPRM also consistently surpasses the vanilla PRM trained without data selection. Right: Easy AI2D [23] questions (weight 0.55) vs. hard M3CoT [6] questions (weight 1.49) show how DreamPRM prioritizes data that demand deeper reasoning - samples requiring knowledge from both textual and visual modalities for step-by-step logical deduction.
Reasoning [55] has significantly enhanced the logical and critical thinking capabilities of large language models (LLMs) [2, 8, 59, 49]. Post-training [45, 10] and test-time scaling strategies [44] enable sophisticated reasoning behaviors in LLMs and extend the length of Chain-of-Thoughts (CoTs) [71], thereby achieving strong results on challenging benchmarks [80, 47]. A key component of these advances is Process Reward Models (PRMs) [29, 27], which provide fine-grained, step-wise supervision of the reasoning process and reliable selection of high-quality reasoning trajectories. These developments have proven highly effective at improving the performance of LLMs on complex tasks [38, 61].
Given this success with LLMs, a natural extension is to apply PRMs to multimodal large language models (MLLMs) [72, 28] to enhance their reasoning abilities. Early studies of multimodal PRMs demonstrate promising results, yet substantial challenges persist. Unlike the text-only inputs of LLMs, MLLMs must combine diverse visual and language signals: a high-dimensional, continuous image space coupled with discrete language tokens. This fusion dramatically broadens the input manifold and leads to more severe distribution shifts [56] from training to testing distributions. Consequently, directly applying PRM training strategies from the text domain [69, 37] underperforms, mainly due to the decreased generalizability [11] caused by insufficient coverage of the multimodal input space.
A straightforward solution to this problem is to combine multiple datasets that emphasize different multimodal reasoning skills, thereby enlarging the sampling space. However, quality imbalance among existing multimodal reasoning datasets is more severe than in text-only settings: many contain noisy inputs such as unnecessary modalities [78] or questions of negligible difficulty [33], as illustrated in Fig. 1. Since these easy datasets contribute little to effective sampling, overweighting them during training can substantially degrade PRM performance. Therefore, an effective data selection strategy that filters out unreliable datasets and instances is crucial to training a high-quality multimodal PRM.
To overcome these challenges, we propose DreamPRM, a domain-reweighted training framework for multimodal PRMs. Inspired by domain-reweighting techniques [53, 12, 57], DreamPRM dynamically learns appropriate weights for each multimodal reasoning dataset, allowing them to contribute unequally during training. Datasets that contain many noisy samples tend to receive lower domain weights, reducing their influence on PRM parameter updates. Conversely, high-quality datasets are assigned higher weights and thus play a more important role in optimization. This domain-reweighting strategy alleviates the issue of dataset quality imbalance. DreamPRM adopts a bi-level optimization (BLO) framework [14, 31] to jointly learn the domain weights and PRM parameters. At the lower level, the PRM parameters are optimized with Monte Carlo signals on multiple training domains under different domain weights. At the upper level, the optimized PRM is evaluated on a separate meta domain to compute a novel aggregation function loss, which is used to optimize the domain weights. Extensive experiments on a wide range of multimodal reasoning benchmarks verify the effectiveness of DreamPRM.
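As a minimal illustration of the lower-level objective, the per-domain PRM losses can be combined under learnable domain weights. The function and its inputs below are simplified stand-ins for exposition, not the paper's implementation:

```python
def weighted_prm_loss(per_domain_losses, domain_weights):
    """Lower-level objective sketch: each training domain's PRM loss is
    scaled by its domain weight, so noisy domains with small weights
    contribute less to the parameter update."""
    assert len(per_domain_losses) == len(domain_weights)
    return sum(w * l for w, l in zip(domain_weights, per_domain_losses))
```

With weights like those in Fig. 1 (0.55 for AI2D, 1.49 for M3CoT), the harder dataset dominates the gradient even when both domains incur similar raw losses.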
Our contributions are summarized as follows:
- We propose DreamPRM, a domain-reweighted multimodal process reward model training framework that dynamically adjusts the importance of different training domains. We formulate the training process of DreamPRM as a bi-level optimization (BLO) problem, where the lower level optimizes the PRM via domain-reweighted fine-tuning, and the upper level optimizes domain weights with an aggregation function loss. Our method helps address the dataset quality imbalance issue in multimodal reasoning and improves the generalization ability of the PRM.
- We conduct extensive experiments with DreamPRM on a wide range of multimodal reasoning benchmarks. Results indicate that DreamPRM consistently surpasses PRM baselines built with other data selection strategies, confirming the effectiveness of its bi-level-optimization-based domain-reweighting strategy. Notably, DreamPRM achieves a top-1 accuracy of 85.2% on the MathVista leaderboard using the o4-mini model, demonstrating its strong generalization in complex multimodal reasoning tasks. Carefully designed evaluations further demonstrate that DreamPRM possesses both scaling capability and the ability to generalize to stronger models.
## 2 Related Works
#### Multimodal reasoning
Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning [70, 25, 81] into LLMs encourages a step-by-step approach, thereby significantly enhancing question-answering performance. However, it has been reported that CoT prompting cannot be easily extended to MLLMs, mainly due to hallucinated outputs during the reasoning process [67, 82, 19]. Therefore, several post-training methods have been proposed for enhancing the reasoning capability of MLLMs. InternVL-MPO [67] proposes a mixed preference optimization that jointly optimizes preference ranking, response quality, and response generation losses to improve reasoning abilities. Llava-CoT [74] creates a structured thinking fine-tuning dataset that enables the MLLM to perform systematic step-by-step reasoning. Some efforts have also targeted inference-time scaling. RLAIF-V [77] proposes a novel self-feedback guidance for inference-time scaling and devises a simple length-normalization strategy to tackle the bias towards shorter responses. AR-MCTS [11] combines Monte Carlo Tree Search (MCTS) and Retrieval-Augmented Generation (RAG) to guide the MLLM search step by step and explore the answer space.
#### Process reward model
The Process Reward Model (PRM) [29, 27, 38, 61] provides finer-grained verification than the Outcome Reward Model (ORM) [9, 52], scoring each step of the reasoning trajectory. However, a central challenge in designing PRMs is obtaining process supervision signals, which require supervised labels for each reasoning step. Current approaches typically depend on costly, labor-intensive human annotation [29], highlighting the need for automated methods to improve scalability and efficiency. Math-Shepherd [64] proposes a method utilizing Monte Carlo estimation to provide hard labels and soft labels for automatic process supervision. OmegaPRM [37] employs Monte Carlo Tree Search (MCTS) for finer-grained exploration and automatic labeling. MiPS [69] further explores the Monte Carlo estimation method and studies the aggregation of PRM signals.
#### Domain-reweighting
Domain reweighting methodologies are developed to modulate the influence of individual data domains, thereby enabling models to achieve robust generalization. Recently, domain reweighting has emerged as a key component in large language model pre-training, where corpora are drawn from heterogeneous sources. DoReMi [73] trains a lightweight proxy model with group distributionally robust optimization to assign domain weights that maximize excess loss relative to a reference model. DOGE [13] proposes a first-order bi-level optimization framework, using gradient alignment between source and target domains to update mixture weights online during training. Complementary to these optimization-based approaches, Data Mixing Laws [76] derives scaling laws that could predict performance under different domain mixtures, enabling low-cost searches for near-optimal weights without proxy models. In this paper, we extend these ideas to process supervision and introduce a novel bi-level domain-reweighting framework.
## 3 Problem Setting and Preliminaries
#### Notations.
Let $\mathcal{I}$, $\mathcal{T}$, and $\mathcal{Y}$ denote the multimodal input space (images), textual instruction space, and response space, respectively. A multimodal large language model (MLLM) is formalized as a parametric mapping $M_\theta:\mathcal{T}\times\mathcal{I}\to\Delta(\mathcal{Y})$, where $\hat{y}\sim M_\theta(\cdot\mid x)$ represents the stochastic generation of responses conditioned on the input pair $x=(t,I)$, consisting of a visual input $I\in\mathcal{I}$ and a textual instruction $t\in\mathcal{T}$, with $\Delta(\mathcal{Y})$ denoting the probability simplex over the response space. We use $y\in\mathcal{Y}$ to denote the ground-truth label from a dataset.
The process reward model (PRM) constitutes a sequence classification function $V_\phi:\mathcal{T}\times\mathcal{I}\times\mathcal{Y}\to[0,1]$, parameterized by $\phi$, which quantifies the epistemic value of a partial reasoning state $\hat{y}_i$ through a scalar reward $p_i=V_\phi(x,\hat{y}_i)$, modeling incremental utility toward solving instruction $t$ under visual grounding $I$. Specifically, $\hat{y}_i$ represents the first $i$ steps of a complete reasoning trajectory $\hat{y}$.
#### PRM training with Monte Carlo signals.
Due to the lack of a ground-truth epistemic value for each partial reasoning state $\hat{y}_i$, training a PRM requires automatically generated approximate supervision signals. An effective way to obtain these signals is the Monte Carlo method [69, 65]. We first feed the input question-image pair $x=(t,I)$ and the prefix solution $\hat{y}_i$ into the MLLM, and let it complete the remaining steps until reaching a final answer. We randomly sample multiple completions, compare their final answers to the gold answer $y$, and thereby obtain multiple correctness labels. The PRM is trained as a sequence classification task to predict these correctness labels. The ratio of correct completions at the $i$-th step estimates the "correctness level" up to step $i$, which is used as the approximate supervision signal $p_i$ to train the PRM. Formally,
$$
p_i=\texttt{MonteCarlo}(x,\hat{y}_i,y)=\frac{\texttt{num(correct completions from }\hat{y}_i)}{\texttt{num(total completions from }\hat{y}_i)} \tag{1}
$$
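The estimator in Eq. (1) can be sketched as follows; `complete` stands in for a stochastic MLLM rollout, and `toy_complete` is a hypothetical toy model used only to make the sketch self-contained:

```python
import random

def monte_carlo_signal(complete, x, prefix, gold, n_samples=8):
    """Estimate p_i in Eq. (1): the fraction of sampled completions
    starting from the prefix solution y_hat_i that reach the gold answer."""
    correct = sum(1 for _ in range(n_samples) if complete(x, prefix) == gold)
    return correct / n_samples

# Hypothetical stand-in for an MLLM: always correct once the prefix
# contains the decisive step, otherwise a coin flip between two answers.
def toy_complete(x, prefix):
    if "key step" in prefix:
        return "42"
    return random.choice(["42", "7"])
```

A prefix that already contains the decisive step receives a signal of 1.0, while weaker prefixes receive fractional signals; these become the soft labels for the PRM's sequence-classification training.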
#### PRM-based inference with aggregation function.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Machine Learning Pipeline with Training and Testing Sets
### Overview
The image is a technical diagram illustrating a machine learning pipeline that processes visual inputs through a Multimodal Large Language Model (MLLM) and a Process Reward Model (PRM). It contrasts the workflow during training versus testing, highlighting a "distribution shift" between the two phases. The diagram uses a flowchart style with icons, text labels, and directional arrows to show data flow and model interactions.
### Components/Axes
The diagram is divided into two primary horizontal sections, each representing a different phase of the machine learning process.
**1. Training Set (Top Section - Blue Theme)**
* **Input:** A map image (top-left) showing a yellow highlighted region and a park.
* **Questions (Blue Boxes):**
* "What is the area of the yellow region?"
* "Which building is west of the park?"
* **Model:** An icon labeled "MLLM" (Multimodal Large Language Model) processes each question.
* **Process Flow:** Each MLLM output is represented by a sequence of blue circles (processing steps). The final step for both sequences is a dashed circle, indicating an incomplete or probabilistic output.
* **Signal:** A curved arrow labeled "Monte Carlo signal" connects the final processing steps of the two MLLM sequences.
* **Evaluation:** The outputs feed into a model icon labeled "PRM" (Process Reward Model).
* **Output:** A blue bracket encompasses this entire section.
**2. Testing Set (Bottom Section - Orange Theme)**
* **Input:** A chart image (bottom-left) containing a bar graph and a pie chart.
* **Question (Green Box):** "What is the value of the highest bar?"
* **Model:** The same "MLLM" icon processes this question.
* **Process Flow:** The MLLM generates two distinct output sequences, represented by two rows of green circles.
* **Evaluation:** Both sequences are evaluated by the "PRM" model.
* **Results:** The top sequence is marked with a red "X" (incorrect). The bottom sequence is marked with a green checkmark (correct).
* **Output:** An orange bracket encompasses this section.
**3. Connecting Element**
* **Label:** "Distribution shift"
* **Visual:** A large, curved orange arrow originates from the PRM in the Training Set section and points to the PRM in the Testing Set section. This indicates that the PRM trained on one data distribution (map-based QA) is being applied to a different distribution (chart-based QA).
### Detailed Analysis
* **Spatial Grounding:** The legend (color-coding) is consistent: Blue elements (questions, processing circles) are associated with the Training Set. Orange elements (input chart, bracket) are associated with the Testing Set. Green elements (question, processing circles) are specific to the testing question's processing flow.
* **Trend Verification:** The diagram does not show numerical trends but illustrates a procedural flow. The trend is the movement of data from input, through the MLLM, to evaluation by the PRM.
* **Component Isolation:**
* **Header/Training Region:** Focuses on training the PRM using multiple, related questions about a single visual input (map), with a "Monte Carlo signal" likely used for reward estimation.
* **Footer/Testing Region:** Focuses on applying the trained PRM to a new visual domain (charts) with a single question, where the PRM must judge the correctness of different MLLM-generated reasoning paths.
* **Text Transcription:** All text is in English. The questions are:
* "What is the area of the yellow region?"
* "Which building is west of the park?"
* "What is the value of the highest bar?"
* Labels: "Training Set", "Testing Set", "MLLM", "Monte Carlo signal", "PRM", "Distribution shift".
### Key Observations
1. **Two-Phase Process:** The system is explicitly designed with separate training and testing phases.
2. **PRM as a Judge:** The PRM's role is to evaluate the quality or correctness of the MLLM's internal reasoning process (the chain of circles), not just the final answer.
3. **Monte Carlo Signal:** This term in the training phase suggests the use of stochastic sampling to estimate rewards or outcomes during training.
4. **Distribution Shift:** This is the central challenge highlighted. The PRM is trained on one type of visual question answering (spatial reasoning on maps) and must generalize to another (data extraction from charts).
5. **Multiple Outputs in Testing:** During testing, the MLLM generates multiple potential reasoning paths for the same question, and the PRM's task is to identify the correct one.
### Interpretation
This diagram illustrates a **reinforcement learning or reward modeling framework for improving multimodal AI reasoning**. The core idea is to train a Process Reward Model (PRM) to act as a verifier or judge.
* **What it demonstrates:** The pipeline aims to make AI reasoning more robust and reliable. Instead of just training a model to produce an answer, it trains a separate model (PRM) to evaluate the *quality of the reasoning steps* that lead to an answer. This is akin to having a teacher who grades not just the final exam answer, but the student's shown work.
* **How elements relate:** The MLLM is the "student" generating answers and reasoning chains. The PRM is the "teacher" or "grader." The "Monte Carlo signal" during training is likely the method used to provide feedback to the PRM on which reasoning paths are good. The "distribution shift" arrow is critical: it shows the system is being tested on its ability to apply learned judgment skills to entirely new problem domains (from maps to charts), which is a key measure of generalization in AI.
* **Notable Implications:** The presence of multiple output paths in the testing phase and the PRM's selection of one as correct suggests this system could be used for **self-improvement or consistency checking**. The MLLM might generate several candidate solutions, and the PRM filters for the most logically sound one. The challenge of distribution shift underscores a major research goal: creating AI systems whose judgment capabilities are not confined to the narrow conditions in which they were trained.
</details>
Figure 2: General flow of training PRM and using PRM for inference. Training phase: Train PRM with Monte Carlo signals from intermediate steps of Chain-of-Thoughts (CoTs). Inference phase: Use the trained PRM to verify CoTs step by step and select the best CoT. Conventional training of PRM has poor generalization capability due to distribution shift between training set and testing set.
After training a PRM, a typical way of conducting PRM-based MLLM inference is to use an aggregation function [69]. Specifically, for each candidate solution $\hat{y}$ from the MLLM, the PRM generates a list of predicted probabilities $p=\{p_1,p_2,\ldots,p_n\}$, one for each step $\hat{y}_i$ in the solution. The list of predicted probabilities is then aggregated using the following function:
$$
A(p)=\sum_{i=1}^{n}\log\frac{p_i}{1-p_i}. \tag{2}
$$
The aggregated value corresponds to the score of a specific prediction $\hat{y}$ , and the final PRM-based solution is the one with the highest aggregated score.
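Eq. (2) and the best-of-N selection it supports can be sketched directly; the candidate dictionary format below is an illustrative assumption:

```python
import math

def aggregate(step_probs):
    """Eq. (2): sum of per-step log-odds log(p / (1 - p)).
    Steps the PRM scores above 0.5 add positive evidence;
    steps below 0.5 subtract."""
    return sum(math.log(p / (1.0 - p)) for p in step_probs)

def best_of_n(candidates):
    """Select the candidate CoT with the highest aggregated PRM score."""
    return max(candidates, key=lambda c: aggregate(c["step_probs"]))
```

For example, a trajectory scored [0.9, 0.8] aggregates to a positive value and is preferred over one scored [0.4, 0.3].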
#### Bi-level optimization.
Bi-level optimization (BLO) has been widely used in meta-learning [14], neural architecture search [31], and data reweighting [54]. A BLO problem is usually formulated as:
$$
\min_{\alpha} U(\alpha,\phi^*(\alpha)) \quad \text{s.t.} \quad \phi^*(\alpha)=\underset{\phi}{\arg\min}\,L(\phi,\alpha) \tag{3}
$$
where $U$ is the upper-level optimization problem (OP) with parameter $\alpha$, and $L$ is the lower-level OP with parameter $\phi$. The lower-level OP is nested within the upper-level one, and the two OPs are mutually dependent.
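To make the formulation concrete, here is a toy bi-level problem solved with a one-step-unrolled hypergradient, a common practical approximation for BLO; the quadratic objectives are illustrative only, not the paper's losses:

```python
def bilevel_step(alpha, phi, lr_phi=0.1, lr_alpha=0.1):
    # Toy instance of the BLO formulation:
    #   lower level: L(phi, alpha) = (phi - alpha)^2  ->  phi*(alpha) = alpha
    #   upper level: U(alpha, phi*) = (phi* - 3)^2    ->  optimal alpha = 3
    grad_phi = 2.0 * (phi - alpha)           # dL/dphi
    phi_new = phi - lr_phi * grad_phi        # lower-level gradient step
    # Hypergradient through the unrolled step:
    # dU/dalpha = dU/dphi_new * dphi_new/dalpha, with dphi_new/dalpha = 2*lr_phi
    grad_alpha = 2.0 * (phi_new - 3.0) * (2.0 * lr_phi)
    alpha_new = alpha - lr_alpha * grad_alpha
    return alpha_new, phi_new

alpha, phi = 0.0, 0.0
for _ in range(500):
    alpha, phi = bilevel_step(alpha, phi)
# both variables drift toward the upper-level optimum at 3
```

The upper-level variable is updated only through its effect on the lower-level solution, which mirrors how DreamPRM's domain weights are updated through the meta-domain loss of the fine-tuned PRM.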
## 4 The Proposed Domain-reweighting Method
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Bi-Level Optimization Framework for DreamPRM Training
### Overview
The image is a technical diagram illustrating a two-level (bi-level) optimization framework for training a model named "DreamPRM." The process involves training on multiple, diverse problem domains (e.g., geometry, data interpretation, algebra) to address "Quality imbalance." The framework separates optimization into a "Lower-level" and an "Upper-level," with a feedback loop managed by a component labeled "BLO."
### Components/Axes
The diagram is organized into three main horizontal sections and a right-side vertical component.
**1. Lower-level Optimization (Top Section):**
* **Domains:** Two example domains are shown.
* **Domain 1 (Blue):** Contains a geometry problem image (a yellow region on a grid) and the question: "What is the area of yellow region?".
* **Domain k (Orange):** Contains a pie chart image and the question: "What is the largest pie area?".
* **Process Flow:** Each domain's question is fed into an "MLLM" (Multimodal Large Language Model) icon. The MLLM output passes through a series of connected circular nodes (blue for Domain 1, orange for Domain k). The final nodes are dashed circles, suggesting intermediate or latent representations.
* **Output:** The processed outputs from both domains converge and point to the "DreamPRM" component on the right.
**2. Upper-level Optimization (Bottom Section):**
* **Domain k+1 (Teal):** Contains an algebra problem: "2x+6=13" and the question: "What is the value of x?".
* **Process Flow:** Similar to the lower level, the question goes through an "MLLM" and a series of teal circular nodes.
* **Feedback Loop:** The final node in this chain has multiple teal arrows pointing back to earlier nodes in the same chain, indicating an iterative or recursive optimization process within this domain.
**3. DreamPRM & BLO (Right Side):**
* **DreamPRM:** Depicted as a robot-head icon. It receives input from the Lower-level Optimization.
* **Domain weights:** Represented by a bar chart icon. Arrows show these weights are used by the PRM and are updated by the BLO.
* **PRM:** Another robot-head icon, connected to the Domain weights.
* **BLO (Bi-Level Optimization):** A central component with dashed purple arrows forming a loop between the "Domain weights" and the "PRM," indicating the upper-level optimization loop that adjusts weights based on performance.
**4. Legend (Bottom Center):**
* **Red flame icon:** "Activated parameters"
* **Blue snowflake icon:** "Frozen parameters"
* This legend is referenced in the PRM icons: the top PRM (connected to Lower-level) has a red flame (activated), while the bottom PRM (connected to BLO) has a blue snowflake (frozen).
### Detailed Analysis
* **Spatial Grounding:** The "Lower-level Optimization" label is centered at the top. "Domain 1" and "Domain k" are left-aligned in their respective rows. The "Quality imbalance" label is positioned between the two lower-level domains. "Upper-level Optimization" is centered above the third domain. The "DreamPRM" system is vertically aligned on the far right. The legend is centered at the very bottom.
* **Trend & Flow Verification:** The visual flow is strictly left-to-right for the initial processing within each domain. The lower-level outputs converge rightward into DreamPRM. The upper-level shows a left-to-right flow with a prominent backward (right-to-left) feedback loop. The BLO creates a vertical, cyclical flow between Domain weights and the PRM.
* **Component Isolation:**
* **Header:** Contains the main title "Lower-level Optimization."
* **Main Chart Area:** Contains the three domain rows, their internal MLLM/node chains, and the convergence to DreamPRM.
* **Footer:** Contains the parameter legend.
* **Text Transcription:** All text is in English. Key phrases include: "Lower-level Optimization," "Upper-level Optimization," "Domain 1," "Domain k," "Domain k+1," "Quality imbalance," "What is the area of yellow region?," "What is the largest pie area?," "What is the value of x?," "2x+6=13," "MLLM," "DreamPRM," "Domain weights," "PRM," "BLO," "Activated parameters," "Frozen parameters."
### Key Observations
1. **Quality Imbalance:** The diagram explicitly labels the challenge of "Quality imbalance" across different problem domains (e.g., visual geometry vs. textual algebra).
2. **Two-Tiered Training:** The framework separates training into domain-specific, lower-level optimization and a global, upper-level optimization that manages domain weights.
3. **Parameter Management:** The legend and PRM icons indicate a strategy where parameters are selectively activated (fine-tuned) or frozen during different stages of the bi-level process.
4. **Iterative Refinement:** The upper-level domain (k+1) shows an internal feedback loop, suggesting iterative self-improvement or reinforcement within a single domain type.
### Interpretation
This diagram outlines a sophisticated machine learning training strategy designed to create a robust and generalizable "DreamPRM" model. The core problem it addresses is **domain imbalance**, where a model might perform well on some types of problems (e.g., visual puzzles) but poorly on others (e.g., symbolic math).
The **Lower-level Optimization** appears to be responsible for training the model on individual, diverse task domains in parallel. The outputs from these specialized trainings are then used to update the core DreamPRM model.
The **Upper-level Optimization**, governed by the BLO, acts as a meta-learner. It doesn't train on raw problems but instead optimizes the "Domain weights." This means it learns *how much importance* to assign to each domain's training signal when updating the final PRM. The feedback loop (dashed purple arrows) suggests it evaluates the PRM's performance and adjusts these weights to ensure balanced mastery across all domains, directly countering the "Quality imbalance."
The use of **activated vs. frozen parameters** implies an efficient training methodology, possibly akin to parameter-efficient fine-tuning (PEFT), where only specific parts of the model are updated during certain phases to preserve knowledge and reduce computational cost.
In essence, the framework proposes a **hierarchical learning system**: the lower level learns *what* to solve in each domain, while the upper level learns *how to balance* that learning to produce a single, well-rounded model (DreamPRM) that performs reliably across a wide spectrum of tasks.
</details>
Figure 3: The proposed bi-level optimization based domain-reweighting method. Lower-level optimization: In this stage, the PRM's parameters are updated on multiple datasets with domain weights, allowing the PRM to prioritize domains with better quality. Upper-level optimization: In this stage, the PRM is evaluated on a separate meta dataset to compute an aggregation function loss and optimize the domain weights. DreamPRM helps address dataset quality imbalance problems and leads to stronger and more generalizable reasoning performance.
#### Overview.
Training process reward models (PRMs) for MLLMs is challenging for two reasons: (1) dataset (domain) quality imbalance, and (2) the discrepancy between training and inference procedures. To address these two challenges, we propose DreamPRM, which automatically searches for domain importance using a novel aggregation function loss that better simulates the inference process of the PRM. Under a bi-level optimization framework, it optimizes PRM parameters with Monte Carlo signals at the lower level, and optimizes trainable domain importance weights with the aggregation function loss at the upper level. An overview of the DreamPRM method is shown in Fig. 3.
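The alternating structure of this framework can be sketched as a simple training schedule. This is a minimal illustration only: `lower_step` and `upper_step` are hypothetical stand-ins for one gradient update on the weighted training loss and on the meta loss, respectively, and the 5-step unroll matches the hyperparameters reported later in the paper.

```python
def bilevel_train(phi, alpha, lower_step, upper_step, outer_iters, unroll=5):
    """Alternating bi-level schedule: several lower-level PRM parameter
    updates per upper-level domain-weight update."""
    for _ in range(outer_iters):
        for _ in range(unroll):
            phi = lower_step(phi, alpha)   # minimize weighted L_tr w.r.t. phi
        alpha = upper_step(phi, alpha)     # minimize L_meta w.r.t. alpha
    return phi, alpha
```

In practice the upper-level gradient flows through the unrolled lower-level steps (hypergradient); the sketch above only shows the update schedule, not the differentiation machinery.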
#### Datasets.
We begin with $K{+}1$ datasets, each from a distinct domain (e.g., science, geometry). The first $K$ datasets form the training pool $D_{tr}=\{D_1,\dots,D_K\}$, while the remaining dataset, $D_{meta}=D_{K+1}$, is a meta (validation) dataset of higher quality.
#### Lower-level optimization: domain-reweighted training of PRM.
In lower-level optimization, we aim to update the parameters $\phi$ of the PRM with domain-reweighted training. We first define the typical PRM training loss $L_{tr}$ on a single domain $D_k$, given PRM parameters $\phi$, as follows:
$$
L_{tr}(D_k,\phi)=\sum_{(x,y)\in D_k}\sum_{i=1}^{n} L_{MSE}\big(V_\phi(x,\hat{y}_i),\,p_i\big) \tag{5}
$$
where $\hat{y}_i$ is the prefix of the MLLM-generated text $\hat{y}=M_\theta(x)$ given input pair $x=(t,I)$, and $p_i$ is the process supervision signal obtained by Monte Carlo estimation given input pair $x$, prefix $\hat{y}_i$, and ground-truth label $y$, as previously defined in Equation 1. The PRM is optimized by minimizing the mean squared error (MSE) between the supervision signal and the PRM-predicted score $V_\phi(x,\hat{y}_i)$. With the PRM training loss on a single domain $D_k$ above, we next define the domain-reweighted training objective of the PRM on multiple training domains $D_{tr}=\{D_k\}_{k=1}^K$. The overall objective is a weighted sum of the single-domain PRM training losses, allowing the contribution of each domain to be adjusted during the learning process:
$$
L_{tr}(D_{tr},\phi,\alpha)=\sum_{k=1}^{K}\alpha_k\,L_{tr}(D_k,\phi) \tag{6}
$$
Here, $\alpha=\{\alpha_k\}_{k=1}^K$ represents the trainable domain weight parameters, indicating the importance of each domain. By optimizing this objective, we obtain the optimal PRM parameters $\phi^*$:
$$
\phi^*(\alpha)=\underset{\phi}{\arg\min}\;L_{tr}(D_{tr},\phi,\alpha) \tag{7}
$$
It is worth mentioning that only $\phi$ is optimized at this level, while $\alpha$ remains fixed.
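As an illustration, the weighted objective in Eq. 6 can be computed as below. This is a minimal pure-Python sketch: `prm_score` stands in for the PRM $V_\phi$, and the batch representation (one list of predictions and Monte Carlo targets per domain) is a simplifying assumption.

```python
def mse(pred, target):
    """Mean squared error between two equal-length score lists."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

def weighted_train_loss(domain_batches, alpha, prm_score):
    """Domain-reweighted PRM training loss: sum_k alpha_k * L_tr(D_k, phi).

    domain_batches: one (inputs, monte_carlo_targets) pair per domain
    alpha:          one trainable weight per training domain
    prm_score:      the PRM, mapping inputs to predicted step scores
    """
    return sum(
        a_k * mse(prm_score(x), p)
        for a_k, (x, p) in zip(alpha, domain_batches)
    )
```

A domain with a small $\alpha_k$ thus contributes proportionally less gradient signal to the PRM update, which is how low-quality domains are down-weighted.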
#### Upper-level optimization: learning domain reweighting parameters.
In upper-level optimization, we optimize the domain-reweighting parameters $\alpha$ on the meta dataset $D_{meta}$, given the optimal PRM weights $\phi^*(\alpha)$ obtained from the lower level. To make the meta-learning target more closely reflect the actual PRM-based inference process, we propose a novel meta loss function $L_{meta}$, different from the training loss $L_{tr}$. Specifically, we first obtain an aggregated score $A(p)$ for each generated solution $\hat{y}$ from the MLLM given input pair $x=(t,I)$, following the process in Section 3. We then create a ground-truth signal $r(\hat{y},y)$ by assigning it a value of 1 if the generated $\hat{y}$ contains the ground truth $y$, and 0 otherwise. The meta loss is defined as the mean squared error between the aggregated score and the ground-truth signal:
$$
L_{meta}(D_{meta},\phi^*(\alpha))=\sum_{(x,y)\in D_{meta}} L_{MSE}\Big(\sigma\big(A(V_{\phi^*(\alpha)}(x,\hat{y}))\big),\,r(\hat{y},y)\Big) \tag{8}
$$
where $A$ represents the aggregation function as previously defined in Equation 2, and $\sigma$ denotes the sigmoid function that maps the aggregated score to a probability. Accordingly, the optimization problem at the upper level is formulated as follows:
$$
\underset{\alpha}{\min}\;L_{meta}(D_{meta},\phi^*(\alpha)) \tag{9}
$$
To solve this optimization problem, we propose an efficient gradient-based algorithm, which is detailed in Appendix A.
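For concreteness, the meta loss of Eq. 8 can be sketched as follows. This is an illustrative pure-Python version: the `aggregate` callable stands in for the aggregation function $A$ of Eq. 2 (whose exact form is defined earlier in the paper) and is passed in as an assumption.

```python
import math

def sigmoid(z):
    """Map an aggregated score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def meta_loss(solutions, aggregate):
    """Aggregation-function meta loss over the meta dataset.

    solutions: list of (step_scores, correct) pairs, where step_scores are
               the PRM scores for the prefixes of one sampled solution and
               correct is the 0/1 signal r(y_hat, y)
    aggregate: maps a list of step scores to one solution-level score
    """
    return sum(
        (sigmoid(aggregate(scores)) - correct) ** 2
        for scores, correct in solutions
    )
```

Because the loss is computed on whole solutions scored the same way they are scored at test time, minimizing it with respect to $\alpha$ directly targets PRM-based inference quality rather than per-step regression error.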
## 5 Experimental Results
### 5.1 Experimental settings
#### Multistage reasoning.
To elicit consistent, steady reasoning responses from current MLLMs, we draw on the Llava-CoT approach [75], which fosters structured thinking prior to answer generation. Specifically, we prompt MLLMs to follow five reasoning steps: (1) Restate the question. (2) Gather evidence from the image. (3) Identify any background knowledge needed. (4) Reason with the current evidence. (5) Summarize and conclude with all the information. We also explore zero-shot prompting settings in conjunction with structural reasoning, which can be found in Appendix C. We use 8 different chain-of-thought reasoning trajectories for all test-time scaling methods, unless otherwise stated.
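The five-step structure can be assembled programmatically, e.g. as below. The instruction-header wording is a hypothetical illustration, not the paper's verbatim prompt; only the five step descriptions are taken from the text.

```python
# The five reasoning steps listed in the paper (verbatim).
REASONING_STEPS = [
    "Restate the question.",
    "Gather evidence from the image.",
    "Identify any background knowledge needed.",
    "Reason with the current evidence.",
    "Summarize and conclude with all the information.",
]

def build_structured_prompt(question: str) -> str:
    """Combine a question with the numbered five-step reasoning scaffold."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(REASONING_STEPS, 1))
    return f"{question}\n\nAnswer by reasoning through these steps:\n{steps}"
```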
Table 1: Comparative evaluation of DreamPRM and baselines on multimodal reasoning benchmarks. Bold numbers indicate the best performance, while underlined numbers indicate the second best. The table reports accuracy (%) on five datasets: WeMath, MathVista, MathVision, MMVet, and MMStar.
| Method | WeMath (loose) | MathVista (testmini) | MathVision (test) | MMVet (v1) | MMStar (test) |
| --- | --- | --- | --- | --- | --- |
| Zero-shot Methods | | | | | |
| Gemini-1.5-Pro [50] | 46.0 | 63.9 | 19.2 | <u>64.0</u> | 59.1 |
| GPT-4v [46] | 51.4 | 49.9 | <u>21.7</u> | **67.7** | <u>62.0</u> |
| LLaVA-OneVision-7B [26] | 44.8 | 63.2 | 18.4 | 57.5 | 61.7 |
| Qwen2-VL-7B [66] | 42.9 | 58.2 | 16.3 | 62.0 | 60.7 |
| InternVL-2.5-8B-MPO [67] | 51.7 | 65.4 | 20.4 | 55.9 | 58.9 |
| Test-time Scaling Methods (InternVL-2.5-8B-MPO based) | | | | | |
| Self-consistency [68] | 56.4 | 67.1 | 20.7 | 57.4 | 59.6 |
| Self-correction [17] | 54.0 | 63.8 | 21.6 | 54.9 | 59.7 |
| ORM [52] | 56.9 | 65.3 | 20.5 | 55.9 | 60.1 |
| Vanilla PRM [29] | 54.2 | 67.2 | 20.6 | 58.9 | 60.8 |
| CaR-PRM [16] | 54.7 | <u>67.5</u> | 21.0 | 60.6 | 61.1 |
| s1-PRM [44] | <u>57.1</u> | 65.8 | 20.2 | 60.1 | 60.4 |
| DreamPRM (ours) | **57.4** | **68.9** | **22.1** | 61.4 | **62.3** |
#### Base models.
For inference, we use InternVL-2.5-8B-MPO [67] as the base MLLM, which has undergone post-training to enhance its reasoning abilities and is well-suited for our experiment. For fine-tuning PRM, we adopt Qwen2-VL-2B-Instruct [66]. Qwen2-VL is a state-of-the-art multimodal model pretrained for general vision-language understanding tasks. This pretrained model serves as the initialization for our fine-tuning process.
#### Training hyperparameters.
In the lower-level optimization, we perform 5 inner gradient steps per outer update (unroll steps = 5) using the AdamW [32] optimizer with the learning rate set to $5\times 10^{-7}$. In the upper-level optimization, we use the AdamW optimizer ($lr=0.01$, weight decay $=10^{-3}$) and a StepLR scheduler (step size = 5000, $\gamma=0.5$). In total, DreamPRM is fine-tuned for 10000 iterations. Our method is implemented with Betty [7], and the fine-tuning process takes approximately 10 hours on a single NVIDIA A100 GPU.
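Under the stated upper-level schedule, the learning rate halves every 5000 steps, so the 10000-iteration run sees two learning-rate plateaus. The standard StepLR decay rule can be reproduced as a small sketch:

```python
def step_lr(base_lr: float, step_size: int, gamma: float, iteration: int) -> float:
    """Learning rate after `iteration` steps under a StepLR schedule:
    lr = base_lr * gamma ** (iteration // step_size)."""
    return base_lr * gamma ** (iteration // step_size)
```

With the paper's upper-level settings ($lr=0.01$, step size = 5000, $\gamma=0.5$), iterations 0-4999 use 0.01 and iterations 5000-9999 use 0.005.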
#### Baselines.
We use three major categories of baselines: (1) State-of-the-art models on public leaderboards, including Gemini-1.5-Pro [50], GPT-4v [46], LLaVA-OneVision-7B [26], and Qwen2-VL-7B [66]. We also carefully reproduce the results of InternVL-2.5-8B-MPO with structural thinking. (2) Test-time scaling methods (excluding PRM) based on the InternVL-2.5-8B-MPO model, including: (i) Self-consistency [68], which selects the most consistent reasoning chain via majority voting over multiple responses; (ii) Self-correction [17], which prompts the model to critically reflect on and revise its initial answers; and (iii) Outcome Reward Model (ORM) [52], which evaluates and scores the final response to select the most promising one. (3) PRM-based methods, including: (i) Vanilla PRM trained without any data selection, as commonly used in LLM settings [29]; (ii) s1-PRM, which selects high-quality reasoning responses based on three criteria (difficulty, quality, and diversity), following the s1 strategy [44]; and (iii) CaR-PRM, which filters high-quality visual questions using clustering and ranking techniques, as proposed in CaR [16].
#### Datasets and benchmarks.
We use 15 multimodal datasets for lower-level optimization ( $D_tr$ ), covering four domains: science, chart, geometry, and commonsense, as listed in Appendix Table 2. For upper-level optimization ( $D_meta$ ), we adopt the MMMU [79] dataset. Evaluation is conducted on five multimodal reasoning benchmarks: WeMath [48], MathVista [33], MathVision [63], MMVet [78], and MMStar [5]. Details are provided in Appendix B.
### 5.2 Benchmark evaluation of DreamPRM
Tab. 1 presents the primary experimental results. We observe the following.

(1) DreamPRM outperforms other PRM-based methods, highlighting the effectiveness of our domain-reweighting strategy. Compared to the vanilla PRM trained without any data selection, DreamPRM achieves a consistent performance gain of 2%-3% across all five datasets, suggesting that effective data selection is crucial for training high-quality multimodal PRMs. Moreover, DreamPRM also outperforms s1-PRM and CaR-PRM, which rely on manually designed heuristic rules for data selection. These results indicate that selecting suitable reasoning datasets for PRM training is a complex task, and handcrafted rules are often suboptimal. In contrast, our automatic domain-reweighting approach enables the model to adaptively optimize its learning process, illustrating how data-driven optimization offers a scalable solution to dataset selection challenges.

(2) DreamPRM outperforms SOTA MLLMs with far fewer parameters. For example, DreamPRM significantly surpasses two trillion-scale closed-source models (GPT-4v and Gemini-1.5-Pro) on 4 out of 5 datasets. In addition, it consistently improves the performance of the base model, InternVL-2.5-8B-MPO, achieving an average gain of 4% on the five datasets. These results confirm that DreamPRM yields a high-quality PRM capable of enhancing multimodal reasoning across a wide range of benchmarks.

(3) DreamPRM outperforms other test-time scaling methods, primarily because it enables the training of a high-quality PRM that conducts fine-grained, step-level evaluation. While most test-time scaling methods yield moderate improvements, DreamPRM leads to the most substantial gains, suggesting that the quality of the reward model is critical for effective test-time scaling.
We further provide case studies in Appendix D, which intuitively illustrate how DreamPRM assigns higher scores to coherent and high-quality reasoning trajectories.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Bar Chart: Leaderboard on MathVista
### Overview
The image is a horizontal bar chart displaying the performance scores of various AI models on the "MathVista" benchmark. The chart ranks models from highest to lowest score, with each model represented by a distinct colored bar. The title "Leaderboard on MathVista" is centered at the top.
### Components/Axes
* **Chart Title:** "Leaderboard on MathVista" (centered, top).
* **Y-Axis (Vertical):** Represents the performance score as a percentage. The axis is labeled with markers at 0%, 20%, 40%, 60%, 80%, and 100%.
* **X-Axis (Horizontal):** Lists the names of the AI models being compared. The labels are positioned below each corresponding bar.
* **Data Labels:** Each bar has its exact percentage score displayed directly above it.
* **Legend/Color Mapping:** Each model is assigned a unique color for its bar. The mapping is as follows (from left to right):
* Blue: `o4-mini + DreamPRM`
* Orange: `VL-Rethinker`
* Green: `Step R1-V-Mini-preview-20250308`
* Red: `Kimi-k1.6-preview-20250308`
* Purple: `Doubao-pro-1.5`
* Brown: `Ovis2_34B`
* Pink: `Kimi-k1.5`
* Grey: `OpenAI o1`
* Yellow-Green: `Llama 4 Maverick`
* Cyan: `Vision-R1-7B`
### Detailed Analysis
The chart presents a ranked list of 10 AI models based on their MathVista benchmark scores. The data is sorted in descending order of performance.
1. **o4-mini + DreamPRM** (Blue bar, far left): **85.2%**. This is the highest-performing model on the chart.
2. **VL-Rethinker** (Orange bar): **80.3%**.
3. **Step R1-V-Mini-preview-20250308** (Green bar): **80.1%**.
4. **Kimi-k1.6-preview-20250308** (Red bar): **80.0%**.
5. **Doubao-pro-1.5** (Purple bar): **79.5%**.
6. **Ovis2_34B** (Brown bar): **77.1%**.
7. **Kimi-k1.5** (Pink bar): **74.9%**.
8. **OpenAI o1** (Grey bar): **73.9%**.
9. **Llama 4 Maverick** (Yellow-Green bar): **73.7%**.
10. **Vision-R1-7B** (Cyan bar, far right): **73.2%**. This is the lowest-performing model shown.
**Trend Verification:** The visual trend is a clear, steady decline in bar height from left to right, corresponding to the descending order of the numerical scores. There are no sudden jumps or outliers that break this descending pattern.
### Key Observations
* **Performance Cluster:** The top four models (`o4-mini + DreamPRM`, `VL-Rethinker`, `Step R1-V-Mini`, `Kimi-k1.6`) form a leading cluster, all scoring at or above 80.0%. The gap between the 1st and 4th place is only 5.2 percentage points.
* **Significant Drop:** There is a noticeable performance drop of 2.4 percentage points between the 5th place model (`Doubao-pro-1.5` at 79.5%) and the 6th place model (`Ovis2_34B` at 77.1%).
* **Tight Grouping at the Lower End:** The bottom three models (`OpenAI o1`, `Llama 4 Maverick`, `Vision-R1-7B`) are very closely grouped, with only a 0.7 percentage point spread between them (73.9% to 73.2%).
* **Model Naming Conventions:** Several model names include version numbers or date stamps (e.g., `-preview-20250308`, `-1.5`, `_34B`), indicating they are likely specific releases or configurations.
### Interpretation
This leaderboard provides a snapshot of the competitive landscape for AI models on the MathVista benchmark, which evaluates mathematical and visual reasoning capabilities.
* **State of the Art:** The `o4-mini + DreamPRM` combination demonstrates a clear lead, suggesting that its specific architecture or training methodology (potentially involving a "DreamPRM" component) is currently highly effective for this type of task.
* **Competitive Middle Tier:** The tight clustering of models between 73% and 80% indicates a highly competitive field where incremental improvements can significantly change ranking. The presence of multiple models from similar families (e.g., two "Kimi" variants) shows iterative development within organizations.
* **Benchmark Context:** The scores, ranging from 73.2% to 85.2%, suggest that MathVista is a challenging benchmark where even top models do not achieve near-perfect scores. This implies the tasks involve complex reasoning that remains difficult for current AI systems.
* **Actionable Insight:** For researchers or users, this chart highlights which models are currently top performers for mathematical visual reasoning. The close scores among many models suggest that factors beyond raw accuracy, such as computational efficiency, speed, or specific sub-task performance, may be important for practical selection. The date stamps in some names also emphasize the rapid pace of development in this field.
</details>
Figure 4: Leaderboard on MathVista (as of October 15, 2025). The first column (âo4-mini + DreamPRMâ) reports our own evaluation, while the remaining results are taken from the official MathVista leaderboard. The compared models include VL-Rethinker [62], Step R1-V-Mini [58], Kimi-k1.6-preview [43], Kimi-k1.5 [24], Doubao-pro-1.5 [60], Ovis2-34B [1], OpenAI o1 [45], Llama 4 Maverick [41, 42], and Vision-R1-7B [18].
### 5.3 Leaderboard performance of DreamPRM
As shown in Fig. 4, DreamPRM achieves the top-1 accuracy of 85.2% on the MathVista leaderboard (as of October 15, 2025). The result (o4-mini + DreamPRM) has been officially verified through the MathVista evaluation. Compared with a series of strong multimodal reasoning baselines, including VL-Rethinker [62], Step R1-V-Mini [58], Kimi-k1.6-preview [43], Doubao-pro-1.5 [60], Ovis2-34B [1], OpenAI o1 [45], Llama 4 Maverick [41, 42], and Vision-R1-7B [18], DreamPRM demonstrates clearly superior multimodal reasoning capability.
Table 5 in the Appendix provides a detailed comparison among various Process Reward Model (PRM) variants built on the same o4-mini backbone. DreamPRM surpasses all counterparts, improving the base o4-mini model from 80.6% (pass@1) and 81.7% (self-consistency@8) to 85.2%. This consistent gain verifies the effectiveness of DreamPRM in enhancing reasoning accuracy through process-level supervision and reliable consensus across multiple chains of thought.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Radar Charts: Multi-Benchmark Performance Comparison
### Overview
The image displays three radar charts (also known as spider charts) arranged horizontally. Each chart compares the performance of different methods or model variants across five common benchmarks: MathVista, WeMath, MMStar, MMVet, and MathVision. The charts are titled "Data selection comparison," "Test-time scaling comparison," and "Ablation study," respectively. A consistent method labeled "DreamPRM" (cyan line) appears in all three charts, serving as a baseline for comparison.
### Components/Axes
- **Chart Type:** Radar Charts (Spider Plots)
- **Common Axes (Benchmarks):** Five axes radiate from the center, each representing a benchmark:
1. MathVista (Top)
2. WeMath (Top-Right)
3. MMStar (Bottom-Right)
4. MMVet (Bottom-Left)
5. MathVision (Top-Left)
- **Scale:** The concentric circles represent performance scores, increasing from the center (0) outward. The outermost ring appears to represent a score of approximately 70.
- **Legends:** Each chart has a legend positioned directly below it, mapping line colors to method names.
### Detailed Analysis
#### Chart 1: Data selection comparison
- **Legend (Bottom-Left):**
- Orange: No selection
- Purple: CaR selection
- Red: s1 selection
- Cyan: DreamPRM
- **Data Series & Approximate Values (Score on each benchmark):**
- **DreamPRM (Cyan):** Forms the outermost polygon. Values: MathVista ~68.9, WeMath ~57.1, MMStar ~61.1, MMVet ~60.1, MathVision ~65.0.
- **s1 selection (Red):** Forms an inner polygon. Values: MathVista ~65.8, WeMath ~52.7, MMStar ~50.1, MMVet ~50.1, MathVision ~60.0.
- **CaR selection (Purple):** Forms an inner polygon, generally inside the red line. Values: MathVista ~65.3, WeMath ~52.7, MMStar ~49.1, MMVet ~49.1, MathVision ~59.0.
- **No selection (Orange):** Forms the innermost polygon. Values: MathVista ~61.5, WeMath ~47.7, MMStar ~47.1, MMVet ~47.1, MathVision ~56.0.
- **Trend Verification:** The cyan line (DreamPRM) is consistently the outermost, indicating the highest performance across all five benchmarks. The red line (s1 selection) is generally next, followed by purple (CaR selection), with orange (No selection) being the innermost.
#### Chart 2: Test-time scaling comparison
- **Legend (Bottom-Center):**
- Orange: Self-consistency
- Purple: ORM
- Red: Self-correction
- Cyan: DreamPRM
- **Data Series & Approximate Values:**
- **DreamPRM (Cyan):** Outermost polygon. Values: MathVista ~68.9, WeMath ~60.1, MMStar ~62.3, MMVet ~61.3, MathVision ~65.0.
- **Self-correction (Red):** Inner polygon. Values: MathVista ~63.8, WeMath ~54.9, MMStar ~50.1, MMVet ~57.4, MathVision ~59.0.
- **ORM (Purple):** Inner polygon. Values: MathVista ~65.3, WeMath ~54.9, MMStar ~50.1, MMVet ~55.9, MathVision ~59.0.
- **Self-consistency (Orange):** Innermost polygon. Values: MathVista ~67.1, WeMath ~54.9, MMStar ~50.1, MMVet ~57.4, MathVision ~59.0.
- **Trend Verification:** DreamPRM (cyan) again forms the outermost shape. The other three methods (Self-consistency, ORM, Self-correction) are clustered more closely together in the middle range, with Self-consistency (orange) showing a notably higher score on MathVista compared to its performance on other axes.
#### Chart 3: Ablation study
- **Legend (Bottom-Right):**
- Orange: w/o AFL
- Purple: w/o ST
- Red: w/o BLO
- Cyan: DreamPRM
- **Data Series & Approximate Values:**
- **DreamPRM (Cyan):** Outermost polygon. Values: MathVista ~68.9, WeMath ~55.3, MMStar ~61.3, MMVet ~61.2, MathVision ~65.0.
- **w/o BLO (Red):** Inner polygon. Values: MathVista ~66.1, WeMath ~55.0, MMStar ~59.6, MMVet ~59.6, MathVision ~60.4.
- **w/o ST (Purple):** Inner polygon. Values: MathVista ~66.4, WeMath ~55.0, MMStar ~59.6, MMVet ~59.6, MathVision ~60.4.
- **w/o AFL (Orange):** Innermost polygon. Values: MathVista ~66.1, WeMath ~55.0, MMStar ~59.6, MMVet ~59.6, MathVision ~60.4.
- **Trend Verification:** DreamPRM (cyan) is the outermost. The three ablated versions (w/o AFL, w/o ST, w/o BLO) form nearly identical, overlapping polygons, suggesting that removing any one of these components (AFL, ST, BLO) has a similar, detrimental effect on performance across all benchmarks.
### Key Observations
1. **Consistent Superiority:** The "DreamPRM" method (cyan line) achieves the highest score on every single benchmark across all three comparison charts.
2. **Performance Hierarchy:** In the "Data selection comparison," a clear performance hierarchy is visible: DreamPRM > s1 selection > CaR selection > No selection.
3. **Clustering of Alternatives:** In the "Test-time scaling comparison," the alternative methods (Self-consistency, ORM, Self-correction) cluster together, performing significantly below DreamPRM but above the "No selection" baseline from the first chart.
4. **Impact of Ablation:** The "Ablation study" shows that removing any of the three components (AFL, ST, BLO) from the DreamPRM framework results in a similar and substantial drop in performance, indicating all are critical to its effectiveness.
5. **Benchmark Difficulty:** The relative ordering of benchmarks by score is not perfectly consistent across methods, but MathVista generally yields the highest scores, while MMStar and MMVet often yield the lowest for the non-DreamPRM methods.
### Interpretation
This set of charts presents a compelling technical narrative for the effectiveness of the "DreamPRM" method.
- **What the data suggests:** The data strongly suggests that DreamPRM is a superior approach for the task(s) measured by these five mathematical reasoning benchmarks (MathVista, WeMath, etc.). Its advantage is not marginal but substantial and consistent.
- **How elements relate:** The three charts build a logical argument:
1. **Chart 1** establishes that intelligent data selection (s1, CaR) helps, but DreamPRM's selection strategy is better.
2. **Chart 2** shows that even advanced test-time techniques (Self-consistency, ORM) are outperformed by DreamPRM's approach.
3. **Chart 3** deconstructs DreamPRM, revealing that its core components (AFL, ST, BLO) are all essential; removing any one degrades performance to a similar, lower level.
- **Notable Anomalies/Patterns:** The near-identical performance of the three ablated models in Chart 3 is striking. It suggests these components may be interdependent or contribute equally vital, non-redundant functionality. The high score of "Self-consistency" on MathVista in Chart 2, relative to its other scores, might indicate that this particular benchmark benefits more from simple ensemble methods than others do.
**In summary, the visual evidence positions DreamPRM as a state-of-the-art method whose performance gain stems from a synergistic combination of its core components, outperforming both simpler selection strategies and other sophisticated test-time scaling techniques.**
</details>
Figure 5: Comparative evaluation of DreamPRM on multimodal reasoning benchmarks. Radar charts report accuracy (%) on five datasets (WeMath, MathVista, MathVision, MMVet, and MMStar). (a) Impact of different data selection strategies. (b) Comparison with existing test-time scaling methods. (c) Ablation study of three key components, i.e. w/o aggregation function loss (AFL), w/o bi-level optimization (BLO), and w/o structural thinking (ST).
<details>
<summary>x6.png Details</summary>

### Visual Description
## Radar Chart: Scaling Ability Across Five Benchmarks
### Overview
This image is a radar chart (also known as a spider chart) titled "Scaling ability". It visually compares the performance of four different methods or models across five distinct benchmarks. The chart uses concentric polygons to represent the score of each method on each benchmark, allowing for a quick comparison of their overall capabilities and relative strengths.
### Components/Axes
* **Chart Title:** "Scaling ability" (centered at the top).
* **Axes (Benchmarks):** The chart has five axes radiating from the center, each representing a benchmark. The labels are placed at the outer end of each axis:
* **Top (12 o'clock):** MathVista
* **Top-Right (2-3 o'clock):** WeMath
* **Bottom-Right (4-5 o'clock):** MMStar
* **Bottom-Left (7-8 o'clock):** MMVet
* **Left (9 o'clock):** MathVision
* **Legend:** Positioned at the bottom of the chart, centered horizontally. It contains four entries, each with a colored line segment and a label:
* **Orange Line:** Zero-shot
* **Red Line:** DreamPRM@2
* **Pink Line:** DreamPRM@4
* **Cyan (Light Blue) Line:** DreamPRM@8
* **Scale:** The chart uses concentric polygons (approximating circles) to indicate score levels. The outermost ring represents the highest possible score (implied to be 100 or a normalized maximum), with scores decreasing toward the center. Specific numerical score labels are placed directly on the chart near the vertices of each polygon.
### Detailed Analysis
The chart plots the score of each method on each benchmark. The visual trend for each series is as follows:
* **DreamPRM@8 (Cyan):** Forms the outermost polygon, indicating the highest overall performance. Its shape is relatively balanced but peaks at MathVista.
* **DreamPRM@4 (Pink):** Forms a polygon inside the cyan one, showing lower performance than @8 but generally higher than the others.
* **DreamPRM@2 (Red):** Forms a polygon inside the pink one, with a notably sharp, narrow shape, indicating more variable performance across benchmarks.
* **Zero-shot (Orange):** Forms the innermost polygon, representing the baseline performance.
**Data Points by Benchmark (Approximate Values):**
1. **MathVista (Top Axis):**
* DreamPRM@8 (Cyan): 68.9
* DreamPRM@4 (Pink): 66.5
* DreamPRM@2 (Red): 65.3
* Zero-shot (Orange): 65.3 (Note: The red and orange labels overlap here, suggesting very close or identical scores.)
2. **WeMath (Top-Right Axis):**
* DreamPRM@8 (Cyan): 57.4
* DreamPRM@4 (Pink): 54.5
* DreamPRM@2 (Red): 53.6
* Zero-shot (Orange): 51.7
3. **MMStar (Bottom-Right Axis):**
* DreamPRM@8 (Cyan): 62.3
* DreamPRM@4 (Pink): 60.0
* DreamPRM@2 (Red): 59.3
* Zero-shot (Orange): 58.9
4. **MMVet (Bottom-Left Axis):**
* DreamPRM@8 (Cyan): 61.4
* DreamPRM@4 (Pink): 60.3
* DreamPRM@2 (Red): 60.3 (The red and pink labels are very close, indicating near-identical scores.)
* Zero-shot (Orange): 55.9
5. **MathVision (Left Axis):**
* DreamPRM@8 (Cyan): 60.0 (The label is placed near the axis but slightly offset; the vertex is on the axis line.)
* DreamPRM@4 (Pink): 20.0 (This appears to be a significant outlier, with the pink vertex pulled sharply inward.)
* DreamPRM@2 (Red): 20.0 (Similar outlier, with the red vertex also pulled inward to the same point.)
* Zero-shot (Orange): 20.0 (All three inner series converge at this low point on the MathVision axis.)
### Key Observations
1. **Clear Performance Hierarchy:** There is a consistent ranking across all benchmarks: DreamPRM@8 > DreamPRM@4 > DreamPRM@2 â Zero-shot. The cyan polygon fully encloses the pink, which encloses the red/orange cluster.
2. **Scaling Effect:** The "@8", "@4", and "@2" notation likely refers to a scaling parameter (e.g., number of reasoning steps, ensemble size). Performance improves monotonically with this parameter, as seen by the expanding polygons from Zero-shot to DreamPRM@8.
3. **MathVision Anomaly:** All methods except DreamPRM@8 show a dramatic drop in performance on the MathVision benchmark, scoring only ~20. This creates a sharp inward "pinch" in the red, pink, and orange polygons on the left side of the chart. DreamPRM@8 maintains a high score (~60) here, suggesting it uniquely overcomes a challenge present in this benchmark for the other methods.
4. **Converging Scores:** On the MMVet axis, the scores for DreamPRM@4 and DreamPRM@2 are nearly identical (60.3 vs. 60.3). On the MathVision axis, the scores for Zero-shot, DreamPRM@2, and DreamPRM@4 are identical (20.0).
5. **Most Improved Benchmark:** The largest absolute gain from the baseline (Zero-shot) to the top method (DreamPRM@8) appears on the MathVision benchmark (+40 points), followed by MathVista (+3.6 points) and WeMath (+5.7 points).
### Interpretation
This radar chart effectively demonstrates the "scaling ability" of the DreamPRM method. The core finding is that increasing the scaling parameter (from @2 to @4 to @8) leads to consistent and significant performance improvements across a diverse set of visual-mathematical reasoning benchmarks.
The data suggests that the DreamPRM@8 model is not only the strongest overall but also the most robust. Its ability to maintain high performance on MathVision, where all other methods fail catastrophically, indicates it has developed a more generalizable or powerful reasoning capability that the others lack. The near-identical, low scores of the other methods on MathVision point to a specific, shared limitation in their approach that scaling to @8 overcomes.
The chart tells a story of progressive capability enhancement. The Zero-shot and DreamPRM@2 models are tightly clustered, showing limited benefit from minimal scaling. DreamPRM@4 shows a clear step up, and DreamPRM@8 represents a substantial leap, particularly in conquering the most difficult benchmark (MathVision). This visual evidence strongly supports the conclusion that the scaling mechanism employed in DreamPRM is highly effective for improving multimodal mathematical reasoning.
</details>
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Best-of-N Accuracy with Different Models
### Overview
The image is a line chart titled "Best-of-N accuracy with different models." It compares the performance of three distinct AI models as a function of the number of selected Chain-of-Thought (CoT) reasoning paths, denoted as 'k'. The chart demonstrates how accuracy improves for each model as more CoT paths are considered.
### Components/Axes
* **Title:** "Best-of-N accuracy with different models" (centered at the top).
* **Y-Axis:** Labeled "Accuracy (%)". The scale runs from 65.0 to 85.0, with major tick marks every 2.5 units (65.0, 67.5, 70.0, 72.5, 75.0, 77.5, 80.0, 82.5, 85.0).
* **X-Axis:** Labeled "Number of selected CoTs (k)". The scale shows discrete values: 2, 4, 6, and 8.
* **Legend:** Positioned in the top-left quadrant of the plot area. It contains three entries:
1. **InternVL-2.5-8B-MPO:** Represented by a blue line with circular markers.
2. **GPT-4.1-mini (4-14-25):** Represented by a red line with square markers.
3. **o4-mini (4-16-25):** Represented by a green line with 'x' (cross) markers.
* **Baseline Indicators:** Each model has a corresponding horizontal dashed line in its color, indicating a baseline accuracy level (likely the model's performance without CoT selection or with k=1).
### Detailed Analysis
The chart plots three data series, each showing a positive, roughly linear trend where accuracy increases with the number of selected CoTs (k).
**1. InternVL-2.5-8B-MPO (Blue Line, Circle Markers)**
* **Trend:** Slopes gently upward from left to right.
* **Data Points (Approximate):**
* k=2: ~65.2%
* k=4: ~66.5%
* k=6: ~67.8%
* k=8: ~68.9%
* **Baseline (Blue Dashed Line):** Positioned at approximately 65.4%.
**2. GPT-4.1-mini (4-14-25) (Red Line, Square Markers)**
* **Trend:** Slopes upward, with a slightly steeper incline than the blue line.
* **Data Points (Approximate):**
* k=2: ~71.8%
* k=4: ~72.5%
* k=6: ~73.2%
* k=8: ~74.4%
* **Baseline (Red Dashed Line):** Positioned at approximately 71.5%.
**3. o4-mini (4-16-25) (Green Line, Cross Markers)**
* **Trend:** Slopes upward with the steepest incline of the three models.
* **Data Points (Approximate):**
* k=2: ~81.5%
* k=4: ~82.5%
* k=6: ~84.2%
* k=8: ~85.2%
* **Baseline (Green Dashed Line):** Positioned at approximately 80.5%.
### Key Observations
1. **Consistent Hierarchy:** The o4-mini model consistently achieves the highest accuracy across all values of k, followed by GPT-4.1-mini, and then InternVL-2.5-8B-MPO. The performance gaps between models are significant and remain relatively stable.
2. **Positive Scaling:** All three models show a clear benefit from increasing the number of selected CoTs (k). The accuracy gain from k=2 to k=8 is approximately 3.7% for InternVL, 2.6% for GPT-4.1-mini, and 3.7% for o4-mini.
3. **Baseline Comparison:** For GPT-4.1-mini and o4-mini, the plotted accuracy at k=2 already exceeds the respective dashed baseline; InternVL's k=2 point sits essentially at its baseline (~65.2% vs. ~65.4%). By k=4, all three models clearly exceed their baselines, indicating that selecting among even a few CoT paths yields a measurable improvement.
4. **Uneven Marginal Gains:** While the overall trend is positive, the per-step gains are not uniform: some later increments (e.g., GPT-4.1-mini from k=6 to k=8) are larger than earlier ones. Over a wider range of k, Best-of-N sampling typically exhibits sub-linear, diminishing returns.
### Interpretation
This chart provides empirical evidence for the "Best-of-N" sampling strategy in AI reasoning tasks. The data suggests that:
* **CoT Selection is Effective:** Generating multiple reasoning paths (CoTs) and selecting among them (here, ranked by a process reward model) reliably improves final answer accuracy compared to a single-path baseline for all tested models.
* **Model Capability is Paramount:** While the strategy improves all models, the underlying capability of the base model (o4-mini > GPT-4.1-mini > InternVL) is the primary determinant of absolute performance. The best strategy cannot close the fundamental gap between model generations or architectures.
* **Practical Trade-off:** Best-of-N scaling comes at a computational cost: generating and scoring k paths grows linearly with k, while the accuracy gain is bounded. The optimal 'k' in practice balances this cost against the expected gain.
* **Consistency of Improvement:** The fact that all models follow the same trend reinforces the generalizability of the Best-of-N technique across different model families and sizes. The dates in the model names (4-14-25, 4-16-25) are model version dates.
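The selection procedure this chart evaluates can be sketched in a few lines. The following is a minimal, illustrative implementation of Best-of-N selection with a process reward model; the geometric-mean aggregation and the `prm_score_fn` interface are assumptions for illustration, not the paper's exact scoring function:

```python
import math

def aggregate_step_scores(step_scores):
    """Collapse per-step PRM probabilities into one CoT-level score.

    The geometric mean is one common choice; the actual aggregation
    function used in the paper may differ.
    """
    return math.exp(sum(math.log(s) for s in step_scores) / len(step_scores))

def best_of_n(cots, prm_score_fn, k):
    """Return the highest-scoring chain-of-thought among the first k.

    cots: list of CoTs, each a list of reasoning steps (strings).
    prm_score_fn: hypothetical scorer mapping a step prefix to a
        correctness probability in (0, 1).
    """
    best_score, best_cot = float("-inf"), None
    for cot in cots[:k]:
        # Score every step given its prefix, then aggregate.
        step_scores = [prm_score_fn(cot[: i + 1]) for i in range(len(cot))]
        score = aggregate_step_scores(step_scores)
        if score > best_score:
            best_score, best_cot = score, cot
    return best_cot
```

Increasing k enlarges the candidate pool, which raises the chance that a high-quality CoT is present (the scaling effect the chart shows) at the cost of generating and scoring more candidates.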
</details>
Figure 6: Scaling ability and cross-model generalization. (a) Radar chart of five multimodal reasoning benchmarks shows that DreamPRM delivers monotonic accuracy gains as the number of selected chains-of-thought increases (@2, @4, @8) over the pass@1 baseline. (b) Best-of-$N$ accuracy curves for InternVL-2.5-8B-MPO (blue), GPT-4.1-mini (red) and o4-mini (green) on MathVista confirm that the same DreamPRM-ranked CoTs generalize across models, consistently outperforming pass@1 performance (dashed lines) as $k$ grows.
### 5.4 Scaling and generalization analysis of DreamPRM
DreamPRM scales reliably with more CoT candidates. As shown in the left panel of Fig. 6, the accuracy of DreamPRM consistently improves on all five benchmarks as the number of CoTs increases from $k{=}2$ to $k{=}8$, expanding the radar plot outward. Intuitively, a larger candidate set increases the likelihood of including high-quality reasoning trajectories, but it also makes identifying the best ones more challenging. The consistent performance gains indicate that DreamPRM effectively verifies and ranks CoTs, demonstrating its robustness in selecting high-quality reasoning trajectories from larger candidate pools.
DreamPRM transfers seamlessly to stronger base MLLMs. The right panel of Fig. 6 shows the MathVista accuracy when applying DreamPRM to recent MLLMs, GPT-4.1-mini (2025-04-14) [46] and o4-mini (2025-04-16) [45]. For the o4-mini model, the pass@1 score of 80.6% steadily increases to 85.2% at $k{=}8$, surpassing the previous state-of-the-art performance. This best-of-$N$ trend, previously observed with InternVL, also holds for GPT-4.1-mini and o4-mini, demonstrating the generalization ability of DreamPRM. Full results of these experiments are provided in Tab. 3.
### 5.5 Ablation study
In this section, we investigate the importance of three components in DreamPRM: (1) bi-level optimization, (2) the aggregation-function loss in the upper level, and (3) the structural thinking prompt (detailed in Section 5.1). As shown in the rightmost panel of Fig. 5, the complete DreamPRM achieves the best results compared to the three ablation baselines across all five benchmarks. Eliminating bi-level optimization causes a large performance drop (e.g., -3.5% on MathVista and -3.4% on MMStar). Removing the aggregation-function loss leads to a consistent 1%-2% decline (e.g., 57.4% $\rightarrow$ 56.3% on WeMath). Excluding structural thinking also degrades performance (e.g., -1.8% on MathVision). These results indicate that all three components are critical for DreamPRM to achieve the best performance. More detailed results are shown in Appendix Tab. 4.
### 5.6 Analysis of learned domain weights
<details>
<summary>x8.png Details</summary>

### Visual Description
## Horizontal Bar Chart: Domain Weights
### Overview
The image displays a horizontal bar chart titled "Domain Weights." It visually compares numerical weight values assigned to 15 distinct domains, likely representing datasets or task categories in a machine learning or evaluation context. The bars are arranged in descending order of weight from top to bottom.
### Components/Axes
* **Chart Title:** "Domain Weights" (centered at the top).
* **Y-Axis (Vertical):** Lists the names of 15 domains. The axis itself is not labeled with a title, but the domain names serve as categorical labels.
* **X-Axis (Horizontal):** Represents the numerical weight value. It is labeled with major tick marks at intervals of 0.2, ranging from 0.0 to 1.4. The axis line is visible at the bottom.
* **Data Representation:** Each domain is represented by a horizontal bar. The length of the bar corresponds to its weight value, which is also printed numerically at the end of each bar.
* **Color Coding:** Each bar is filled with a distinct, solid color. There is no separate legend; the domain names are directly adjacent to their respective bars.
### Detailed Analysis
The domains and their corresponding weights, listed from highest to lowest, are as follows:
1. **m3cot** (Orange bar): **1.49**
2. **figureqa** (Light brown/tan bar): **1.47**
3. **unifiedqa** (Purple bar): **1.16**
4. **infographicqa** (Light purple/lavender bar): **1.16**
5. **chartqa** (Pink bar): **1.10**
6. **geo170k** (Light red/salmon bar): **1.06**
7. **scienceqa** (Green bar): **1.05**
8. **tatqa** (Light green bar): **1.01**
9. **geomverse** (Grey bar): **0.98**
10. **mapqa** (Light grey bar): **0.97**
11. **clever** (Dark purple/maroon bar): **0.95**
12. **geosqa** (Brown bar): **0.84**
13. **dvqa** (Teal bar): **0.79**
14. **iconqa** (Light yellow/beige bar): **0.75**
15. **ai2d** (Blue bar): **0.55**
**Trend:** The data series exhibits a clear descending trend. The weights start at a high of 1.49 and decrease steadily, with a notable drop between the top two domains (m3cot, figureqa) and the next cluster (unifiedqa, infographicqa). The decline continues to the lowest value of 0.55 for ai2d.
### Key Observations
* **Top Performers:** `m3cot` and `figureqa` are the clear outliers with weights significantly above 1.4, separating them from the rest of the group.
* **Clustering:** There is a tight cluster of domains with weights between approximately 0.95 and 1.16 (from `unifiedqa` down to `clever`).
* **Color Distinction:** The use of a wide palette of distinct colors (orange, brown, purples, pinks, reds, greens, greys, teal, yellow, blue) makes each bar easily distinguishable, aiding in visual comparison.
* **Precision:** Each weight is reported to two decimal places, suggesting a calculated or normalized metric.
### Interpretation
This chart provides a comparative ranking of domains based on an assigned "weight." In a technical context, such weights often signify importance, difficulty, prevalence, or a performance score within a model's training or evaluation framework.
* **What the data suggests:** The domain `m3cot` is considered the most significant or highest-performing according to this metric, followed closely by `figureqa`. The domain `ai2d` is weighted the lowest, indicating it may be considered less critical, easier, or have a lower performance baseline in this specific context.
* **How elements relate:** The direct labeling of values on the bars eliminates ambiguity and allows for precise reading. The descending order immediately communicates the ranking. The color coding, while not explained in a legend, serves to visually separate the many categories.
* **Notable patterns:** The significant gap between the top two domains and the rest suggests a potential tier system. The cluster in the middle indicates a group of domains with relatively similar weightings. The chart effectively communicates a hierarchy of importance or outcome across these 15 evaluated items.
</details>
Figure 7: Learned domain weights after the convergence of the DreamPRM training process.
The final domain weights (Fig. 7) range from 0.55 to 1.49: M3CoT [6] and FigureQA [21] receive the highest weights (approximately 1.5), while AI2D [23] and IconQA [36] are assigned lower weights (less than 0.8). This learned weighting pattern contributes to improved PRM performance, indicating that the quality imbalance problem across reasoning datasets is real and consequential. Additionally, as shown in Fig. 9 in Appendix, all domain weights are initialized to 1.0 and eventually converge during the training process of DreamPRM.
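The bi-level weight search that produces these values can be illustrated on a toy problem. The sketch below is a simplification under stated assumptions, not the paper's actual training loop: a scalar model stands in for the PRM, quadratic per-domain losses stand in for the fine-tuning loss, a quadratic meta loss stands in for the upper-level aggregation loss, and the hypergradient is a one-step approximation (as in DARTS-style methods):

```python
def bilevel_step(theta, weights, targets, meta_target,
                 inner_lr=0.1, outer_lr=0.5):
    """One DreamPRM-style bi-level update on a toy scalar model.

    Illustrative assumptions: domain d has training loss
    (theta - targets[d])**2, and the meta loss on the updated model
    is (theta_new - meta_target)**2.
    """
    # Lower level: one gradient step on the domain-weighted training loss.
    grads = [2.0 * (theta - t) for t in targets]            # dL_d / dtheta
    mixed_grad = sum(w * g for w, g in zip(weights, grads))
    theta_new = theta - inner_lr * mixed_grad

    # Upper level: differentiate the meta loss through the inner step
    # (one-step hypergradient: d theta_new / d w_d = -inner_lr * g_d)
    # and update the domain weights, clamped to stay non-negative.
    dmeta_dtheta = 2.0 * (theta_new - meta_target)
    new_weights = [
        max(0.0, w - outer_lr * dmeta_dtheta * (-inner_lr * g))
        for w, g in zip(weights, grads)
    ]
    return theta_new, new_weights
```

Iterating this update shifts weight toward domains whose gradients move the model toward lower meta-set loss, mirroring how the weights in Fig. 7 drift away from their 1.0 initialization during training.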
## 6 Conclusions
We propose DreamPRM, the first domain-reweighted PRM framework for multimodal reasoning. By automatically searching for domain weights using a bi-level optimization framework, DreamPRM effectively mitigates issues caused by dataset quality imbalance and significantly enhances the generalizability of multimodal PRMs. Extensive experiments on five diverse benchmarks confirm that DreamPRM outperforms both vanilla PRMs without domain reweighting and PRMs using heuristic data selection methods. We also observe that the domain weights learned by DreamPRM correlate with dataset quality, effectively separating challenging, informative sources from overly simplistic or noisy ones. These results highlight the effectiveness of our proposed automatic domain reweighting strategy.
## Acknowledgments
This work was supported by the National Science Foundation (IIS2405974 and IIS2339216) and the National Institutes of Health (R35GM157217).
## References
- [1] AIDC-AI. Ovis2-34b (model card). https://huggingface.co/AIDC-AI/Ovis2-34B, 2025. Related paper: arXiv:2405.20797; Accessed 2025-10-15.
- [2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
- [3] Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler-Lussier, and Ningchuan Xiao. Mapqa: A dataset for question answering on choropleth maps, 2022.
- [4] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression, 2022.
- [5] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models?, 2024.
- [6] Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought, 2024.
- [7] Sang Keun Choe, Willie Neiswanger, Pengtao Xie, and Eric Xing. Betty: An automatic differentiation library for multilevel optimization. In The Eleventh International Conference on Learning Representations, 2023.
- [8] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. PaLM: Scaling language modeling with pathways, 2022.
- [9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
- [10] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025.
- [11] Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. Progressive multimodal reasoning via active retrieval, 2024.
- [12] Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge: Domain reweighting with generalization estimation, 2024.
- [13] Simin Fan, Matteo Pagliardini, and Martin Jaggi. DOGE: Domain reweighting with generalization estimation. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 12895–12915. PMLR, 21–27 Jul 2024.
- [14] Chelsea Finn, P. Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
- [15] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, and Lingpeng Kong. G-llava: Solving geometric problem with multi-modal large language model, 2023.
- [16] Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, and Jingbo Zhu. Clustering and ranking: Diversity-preserved instruction selection through expert-aligned quality estimation, 2024.
- [17] Jiayi He, Hehai Lin, Qingyun Wang, Yi Fung, and Heng Ji. Self-correction is more than refinement: A learning framework for visual and language reasoning tasks, 2024.
- [18] Wenxuan Huang, Bohan Jia, Zijie Zhai, et al. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
- [19] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, and Hongsheng Li. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency, 2025.
- [20] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering, 2018.
- [21] Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Akos Kadar, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning, 2018.
- [22] Mehran Kazemi, Hamidreza Alvari, Ankit Anand, Jialin Wu, Xi Chen, and Radu Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning, 2023.
- [23] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016.
- [24] Kimi Team. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025.
- [25] Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213, 2022.
- [26] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer, 2024.
- [27] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making large language models better reasoners with step-aware verifier, 2023.
- [28] Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges. arXiv preprint arXiv:2501.02189, 2025.
- [29] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024.
- [30] Adam Dahlgren Lindström and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning, 2022.
- [31] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.
- [32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
- [33] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024.
- [34] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In The 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
- [35] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
- [36] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning, 2022.
- [37] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve mathematical reasoning in language models by automated process supervision, 2024.
- [38] Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. Let's reward step by step: Step-level reward model as the navigators for reasoning, 2023.
- [39] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning, 2022.
- [40] Minesh Mathew, Viraj Bagal, Rubén Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. InfographicVQA, 2021.
- [41] Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal intelligence. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025. Llama 4 Maverick announcement; Accessed 2025-10-15.
- [42] Meta Llama. Llama-4-maverick-17b-128e-instruct (model card). https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct, 2025. Accessed 2025-10-15.
- [43] Moonshot AI / Kimi. Kimi-k1.6-preview-20250308 (preview announcement). https://x.com/RotekSong/status/1900061355945926672, 2025. Accessed 2025-10-15; preview model announcement.
- [44] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025.
- [45] OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, et al. OpenAI o1 system card, 2024.
- [46] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. GPT-4 technical report, 2024.
- [47] Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces. https://huggingface.co/datasets/open-r1/codeforces, 2025.
- [48] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. We-math: Does your large multimodal model achieve human-like mathematical reasoning?, 2024.
- [49] Qwen Team: An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report, 2025.
- [51] Machel Reid et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
- [51] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1466–1476, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
- [52] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.
- [53] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting, 2019.
- [54] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- [55] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024.
- [56] Shezheng Song, Xiaopeng Li, Shasha Li, Shan Zhao, Jie Yu, Jun Ma, Xiaoguang Mao, and Weimin Zhang. How to bridge the gap between modalities: Survey on multimodal large language model, 2025.
- [57] Daouda Sow, Herbert Woisetschläger, Saikiran Bulusu, Shiqiang Wang, Hans-Arno Jacobsen, and Yingbin Liang. Dynamic loss-based sample reweighting for improved large language model pretraining, 2025.
- [58] StepFun. Step-r1-v-mini: A lightweight yet powerful multimodal reasoning model. https://www.stepfun.com/docs/en/step-r1-v-mini, 2025. Accessed 2025-10-15.
- [59] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
- [60] Volcengine / ByteDance. Doubao large models (product page). https://www.volcengine.com/product/doubao, 2025. Accessed 2025-10-15.
- [61] Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, and Bo An. Q*: Improving multi-step reasoning for llms with deliberative planning, 2024.
- [62] Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025.
- [63] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset, 2024.
- [64] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
- [65] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
- [66] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [67] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024.
- [68] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023.
- [69] Zihan Wang, Yunxuan Li, Yuexin Wu, Liangchen Luo, Le Hou, Hongkun Yu, and Jingbo Shang. Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision, 2024.
- [70] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [71] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022.
- [72] Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S. Yu. Multimodal large language models: A survey, 2023.
- [73] Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [74] Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2024.
- [75] Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2025.
- [76] Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. In The Thirteenth International Conference on Learning Representations, 2025.
- [77] Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Rlaif-v: Open-source ai feedback leads to super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024.
- [78] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2024.
- [79] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024.
- [80] Di Zhang. Aime_1983_2024 (revision 6283828), 2025.
- [81] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations, 2023.
- [82] Haojie Zheng, Tianyang Xu, Hanchi Sun, Shu Pu, Ruoxi Chen, and Lichao Sun. Thinking before looking: Improving multimodal llm reasoning via mitigating visual hallucination, 2024.
## NeurIPS Paper Checklist
1. Claims
1. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
1. Answer: [Yes]
1. Justification: The abstract and introduction faithfully present the contributions and scope of the paper.
1. Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
1. Limitations
1. Question: Does the paper discuss the limitations of the work performed by the authors?
1. Answer: [Yes]
1. Justification: We include the limitations of our work in Section E.
1. Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
1. Theory assumptions and proofs
1. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
1. Answer: [N/A]
1. Justification: This paper does not include theoretical results.
1. Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
1. Experimental result reproducibility
1. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
1. Answer: [Yes]
1. Justification: All the information needed to reproduce the main experimental results is provided in Sections 3, 4, and 5. We will release the implementation if the paper is accepted.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
1. If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
1. If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
1. If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
1. We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
1. Open access to data and code
1. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
1. Answer: [Yes]
1. Justification: We will release the code if the paper is accepted, or through an anonymous link upon a reviewer's request.
1. Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
1. Experimental setting/details
1. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
1. Answer: [Yes]
1. Justification: The detailed experimental settings are included in Section 5.1 and Appendices B and C.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.
1. Experiment statistical significance
1. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
1. Answer: [No]
1. Justification: Due to resource limitations, we do not report error bars. However, we conduct experiments on diverse datasets and follow the protocol used by previous works to ensure fair comparisons.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
1. Experiments compute resources
1. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
1. Answer: [Yes]
1. Justification: Compute resources used in the experiments are reported in Section 5.1.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU), internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
1. Code of ethics
1. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
1. Answer: [Yes]
1. Justification: Our paper follows the NeurIPS Code of Ethics.
1. Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
1. Broader impacts
1. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
1. Answer: [Yes]
1. Justification: Our work enhances multimodal reasoning with DreamPRM. Although the models can still produce errors, we advise against relying entirely on LLM outputs and do not foresee a major negative societal impact.
1. Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
1. Safeguards
1. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
1. Answer: [N/A]
1. Justification: This paper poses no such risks.
1. Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
1. Licenses for existing assets
1. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
1. Answer: [Yes]
1. Justification: We have properly cited papers and models used in our paper.
1. Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
1. New assets
1. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
1. Answer: [Yes]
1. Justification: We will release our code with detailed readme files and instructions.
1. Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
1. Crowdsourcing and research with human subjects
1. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
1. Answer: [N/A]
1. Justification: This work does not involve crowdsourcing nor research with human subjects.
1. Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
1. Institutional review board (IRB) approvals or equivalent for research with human subjects
1. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
1. Answer: [N/A]
1. Justification: This work does not involve crowdsourcing nor research with human subjects.
1. Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
1. Declaration of LLM usage
1. Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, declaration is not required.
1. Answer: [Yes]
1. Justification: LLMs, specifically MLLMs, are used in the experiments, as the paper is about multimodal reasoning. Their usage is described in Sections 3 and 4. In terms of writing, LLMs are only used for checking grammar, spelling, and word choices.
1. Guidelines:
- The answer NA means that the core method development in this research does not involve LLMs as any important, original, or non-standard components.
- Please refer to our LLM policy (https://neurips.cc/Conferences/2025/LLM) for what should or should not be described.
## Appendix
## Appendix A Optimization algorithm
Directly solving the bi-level optimization problem in Equation 9 can be computationally prohibitive due to its nested structure. Following previous work [7], we use an approximate algorithm with a few unrolling steps. For example, under one-step unrolling, the update of the PRM's weights can be expressed as:
$$
\phi^{(t+1)}=\phi^{(t)}-\beta_1\nabla_{\phi}\mathcal{L}_{tr}(D_{tr},\phi,\alpha) \tag{10}
$$
where $\beta_1$ is the learning rate of the lower-level optimization. After obtaining the updated PRM parameters $\phi^{(t+1)}$ from Equation 10, the domain-reweighting parameter $\alpha$ is updated as follows:
$$
\alpha^{(t+1)}=\alpha^{(t)}-\beta_2\nabla_{\alpha}\mathcal{L}_{meta}(D_{meta},\phi^{*}(\alpha)) \tag{11}
$$
where $\beta_2$ is the learning rate of the upper-level optimization. The two optimization steps in Equations 10 and 11 are performed iteratively until convergence, yielding the optimal PRM weights $\phi^{*}$ and the optimal domain-reweighting parameters $\alpha^{*}$.
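The alternating updates in Equations 10 and 11 can be illustrated on a toy problem. The sketch below is a simplified illustration, not the paper's implementation: each domain has a scalar quadratic training loss, domain weights are a softmax of the logits $\alpha$, and the meta gradient is obtained by differentiating through one unrolled lower-level step. All constants (`c`, `c_meta`, the learning rates) are hypothetical.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy setup (hypothetical): domain k has training loss (phi - c[k])^2,
# and the meta loss is (phi - c_meta)^2, which favours domain 1.
c = np.array([0.0, 1.0])
c_meta = 1.0
beta1, beta2 = 0.1, 0.5   # lower-/upper-level learning rates

phi = 0.5                  # scalar stand-in for the PRM parameters
alpha = np.zeros(2)        # domain logits; weights w = softmax(alpha)

for _ in range(200):
    w = softmax(alpha)
    # Lower level (Eq. 10): one gradient step on the weighted training loss.
    g_phi = np.sum(w * 2.0 * (phi - c))
    phi_new = phi - beta1 * g_phi
    # Upper level (Eq. 11): gradient of the meta loss through the unrolled
    # step, using d w_j / d alpha_k = w_j * ((j == k) - w_k).
    dg_dalpha = np.array([
        sum(2.0 * (phi - c[j]) * w[j] * ((j == k) - w[k]) for j in range(2))
        for k in range(2)
    ])
    g_alpha = 2.0 * (phi_new - c_meta) * (-beta1) * dg_dalpha
    alpha = alpha - beta2 * g_alpha
    phi = phi_new

w = softmax(alpha)
# The domain aligned with the meta objective ends up upweighted (w[1] > w[0]),
# and phi is pulled toward c_meta.
```

Running the loop drives the weight of the meta-aligned domain toward one, which is the mechanism by which the upper level suppresses low-quality domains.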
## Appendix B Datasets and benchmarks
Table 2: Multimodal datasets involved in the fine-tuning of DreamPRM, organized by task category.
| Category | Datasets |
| --- | --- |
| Science | AI2D [23], ScienceQA [35], M3CoT [6] |
| Chart | ChartQA [39], DVQA [20], MapQA [3], FigureQA [21] |
| Geometry | Geo170k [15], Geometry3K [34], UniGeo [4], GeomVerse [22], GeoS [51] |
| Commonsense | IconQA [36], InfographicsVQA [40], CLEVR-Math [30] |
For the datasets used in lower-level optimization ( $D_tr$ in Section 4), our study utilizes a diverse collection spanning multiple domains to ensure comprehensive coverage of multimodal reasoning tasks, as reported in Tab. 2. The selected 15 multimodal datasets cover four major categories (science, chart, geometry, and commonsense) with a wide range of task types (QA, OCR, spatial understanding). Additionally, we observe that for some questions, given the current structural thinking prompts, MLLMs consistently produce either correct or incorrect answers. Continuing to sample such questions wastes computational resources. Inspired by the dynamic sampling strategy in DAPO [78], we propose a similar dynamic sampling technique for Monte Carlo estimation that focuses on prompts with varied outcomes to improve efficiency. After processing and sampling, the lower-level training set $D_tr$ contains around 15k examples (1k per each of the 15 domains), while the upper-level meta dataset $D_meta$ contains around 1k validation examples from the MMMU [79] dataset.
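The dynamic sampling filter described above can be sketched as follows. This is a minimal illustration, not the paper's actual interface: the function names and the batch format (a question id plus per-rollout correctness flags) are assumptions.

```python
def has_varied_outcomes(correct_flags):
    """Keep a question for Monte Carlo step-score estimation only if its
    sampled rollouts contain both correct and incorrect final answers;
    all-correct or all-incorrect questions provide no training signal."""
    n_correct = sum(correct_flags)
    return 0 < n_correct < len(correct_flags)

def filter_batch(batch):
    """batch: list of (question_id, [bool per rollout]) pairs."""
    return [qid for qid, flags in batch if has_varied_outcomes(flags)]

# Example: only q2 shows both outcomes and is kept for further sampling.
batch = [
    ("q1", [True] * 8),                  # always solved: dropped
    ("q2", [True, False, True, False]),  # mixed outcomes: kept
    ("q3", [False] * 8),                 # never solved: dropped
]
kept = filter_batch(batch)               # -> ["q2"]
```

Filtering before further Monte Carlo rollouts concentrates compute on the questions whose step scores are informative.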
For the dataset used in upper-level optimization ( $D_meta$ in Section 4), we select data from MMMU [79] to simulate a realistic and diverse reasoning scenario. MMMU focuses on advanced perception and reasoning with domain-specific knowledge. Its questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.
At evaluation time, we use five multimodal reasoning benchmarks to test the capability of DreamPRM. WeMath [48], MathVista [33], and MathVision [63] focus on math-related reasoning, logic, and critical thinking, while MMVet [78] and MMStar [5] focus on real-life tasks that require common knowledge and general reasoning abilities.
## Appendix C Structural Thinking Prompt
The detailed structural thinking prompt applied in our experiments is reported in Fig. 8. We carefully design 5 reasoning steps to boost the reasoning capabilities of the MLLMs and enable process supervision.
<details>
<summary>figures/7-1.png Details</summary>

### Visual Description
## [Diagram/Text Composite]: 5-Step Structural Thinking for Multimodal Reasoning with Math Example
### Overview
The image is a composite instructional graphic. The left side contains a textual description of a five-step reasoning framework. The right side contains a mathematical function plot and a corresponding multiple-choice question with metadata, serving as an example of the framework's application. The overall purpose is to illustrate a structured method for solving problems that integrate visual and textual information.
### Components/Axes
**Left Panel - Textual Framework:**
* **Title:** "5-step structural thinking for multimodal reasoning"
* **Introductory Text:** "You have been given a question that involves both an image and a text. Your task is to analyze the question by following exactly five steps:"
* **Step 1:** "Step 1: **Restate the question.**"
* Sub-point: "- Clearly rephrase or clarify the question in your own words."
* **Step 2:** "Step 2: **Gather evidence from the image.**"
* Sub-point: "- Describe any relevant visual details (e.g., objects, people, locations, interactions) that might address the question."
* **Step 3:** "Step 3: **Identify any background knowledge needed.**"
* Sub-point: "- Note any general facts, assumptions, or external knowledge that is necessary to address the question."
* **Step 4:** "Step 4: **Reason with the current evidence.**"
* Sub-point: "- Integrate the information from the image, text, and relevant background knowledge."
* Sub-point: "- Show how these pieces of evidence lead toward an answer."
* **Step 5:** "Step 5: **Summarize and conclude with all the information.**"
* Sub-point: "- Provide a concise, direct answer to the question, referencing the supporting evidence and reasoning."
* **Final Instruction:** "Once you have completed your reasoning, provide your final answer in the format: **Final answer: ...**"
**Right Panel - Mathematical Example:**
* **Graph (Top Right):**
* **Type:** 2D line plot of two functions.
* **X-axis:** Linear scale, labeled with integers from -4 to 4. Major ticks at every integer.
* **Y-axis:** Linear scale, labeled from -1.0 to 1.0 with increments of 0.5. The axis label is "f(x)" (rotated vertically).
* **Data Series 1 (Red Line):** A smooth, periodic wave. It starts at approximately (0, 1), decreases to a minimum near (-1.57, -1), rises to a maximum near (1.57, 1), and falls again. This visually corresponds to the function `f(x) = cos(x)`.
* **Data Series 2 (Blue Line):** A smooth, periodic wave. It starts at approximately (0, 0), increases to a maximum near (1.57, 1), and falls to a minimum near (4.71, -1). This visually corresponds to the function `f(x) = sin(x)`.
* **Legend:** Not explicitly present. The functions are distinguished solely by color (red and blue).
* **Question & Answer Block (Below Graph):**
* **Question:** "Question: Which function is monotonic in range [0, pi]?"
* **Choices:**
* "(A) the red one"
* "(B) the blue one"
* "(C) both"
* "(D) none of them"
* **Answer:** "Answer: **(B) the blue one**" (The text "(B) the blue one" is highlighted in green).
* **Metadata Block (Below Answer):**
* **Category:** "Math-targeted"
* **Task:** "Textbook question answering"
* **Context:** "Function plot"
* **Grade:** "College"
* **Math:** "Algebraic reasoning"
* **Source Label (Bottom Right Corner):** "MathVista"
### Detailed Analysis
**Textual Content:** The left panel provides a complete, self-contained procedural guide. Each step is clearly numbered and bolded, with explanatory sub-points. The language is instructional and precise.
**Graph Analysis:**
* **Red Line Trend:** The line has a negative slope from x=0 to x≈3.14 (π), indicating a decreasing trend over the interval [0, π]. It is not monotonic over the full visible range [-4, 4] as it changes direction multiple times.
* **Blue Line Trend:** The line has a positive slope from x=0 to x≈1.57 (π/2), then a negative slope from x≈1.57 to x≈4.71 (3π/2). Over the specific interval [0, π], it increases from 0 to π/2 and then decreases from π/2 to π. Therefore, it is **not** monotonic over the entire interval [0, π].
* **Discrepancy Note:** The provided answer "(B) the blue one" contradicts the visual evidence from the graph. Based on standard mathematical definitions and the plotted curves, the red function (cosine) is monotonic (strictly decreasing) on [0, π], while the blue function (sine) is not. This is a critical observation.
### Key Observations
1. **Instructional Design:** The image pairs a general methodology (left) with a concrete example (right), demonstrating the framework in action.
2. **Graph-Answer Contradiction:** The most significant observation is the conflict between the visual data in the graph and the provided answer. The graph shows the blue function is not monotonic on [0, π], yet it is marked as correct.
3. **Metadata Context:** The metadata clarifies this is a college-level algebraic reasoning task from a dataset or system called "MathVista," framed as a "Textbook question answering" task.
4. **Spatial Layout:** The instructional text occupies the majority (left ~65%) of the image. The example is contained in a bordered box on the right (~35%), with the graph at the top and the question/answer/metadata stacked below it.
### Interpretation
This image serves a dual purpose. Primarily, it is a **pedagogical tool** teaching a structured, five-step approach to multimodal reasoning. The framework emphasizes restating the problem, gathering visual evidence, applying background knowledge, reasoning, and concluding.
Secondarily, the included math problem acts as a **case study**. However, the apparent error in the provided answer (stating the blue/sine function is monotonic on [0, π]) creates an interesting subtext. It could be interpreted in several ways:
* It may be an intentional error to test the viewer's critical application of the very framework being taught (i.e., "Step 2: Gather evidence from the image" should lead one to question the given answer).
* It may be a simple mistake in the example material.
* It highlights the importance of the reasoning process (Steps 3 & 4) over passively accepting provided answers, especially when visual evidence contradicts a stated conclusion.
The "MathVista" label suggests this image is part of a larger dataset or benchmark for evaluating AI systems' ability to perform such multimodal reasoning, where correctly identifying such discrepancies would be a key test of capability. The image, therefore, not only teaches a method but also implicitly demonstrates the kind of critical analysis required when human or AI-generated answers must be verified against visual data.
</details>
Figure 8: Zero-shot prompting for structural thinking.
Table 3: Accuracy on MathVista using DreamPRM with varying numbers $k$ of CoTs.
| Model | | | | | |
| --- | --- | --- | --- | --- | --- |
| InternVL-2.5-8B-MPO [67] | 65.4 | 65.3 | 66.5 | 67.8 | 68.9 |
| GPT-4.1-mini (4-14-25) [46] | 71.5 | 71.8 | 72.5 | 73.2 | 74.4 |
Table 4: Ablation study evaluating the impact of individual components of DreamPRM
| Method | WeMath | MathVista | MathVision | MMVet | MMStar |
| --- | --- | --- | --- | --- | --- |
| DreamPRM (original) | 57.4 | 68.9 | 22.1 | 61.4 | 62.3 |
| w/o aggregation function loss | 56.3 (-1.1) | 66.1 (-2.8) | 20.1 (-2.0) | 60.0 (-1.4) | 59.6 (-2.7) |
| w/o bi-level optimization | 55.0 (-2.4) | 65.4 (-3.5) | 19.9 (-2.2) | 61.2 (-0.2) | 58.9 (-3.4) |
| w/o structural thinking | 54.6 (-2.8) | 65.7 (-3.2) | 20.3 (-1.8) | 57.5 (-3.9) | 61.6 (-0.7) |
## Appendix D Additional Experimental Results
Leaderboard performance details. Table 5 presents a comprehensive comparison of different PRM variants built upon the same o4-mini backbone. DreamPRM consistently outperforms all baselines, elevating the base o4-mini performance from 80.6 to 85.2. These steady improvements demonstrate the effectiveness of DreamPRM in enhancing reasoning accuracy through process-level supervision and promoting more reliable consensus across multiple chains of thought.
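For reference, the Self-consistency baseline in Table 5 is plain majority voting over final answers, without any reward model. A minimal sketch (tie-breaking by first occurrence is an implementation detail, not specified in the paper):

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the final answers of multiple chains of thought.

    Minimal sketch of the Self-consistency baseline in Table 5; ties are
    broken by first occurrence, a Counter implementation detail rather
    than a choice specified in the paper.
    """
    return Counter(final_answers).most_common(1)[0][0]
```

Because it ignores the reasoning process entirely, majority voting cannot distinguish a confidently wrong consensus from a correct minority answer, which is where process-level scoring gains its advantage.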
Best-of-N results. Tab. 3 reports the accuracy of two state-of-the-art models on the MathVista dataset using DreamPRM with varying numbers $k$ of CoTs. The results indicate that performance scales well with the number of CoTs.
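Best-of-N selection with a PRM can be sketched as follows. Here `step_scorer` is a hypothetical stand-in for the trained PRM, and mean aggregation over step scores is an assumption made for illustration (the paper's actual aggregation function is defined in the main text):

```python
def best_of_n(cots, step_scorer, aggregate=lambda scores: sum(scores) / len(scores)):
    """Return the chain of thought with the highest aggregated PRM score.

    Sketch only: `step_scorer(step)` is assumed to return a scalar reward in
    [0, 1] for one reasoning step, and the mean over steps is an illustrative
    aggregation choice, not the paper's aggregation function.
    """
    # Score every candidate chain, then keep the argmax.
    scored = [(aggregate([step_scorer(step) for step in cot]), cot) for cot in cots]
    return max(scored, key=lambda pair: pair[0])[1]
```

Increasing $k$ enlarges the candidate pool, so the selected chain's quality improves as long as the PRM ranks chains reliably, consistent with the scaling trend in Tab. 3.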
Table 5: Comparison of different PRM variants on the o4-mini model (evaluated on eight CoTs).
| Method | Accuracy |
| --- | --- |
| o4-mini | 80.6 |
| + Self-consistency | 81.7 |
| + ORM | 80.8 |
| + Vanilla-PRM | 84.2 |
| + DreamPRM | 85.2 |
Ablation studies. The exact results of the ablation experiments in the main paper are included in Tab. 4, underscoring the importance of every component of DreamPRM.
Loss curves and domain weights. The loss curves and domain weights during the fine-tuning of DreamPRM are illustrated in Fig. 9. The learned distribution emphasizes informative mathematical figure domains while attenuating less relevant sources. Domain weights start at 1.0, quickly diverge, and stabilize after roughly half the training; the inner and outer losses decrease steadily and plateau, indicating stable convergence of the bi-level training procedure.
Case study. A complete case study illustrating DreamPRM's step-wise evaluation is reported in Fig. 10. DreamPRM assigns higher scores to high-quality, coherent reasoning steps, while penalizing flawed or unsupported steps.
<details>
<summary>figures/6-3.png Details</summary>

### Visual Description
## [Chart Type: Dual Line Charts]
### Overview
The image displays two side-by-side line charts sharing a common x-axis ("Training Progress") but measuring different metrics. The left chart tracks optimization loss, while the right chart tracks the evolution of domain weights for various datasets during a training process. The overall visual suggests a machine learning or model training context where performance (loss) and internal weighting of data sources are being monitored over time.
### Components/Axes
**Left Chart: "Upper & Lower Optimization Loss"**
* **X-Axis:** "Training Progress" (linear scale, from 0.0 to 1.0).
* **Y-Axis:** "Loss" (linear scale, from 0.22 to 0.26).
* **Legend:** Located in the top-right corner.
* Orange line: "Upper Optimization Loss"
* Red line: "Lower Optimization Loss"
**Right Chart: "Domain Weights"**
* **X-Axis:** "Training Progress" (linear scale, from 0.0 to 1.0).
* **Y-Axis:** "Domain Weight" (linear scale, from 0.4 to 1.6).
* **Legend:** Located in the top-left corner, titled "Dataset". It lists 15 datasets with corresponding line colors:
* `ai2d` (blue), `chartqa` (light blue), `m3cot` (orange), `scienceqa` (green), `mapqa` (light green), `geo170k` (pink), `clevr` (purple), `geometry3k` (brown), `figureqa` (light brown), `infographics` (magenta), `unigeo` (gray), `geomverse` (light gray), `iconqa` (olive), `dvqa` (cyan), `geos` (light cyan).
### Detailed Analysis
**Left Chart: Upper & Lower Optimization Loss**
* **Trend Verification:** Both lines show a general downward trend from the start (Training Progress 0.0) to approximately 0.4, followed by a period of fluctuation and relative stability.
* **Upper Optimization Loss (Orange Line):**
* Starts at approximately 0.245.
* Decreases steadily, reaching a local minimum of ~0.22 around progress 0.4.
* Fluctuates between ~0.218 and 0.225 for the remainder of the training, ending near 0.22.
* **Lower Optimization Loss (Red Line):**
* Starts significantly higher at approximately 0.262.
* Shows a steeper initial decline, dropping to ~0.235 by progress 0.4.
* After progress 0.4, it exhibits more volatility than the upper loss, with values ranging roughly between 0.235 and 0.245. It ends the training at approximately 0.245.
**Right Chart: Domain Weights**
* **Trend Verification:** The 15 data series exhibit diverse behaviors. Some show clear upward or downward trends, while others fluctuate around a central value. The lines are densely packed, indicating complex, dynamic adjustments.
* **Key Data Series & Approximate Final Values (at Training Progress 1.0):**
* **Highest Weights:** `m3cot` (orange) ends highest at ~1.55. `infographics` (magenta) is next at ~1.25.
* **Lowest Weights:** `ai2d` (blue) shows the most dramatic decrease, ending at ~0.55. `iconqa` (olive) also trends downward to ~0.75.
* **Clustered Around 1.0:** Many datasets, including `scienceqa` (green), `clevr` (purple), `unigeo` (gray), and `geomverse` (light gray), finish in a band between approximately 0.95 and 1.15.
* **Notable Trajectories:**
* `ai2d` (blue): Sharp, consistent decline from ~1.0 to ~0.55.
* `m3cot` (orange): Strong upward trend from ~1.0 to ~1.55.
* `iconqa` (olive): Gradual decline from ~1.0 to ~0.75.
* `infographics` (magenta): Rises early, peaks near 1.4 around progress 0.5, then declines to ~1.25.
### Key Observations
1. **Loss Convergence:** The "Upper Optimization Loss" converges to a lower and more stable value than the "Lower Optimization Loss," which remains higher and more volatile. This suggests the optimization process for the "upper" component is more effective or stable.
2. **Dynamic Weight Adjustment:** Domain weights are not static; they are actively adjusted throughout training. The system appears to be dynamically re-weighting the importance of different datasets.
3. **Divergent Importance:** The training process significantly increases the weight of `m3cot` and `infographics` datasets while drastically reducing the weight of `ai2d`. This implies the model finds the former more valuable for its objective and the latter less so, or potentially noisy.
4. **Stability Phase:** After an initial period of rapid change (roughly progress 0.0 to 0.4), both loss values and most domain weights enter a phase of slower change or fluctuation, suggesting the training is approaching a steady state.
### Interpretation
This data visualizes the inner dynamics of a multi-domain model training process, likely using a technique like meta-learning or dynamic curriculum learning where the model learns *how* to weight its training data.
* **What the data suggests:** The simultaneous decrease in loss (left chart) and the dramatic re-weighting of domains (right chart) indicate the model is successfully learning to prioritize certain data sources over others to improve its overall performance. The high final weight for `m3cot` (a multi-step chain-of-thought dataset) suggests the training heavily favors data that promotes complex reasoning.
* **Relationship between elements:** The two charts are directly linked. The adjustments in domain weights (right) are the mechanism driving the changes in the optimization loss (left). The system is likely trying to minimize the loss by finding the optimal mix of training data.
* **Notable anomalies:** The starkly different fates of `ai2d` (plummets) and `m3cot` (soars) are the most significant outliers. This could indicate a mismatch between the `ai2d` dataset's content and the model's target task, or that `ai2d` is being "down-weighted" to make room for more beneficial data. The volatility in the "Lower Optimization Loss" after progress 0.4, despite stable weights, might point to an inherent instability in that part of the optimization landscape.
</details>
Figure 9: Optimization loss curves and dynamic domain weights throughout DreamPRM fine-tuning.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Mathematical Problem Analysis: Derivative Comparison
### Overview
The image presents a mathematical problem involving the comparison of derivatives of an absolute value function at two specific points. It includes a graph of the function, the problem statement with multiple-choice answers, the ground truth answer, and two distinct step-by-step solution attempts with confidence scores. The image serves as an educational or benchmarking tool, likely from a dataset like "MathVista," to evaluate reasoning about calculus concepts.
### Components/Axes
**1. Graph (Top-Left Quadrant):**
* **Function:** `f(x) = |2x - 3| + 1`
* **Graph Type:** A V-shaped graph characteristic of an absolute value function.
* **Axes:**
* **X-axis:** Labeled with numerical markers at -5, 0, 5, and 10.
* **Y-axis:** Labeled with numerical markers at 0, 5, and 10.
* **Key Points:** Two points are explicitly marked with blue dots on the graph:
* One point is located at approximately `(2, 2)`.
* Another point is located at approximately `(5, 8)`.
* **Vertex:** The vertex (minimum point) of the V-shape is at `(1.5, 1)`.
**2. Problem Statement (Top-Right Quadrant):**
* **Question:** "The derivative of f(x) at x=2 is ______ that at x=5"
* **Choices:**
* (A) larger than
* (B) equal to
* (C) smaller than
* **Ground Truth:** B
* **Benchmark:** MathVista
**3. Solution Attempt 1 (Middle Box - Green Border):**
* **Title:** "Step 1: Restate the question." through "Step 5: Summarize and conclude..."
* **Content:** A five-step reasoning process concluding that the derivatives are equal.
* **Final Answer:** "Final answer: B ✓" (with a green checkmark).
* **Confidence Scores:** Each step is followed by a `[DreamPRM: X.XXX]` score in green text, ranging from 0.575 to 0.812.
**4. Solution Attempt 2 (Bottom Box - Red Border):**
* **Title:** "Step 1: Restate the question." through "Step 5: Summarize and conclude..."
* **Content:** A five-step reasoning process concluding that the derivative at x=2 is larger (in absolute value) than at x=5.
* **Final Answer:** "Final answer: A ✗" (with a red cross).
* **Confidence Scores:** Each step is followed by a `[DreamPRM: X.XXX]` score in red text, ranging from 0.396 to 0.676.
* **Highlighted Text:** Certain lines of reasoning are highlighted in red, indicating errors.
### Detailed Analysis
**Text Transcription:**
* **Graph Label:** `f(x) = |2x - 3| + 1`
* **Solution 1, Step 3:** "The derivative of `|2x - 3| + 1` is `2` for `x > 1.5` and `-2` for `x < 1.5`."
* **Solution 1, Step 4:** "For `x = 2`, the function is in the interval where `x > 1.5`, so the derivative is `2`." / "For `x = 5`, the function is also in the interval where `x > 1.5`, so the derivative is `2`."
* **Solution 2, Step 3:** "The function `f(x) = |2x - 3| + 1` changes slope at the vertex of the absolute value, which is at `x = \frac{3}{2}`."
* **Solution 2, Step 4 (Highlighted in Red):** "At `x = 2`, the function appears linear with a **negative slope** (since it's to the left of the vertex)." / "At `x = 5`, the function appears linear with a positive slope (since it's to the right of the vertex)." / "The slope at `x = 2` (negative) is steeper/more negative than the slope at `x = 5` (positive)."
**Reasoning Comparison:**
* **Solution 1 (Correct):** Correctly identifies that both x=2 and x=5 are greater than the vertex x-coordinate (1.5). Therefore, both points lie on the right branch of the absolute value function, where the derivative is constant and positive (2). Concludes the derivatives are equal.
* **Solution 2 (Incorrect):** Makes a critical spatial error. It incorrectly states that x=2 is "to the left of the vertex." The vertex is at x=1.5, so x=2 is actually to the right. This leads to the false assumption that the function has a negative slope at x=2, resulting in an incorrect comparison and final answer.
### Key Observations
1. **Contradictory Conclusions:** The two solution attempts arrive at opposite final answers (B vs. A) for the same problem.
2. **Source of Error:** The incorrect solution (Solution 2) contains a fundamental misreading of the graph's spatial layout, misplacing x=2 relative to the function's vertex.
3. **Confidence vs. Accuracy:** Solution 2 maintains moderate confidence scores (0.396-0.676) despite containing a critical logical error, highlighting a potential pitfall in automated reasoning systems.
4. **Visual Grounding:** The graph clearly shows both marked points (x=2 and x=5) on the same upward-sloping line segment to the right of the vertex, visually supporting the correct conclusion of equal derivatives.
### Interpretation
This image is a diagnostic tool that reveals how errors in **spatial grounding** and **graph interpretation** can lead to incorrect mathematical reasoning. The core task requires connecting the algebraic definition of a derivative (the slope) to its visual representation on a graph.
* **What the Data Suggests:** The problem tests understanding that the derivative of a linear segment (like the branches of an absolute value function) is constant. The correct reasoning (Solution 1) demonstrates this by analytically determining the derivative on the relevant interval. The incorrect reasoning (Solution 2) attempts a visual slope comparison but fails at the first step of correctly identifying which branch of the function each point belongs to.
* **Relationship Between Elements:** The graph is the primary source of evidence. The question asks for a comparison that must be derived from the graph's properties. The solutions show two different pathways to interpret that evidence: one successful, one flawed.
* **Notable Anomaly:** The most significant anomaly is the **factual error in Solution 2, Step 4**, where it claims x=2 is left of the vertex. This is not a subtle miscalculation but a clear misobservation of the provided visual data, leading to a cascade of incorrect inferences. It underscores the importance of precise visual grounding in multimodal reasoning tasks. The presence of the "Ground Truth: B" label confirms the correct answer, making the image useful for evaluating and debugging reasoning models.
</details>
Figure 10: A case study of DreamPRM's step-wise evaluation.
## Appendix E Limitations & Future Work
DreamPRM currently assumes a fixed set of domains and requires Monte-Carlo sampling, which can be computationally heavy. Future work could explore instance-level reweighting, adaptive sampling strategies, and integration with retrieval-augmented generation to further cut compute while broadening coverage. We will release code, trained weights, and evaluation scripts to facilitate reproducibility and community adoption.