## Composite Technical Figure: DreamPRM Performance and Dataset Examples
### Overview
The image is a composite technical figure divided into two main sections. The left section contains a bar chart quantifying the accuracy improvement of a method called "DreamPRM" compared to a baseline ("PRM w/o data selection") across five datasets. The right section consists of two vertically stacked panels, each presenting an example question from a specific dataset (AI2D and M3CoT) along with descriptive metadata and a domain weight determined by DreamPRM.
### Components/Axes
**Left Chart:**
* **Chart Type:** Grouped bar chart.
* **Y-Axis:** Label: "Accuracy Improvement (%)". Scale: 0 to 7, with major ticks at 0, 1, 2, 3, 4, 5, 6, 7.
* **X-Axis:** Lists five datasets: "WeMath", "MMVet", "MathVista", "MMStar", "MathVision".
* **Legend:** Located in the top-left corner.
    * Blue bar: "DreamPRM"
    * Yellow bar: "PRM w/o data selection"
* **Additional Annotation:** A horizontal dashed line at y=4.0, labeled "avg. = +4.0".
**Right Panels (Top and Bottom):**
* Each panel is a self-contained box with a white background and black border.
* **Top Panel (AI2D Example):**
    * **Image (Top-Left):** A black-and-white diagram of a simple aquatic food chain: Sun → Phytoplankton → Zooplankton → Small Fish → Large Fish → Bird (Eagle/Hawk).
    * **Text Block (Right of Image):**
        * **Question:** "What does the bird feed on?"
        * **Choices:** "A. zooplankton", "B. grass", "C. predator fish", "D. none of the above"
        * **Answer:** "C"
        * **Dataset:** "AI2D (2016)"
    * **Metadata Block (Below Image and Question):**
        * "Dataset difficulty: easy (InternVL-2.5-MPO-8B's accuracy 84.6%)"
        * "Unnecessary modality: can answer without image"
        * "Requirements for reasoning: do not require complicated reasoning"
        * "Domain weight: 0.55 (Determined by DreamPRM)" (This line is in blue text).
* **Bottom Panel (M3CoT Example):**
    * **Image (Top-Left):** A color photograph of a white and grey bird (likely a gull) in flight against a blue sky. Below it are four smaller thumbnail images of other animals.
    * **Text Block (Right of Image):**
        * **Question:** "Determine the scientific nomenclature of the organism shown in the primary image."
        * **Choices:** "A. Hemidactylus turcicus", "B. Felis silvestris", "C. Macropus agilis", "D. None of the above"
        * **Answer:** "D"
        * **Dataset:** "M3CoT (2024)"
    * **Metadata Block (Below Image and Question):**
        * "Dataset difficulty: hard (InternVL-2.5-MPO-8B's accuracy 62.1%)"
        * "Unnecessary modality: cannot answer without image"
        * "Requirements for reasoning: require complicated reasoning"
        * "Domain weight: 1.49 (Determined by DreamPRM)" (This line is in blue text).
### Detailed Analysis
**Left Chart - Data Points:**
The chart shows the percentage improvement in accuracy for DreamPRM (blue) versus the baseline without data selection (yellow) for each dataset.
1. **WeMath:**
    * DreamPRM (Blue): +5.7%
    * PRM w/o data selection (Yellow): +2.5%
2. **MMVet:**
    * DreamPRM (Blue): +5.5%
    * PRM w/o data selection (Yellow): +3.0%
3. **MathVista:**
    * DreamPRM (Blue): +3.5%
    * PRM w/o data selection (Yellow): +1.8%
4. **MMStar:**
    * DreamPRM (Blue): +3.4%
    * PRM w/o data selection (Yellow): +1.9%
5. **MathVision:**
    * DreamPRM (Blue): +1.7%
    * PRM w/o data selection (Yellow): +0.2%
* **Average Line:** The dashed line marks DreamPRM's average improvement across the five datasets, labeled +4.0. The five blue bar values average 3.96, which matches the label after rounding.
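As a quick consistency check on the "avg. = +4.0" annotation, the five DreamPRM bar values listed above can be averaged directly. This is a minimal sketch that assumes the values were read off the chart exactly as transcribed; the dictionary and function names are illustrative and do not come from the figure.

```python
# DreamPRM accuracy improvements (percentage points) as transcribed from the chart.
dreamprm = {
    "WeMath": 5.7,
    "MMVet": 5.5,
    "MathVista": 3.5,
    "MMStar": 3.4,
    "MathVision": 1.7,
}


def mean_improvement(values):
    """Return the arithmetic mean of the per-dataset improvements."""
    return sum(values.values()) / len(values)


avg = mean_improvement(dreamprm)
print(f"mean DreamPRM improvement: {avg:.2f}")  # 3.96, which rounds to 4.0
```

The transcribed bars sum to 19.8 over five datasets, so the mean is 3.96, in line with the dashed line's label.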
**Right Panels - Content Details:**
* **AI2D (2016) Example:** The question tests basic understanding of a food chain. The correct answer (C. predator fish) follows directly from the diagram's arrows and, according to the metadata, can even be answered without the image. DreamPRM assigns the dataset a low "Domain weight" of 0.55, consistent with its "easy" difficulty rating.
* **M3CoT (2024) Example:** The question requires identifying the scientific name of a specific bird from a photograph, a task requiring specialized knowledge and visual analysis. The correct answer is "D. None of the above": the bird appears to be a gull, while the listed names belong to a gecko (*Hemidactylus turcicus*), a wildcat (*Felis silvestris*), and a wallaby (*Macropus agilis*). DreamPRM assigns the dataset a high "Domain weight" of 1.49, consistent with its "hard" difficulty rating and the stated requirement for "complicated reasoning."
### Key Observations
1. **Consistent Superiority:** DreamPRM (blue bars) shows a higher accuracy improvement than the baseline (yellow bars) across all five datasets.
2. **Magnitude of Improvement:** The performance gap is largest on "WeMath" (3.2 percentage points) and "MMVet" (2.5 points). It narrows on the other three datasets: 1.7 points on "MathVista" and 1.5 points each on "MMStar" and "MathVision".
3. **Dataset Difficulty Spectrum:** The two example panels illustrate a clear contrast. The AI2D task is labeled "easy" with high model accuracy (84.6%) and low domain weight (0.55). The M3CoT task is labeled "hard" with lower model accuracy (62.1%) and high domain weight (1.49).
4. **Modality Relevance:** The metadata explicitly states when the image is unnecessary ("can answer without image") versus essential ("cannot answer without image") for solving the problem.
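5. **Domain Weight as a Metric:** In these two examples, the "Domain weight" determined by DreamPRM tracks the qualitative difficulty and reasoning requirements of the task: the easy dataset receives 0.55 and the hard dataset receives 1.49. Two data points are not enough to establish a general relationship.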
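The per-dataset gaps quoted in the observations above can be recomputed from the transcribed bar values. The sketch below is only an arithmetic check under the assumption that the chart values are exact to one decimal place; the names `improvement_gaps`, `dreamprm`, and `baseline` are hypothetical.

```python
# Accuracy improvements (percentage points) as transcribed from the chart.
dreamprm = {"WeMath": 5.7, "MMVet": 5.5, "MathVista": 3.5, "MMStar": 3.4, "MathVision": 1.7}
baseline = {"WeMath": 2.5, "MMVet": 3.0, "MathVista": 1.8, "MMStar": 1.9, "MathVision": 0.2}


def improvement_gaps(method, base):
    """Return method minus baseline per dataset, rounded to one decimal place."""
    return {name: round(method[name] - base[name], 1) for name in method}


gaps = improvement_gaps(dreamprm, baseline)

# Rank datasets from the largest gap to the smallest.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for name, gap in ranked:
    print(f"{name}: {gap:.1f} points")
```

Every gap is positive, which is what the "Consistent Superiority" observation describes, and WeMath and MMVet lead the ranking.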
### Interpretation
This composite figure serves a dual purpose: demonstrating the efficacy of the DreamPRM method and illustrating its analytical capabilities on diverse multimodal reasoning tasks.
* **Performance Validation:** The bar chart provides empirical evidence that DreamPRM enhances model accuracy more effectively than a baseline approach that lacks its data selection mechanism. The consistent outperformance suggests the method is robust across the five mathematical and visual reasoning benchmarks shown (WeMath, MMVet, MathVista, MMStar, MathVision).
* **Analytical Insight:** The right-hand panels *characterize* each source dataset by difficulty, modality necessity, and reasoning requirements, and pair that description with a "Domain weight." Only the domain weight is marked as determined by DreamPRM; the difficulty label is tied to InternVL-2.5-MPO-8B's accuracy. The weight seems to function as a proxy for task complexity or the degree of specialized knowledge required.
* **Underlying Principle:** The contrast between the two examples suggests DreamPRM's core function may involve intelligently selecting or weighting training data based on these characterized properties. By assigning higher "domain weight" to hard, image-dependent, reasoning-intensive tasks (like M3CoT), the system likely prioritizes learning from such challenging examples, leading to the overall accuracy improvements seen in the chart. The figure argues that effective data selection (the difference between the blue and yellow bars) is key to improving performance on complex multimodal reasoning.
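To make the weighting idea concrete, here is a minimal sketch of how per-domain weights could rescale a training loss. This is an illustrative assumption, not DreamPRM's actual procedure, which the figure does not specify. The two weights (0.55 and 1.49) come from the figure; the per-example loss values, the domain keys, and the `weighted_loss` function are hypothetical.

```python
# Domain weights shown in the figure: 0.55 for the easy dataset, 1.49 for M3CoT.
domain_weights = {"easy_dataset": 0.55, "hard_dataset": 1.49}

# Hypothetical per-example losses, made up purely for illustration.
example_losses = {"easy_dataset": [0.2, 0.4], "hard_dataset": [1.0, 0.8]}


def weighted_loss(losses_by_domain, weights):
    """Weighted mean of per-example losses, each scaled by its domain's weight."""
    total = 0.0
    weight_sum = 0.0
    for domain, losses in losses_by_domain.items():
        w = weights[domain]
        total += w * sum(losses)
        weight_sum += w * len(losses)
    return total / weight_sum


uniform = weighted_loss(example_losses, {"easy_dataset": 1.0, "hard_dataset": 1.0})
weighted = weighted_loss(example_losses, domain_weights)
print(f"uniform: {uniform:.3f}, domain-weighted: {weighted:.3f}")
```

Under this assumed scheme, examples from the higher-weight domain contribute more to the objective, so the weighted loss shifts toward the harder dataset's losses compared with uniform weighting.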