## Bar Charts: Model Accuracy on Math and Physics Problems
### Overview
The image presents two side-by-side bar charts comparing the accuracy of two models, ThinkPRM-14B (orange) and DiscPRM-14B (teal), on math and physics problems. The x-axis represents problems binned by difficulty (1 to 5), and the y-axis represents accuracy in percentage (%). The left chart focuses on Math-500 problems, while the right chart focuses on GPQA-Physics problems.
### Components/Axes
* **X-axis Label (Both Charts):** "Problems binned by difficulty"
* **Y-axis Label (Both Charts):** "Accuracy (%)"
* **Left Chart Title:** "Best-of-32: Math-500"
* **Right Chart Title:** "Best-of-32: GPQA-Physics"
* **Legend (Bottom Center):**
    * Orange: "ThinkPRM-14B"
    * Teal: "DiscPRM-14B"
* **X-axis Markers (Both Charts):** 1, 2, 3, 4, 5 (representing difficulty levels)
* **Y-axis Markers (Both Charts):** 0, 20, 40, 60, 80, 100
### Detailed Analysis
**Left Chart (Math-500):**
* **DiscPRM-14B (Teal):**
    * Difficulty 1: Approximately 95% accuracy.
    * Difficulty 2: Approximately 80% accuracy.
    * Difficulty 3: Approximately 85% accuracy.
    * Difficulty 4: Approximately 70% accuracy.
    * Difficulty 5: Approximately 45% accuracy.
    * Trend: The teal bars generally decrease in height from difficulty 1 to 5, indicating decreasing accuracy with increasing difficulty.
* **ThinkPRM-14B (Orange):**
    * Difficulty 1: Approximately 98% accuracy.
    * Difficulty 2: Approximately 90% accuracy.
    * Difficulty 3: Approximately 95% accuracy.
    * Difficulty 4: Approximately 70% accuracy.
    * Difficulty 5: Approximately 40% accuracy.
    * Trend: The orange bars also generally decrease in height from difficulty 1 to 5, mirroring the teal bars.
**Right Chart (GPQA-Physics):**
* **DiscPRM-14B (Teal):**
    * Difficulty 1: Approximately 100% accuracy.
    * Difficulty 2: Approximately 80% accuracy.
    * Difficulty 3: Approximately 60% accuracy.
    * Difficulty 4: Approximately 10% accuracy.
    * Trend: The teal bars show a significant decrease in height from difficulty 1 to 4.
* **ThinkPRM-14B (Orange):**
    * Difficulty 1: Approximately 100% accuracy.
    * Difficulty 2: Approximately 95% accuracy.
    * Difficulty 3: Approximately 70% accuracy.
    * Difficulty 4: Approximately 15% accuracy.
    * Trend: The orange bars also decrease in height from difficulty 1 to 4. The overall drop (about 85 points) is only slightly smaller than for the teal bars (about 90 points), but the orange bars stay noticeably higher at difficulties 2 and 3.
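The per-bin readings above can be compared directly by subtracting one model's value from the other's. The following is a minimal sketch, assuming the approximate bar heights listed in this section; the values are visual estimates rather than exact data, and the names `MATH_500`, `GPQA_PHYSICS`, and `per_bin_gap` are hypothetical helpers introduced only for illustration.

```python
# Approximate bar heights (percent accuracy) as read from the charts.
# These are visual estimates, not exact data.
# Keys are difficulty bins; values are (ThinkPRM-14B, DiscPRM-14B).
MATH_500 = {1: (98, 95), 2: (90, 80), 3: (95, 85), 4: (70, 70), 5: (40, 45)}
GPQA_PHYSICS = {1: (100, 100), 2: (95, 80), 3: (70, 60), 4: (15, 10)}


def per_bin_gap(chart):
    """Return {difficulty: ThinkPRM minus DiscPRM} in percentage points."""
    return {level: think - disc for level, (think, disc) in chart.items()}


math_gap = per_bin_gap(MATH_500)
physics_gap = per_bin_gap(GPQA_PHYSICS)

for name, gap in (("Math-500", math_gap), ("GPQA-Physics", physics_gap)):
    for level, delta in gap.items():
        # A positive value means ThinkPRM-14B is ahead in that bin.
        print(f"{name} difficulty {level}: {delta:+d} points")
```

A positive gap means ThinkPRM-14B is ahead in that bin. Because the inputs are rounded estimates, differences of a few points should not be over-interpreted.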
### Key Observations
* In both charts, both models perform best on the easiest problems (difficulty 1) and their performance degrades as the difficulty increases.
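* For Math-500, ThinkPRM-14B outperforms DiscPRM-14B at difficulties 1 to 3 (by roughly 3 to 10 points), ties it at difficulty 4 (about 70% each), and trails slightly at difficulty 5 (about 40% versus 45%).
* For GPQA-Physics, ThinkPRM-14B matches DiscPRM-14B at difficulty 1 and leads at difficulties 2 to 4. The largest gap is at difficulty 2 (about 95% versus 80%) and it narrows to about 5 points at difficulty 4. The performance drop for DiscPRM-14B is more dramatic on the physics problems (about 100% to 10%) than on Math-500 (about 95% to 45%).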
* The accuracy of both models on the most difficult problems (difficulty 5 for Math-500 and difficulty 4 for GPQA-Physics) is significantly lower than on easier problems.
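The size of the decline from the easiest to the hardest bin can be quantified in the same spirit. The sketch below assumes the approximate readings given in the detailed analysis; `READINGS` and `easiest_to_hardest_drop` are hypothetical names used only for this illustration.

```python
# Approximate accuracies (percent) per difficulty bin, in order of
# increasing difficulty. Visual estimates from the charts, not exact data.
READINGS = {
    ("Math-500", "ThinkPRM-14B"): [98, 90, 95, 70, 40],
    ("Math-500", "DiscPRM-14B"): [95, 80, 85, 70, 45],
    ("GPQA-Physics", "ThinkPRM-14B"): [100, 95, 70, 15],
    ("GPQA-Physics", "DiscPRM-14B"): [100, 80, 60, 10],
}


def easiest_to_hardest_drop(values):
    """Accuracy lost, in percentage points, between the first and last bin."""
    return values[0] - values[-1]


drops = {key: easiest_to_hardest_drop(vals) for key, vals in READINGS.items()}

for (benchmark, model), drop in drops.items():
    print(f"{benchmark} / {model}: drop of {drop} points")
```

On these estimates both models lose more accuracy on GPQA-Physics than on Math-500, even though the physics chart covers one fewer difficulty bin.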
### Interpretation
The data suggest that both ThinkPRM-14B and DiscPRM-14B reach high best-of-32 accuracy on easy math and physics problems, but their performance is highly sensitive to problem difficulty. Based on the approximate readings above, ThinkPRM-14B matches or exceeds DiscPRM-14B in every bin except Math-500 difficulty 5, with its clearest advantage in the middle bins (difficulties 2 and 3), particularly on GPQA-Physics. On the hardest bins the two models are within about 5 points of each other, so neither is clearly more robust there.

The substantial drop in accuracy for both models on the most difficult problems indicates a limitation in their ability to handle complex reasoning or knowledge requirements. The difference between the two models on the physics problems could be due to differences in their training data or architectures, although the charts alone cannot confirm this. The overall trend of decreasing accuracy with increasing difficulty is expected, as harder problems inherently require more sophisticated problem-solving skills.