## Chart: GenPRM Performance as Verifier and Critic
### Overview
This image presents two distinct charts, (a) and (b), illustrating the performance of GenPRM in two roles: as a verifier and as a critic. Chart (a) is a dual Y-axis bar chart comparing "Best-of-32 Accuracy" and "ProcessBench F1 Score" for various models, including different configurations of GenPRM, against several baselines. Chart (b) is a line chart showing the "Accuracy (%)" of GenPRM-7B and two baselines ("DeepSeek-R1-Distill-7B", "Self-Refine") over successive refinement turns ("# Refinement Turn"), demonstrating GenPRM's capability as a critic.
### Components/Axes
#### Chart (a): GenPRM as a Verifier (Best-of-N & ProcessBench)
This chart is a bar chart with two independent Y-axes.
* **Title:** (a) GenPRM as a Verifier (Best-of-N & ProcessBench)
* **Left Y-axis (Primary):** "Best-of-32 Accuracy (%)". The scale ranges from 45 to 69, with major ticks at 45, 49, 53, 57, 61, 65, 69.
* **Right Y-axis (Secondary):** "ProcessBench F1 Score (%)". The scale ranges from 30 to 90, with major ticks at 30, 40, 50, 60, 70, 80, 90.
* **X-axis:** Represents different models or configurations. The labels are rotated for readability and are, from left to right:
* Skywork-PRM-1.5B
* Skywork-PRM-7B
* Qwen2.5-Math-7B-PRM800K
* Qwen2.5-Math-PRM-7B
* Qwen2.5-Math-PRM-72B
* Direct GenPRM-7B
* GenPRM-7B (Pass@1)
* GenPRM-7B (Maj@8)
* **Legend (located in the top-left corner of chart (a)):**
* "Best-of-32": Represented by teal/green colored bars.
* "ProcessBench": Represented by orange colored bars.
* **Horizontal Reference Lines:**
* A solid green line at 67.6% on the left Y-axis, labeled "Pass@32 (67.6)".
* A dashed orange line at 61.9% on the left Y-axis, labeled "GPT-4o (61.9)".
* A dashed light blue line at 54.1% on the left Y-axis, labeled "Maj@32 (54.1)".
#### Chart (b): GenPRM as a Critic
This chart is a line chart with markers.
* **Title:** (b) GenPRM as a Critic
* **Y-axis:** "Accuracy (%)". The scale ranges from 46 to 52, with major ticks at 46, 47, 48, 49, 50, 51, 52.
* **X-axis:** "# Refinement Turn". The scale ranges from 0 to 3, with major ticks at 0, 1, 2, 3.
* **Legend (located in the top-left corner of chart (b)):**
* "GenPRM-7B": Represented by a teal/green line with star markers.
* "DeepSeek-R1-Distill-7B": Represented by an orange dashed line with circular markers.
* "Self-Refine": Represented by a grey dashed line with circular markers.
* **Horizontal Reference Line:** A black dashed line near the bottom of the chart, labeled "Pass@1", indicating a baseline accuracy of approximately 45.5%.
* **Annotation:** A vertical dashed arrow is positioned at Refinement Turn 3, extending from the "DeepSeek-R1-Distill-7B" line to the "GenPRM-7B" line. It is labeled "3.4x".
### Detailed Analysis
#### Chart (a): GenPRM as a Verifier
This chart displays two performance metrics for eight different models/configurations. The last two configurations, "GenPRM-7B (Pass@1)" and "GenPRM-7B (Maj@8)", are visually distinguished by a black outline around their bars.
**Best-of-32 Accuracy (teal bars):**
* **Skywork-PRM-1.5B:** 52.5%
* **Skywork-PRM-7B:** 54.1%
* **Qwen2.5-Math-7B-PRM800K:** 53.1%
* **Qwen2.5-Math-PRM-7B:** 53.8%
* **Qwen2.5-Math-PRM-72B:** 56.2%
* **Direct GenPRM-7B:** 52.2%
* **GenPRM-7B (Pass@1):** 55.9%
* **GenPRM-7B (Maj@8):** 57.1%
* **Trend:** Best-of-32 Accuracy generally increases across the models, with some fluctuations: it starts at 52.5%, reaches a local peak of 56.2% for Qwen2.5-Math-PRM-72B, dips for Direct GenPRM-7B, and rises to its highest value of 57.1% for GenPRM-7B (Maj@8). Every model falls below the "GPT-4o (61.9)" and "Pass@32 (67.6)" reference lines; only Qwen2.5-Math-PRM-72B, GenPRM-7B (Pass@1), and GenPRM-7B (Maj@8) clearly exceed the "Maj@32 (54.1)" line, with Skywork-PRM-7B sitting on it at exactly 54.1%.
**ProcessBench F1 Score (orange bars):**
* **Skywork-PRM-1.5B:** 36.4%
* **Skywork-PRM-7B:** 42.1%
* **Qwen2.5-Math-7B-PRM800K:** 56.5%
* **Qwen2.5-Math-PRM-7B:** 73.5%
* **Qwen2.5-Math-PRM-72B:** 78.3%
* **Direct GenPRM-7B:** 60.0%
* **GenPRM-7B (Pass@1):** 75.2%
* **GenPRM-7B (Maj@8):** 80.5%
* **Trend:** The ProcessBench F1 Score shows a strong, generally increasing trend across the models. It starts at 36.4% and rises significantly, reaching its peak at 80.5% for GenPRM-7B (Maj@8). There is a notable dip for Direct GenPRM-7B (60.0%) compared to the preceding Qwen2.5-Math models.
#### Chart (b): GenPRM as a Critic
This chart illustrates how accuracy changes with the number of refinement turns for three different models.
**GenPRM-7B (teal line with star markers):**
* **Trend:** Shows a strong, consistent increase in accuracy with each refinement turn.
* **Turn 0:** Approximately 45.5% (just above the Pass@1 line).
* **Turn 1:** Approximately 49.5%.
* **Turn 2:** Approximately 50.8%.
* **Turn 3:** Approximately 51.8%.
**DeepSeek-R1-Distill-7B (orange dashed line with circle markers):**
* **Trend:** Shows an initial increase in accuracy, then largely flattens out.
* **Turn 0:** Approximately 45.5% (just above the Pass@1 line).
* **Turn 1:** Approximately 47.0%.
* **Turn 2:** Approximately 47.3%.
* **Turn 3:** Approximately 47.3%.
**Self-Refine (grey dashed line with circle markers):**
* **Trend:** Shows a very slight initial increase, then largely flattens or slightly decreases.
* **Turn 0:** Approximately 45.5% (just above the Pass@1 line).
* **Turn 1:** Approximately 46.0%.
* **Turn 2:** Approximately 45.8%.
* **Turn 3:** Approximately 46.0%.
**Annotation "3.4x":** Positioned at Refinement Turn 3, this annotation indicates that GenPRM-7B's accuracy gain over the Pass@1 baseline is roughly 3.4 times that of DeepSeek-R1-Distill-7B at the same turn. Using the approximate values read from the chart: (51.8 − 45.5) / (47.3 − 45.5) = 6.3 / 1.8 ≈ 3.5, consistent with the labeled "3.4x" given the precision of chart reading.
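The ratio behind the annotation can be checked with a few lines of arithmetic. The accuracy values below are approximate readings from the chart, not exact figures, so the result only approximates the labeled "3.4x":

```python
# Approximate values read off chart (b); all in percent.
pass_at_1 = 45.5       # Pass@1 baseline accuracy
genprm_turn3 = 51.8    # GenPRM-7B accuracy at refinement turn 3
deepseek_turn3 = 47.3  # DeepSeek-R1-Distill-7B accuracy at turn 3

# Ratio of accuracy gains over the shared baseline.
ratio = (genprm_turn3 - pass_at_1) / (deepseek_turn3 - pass_at_1)
print(f"improvement ratio ≈ {ratio:.1f}x")  # ≈ 3.5x with these estimates
```

With exact (unrounded) chart values the ratio would land closer to the labeled 3.4x.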
### Key Observations
* **Chart (a) - Verifier Performance:**
* GenPRM-7B (Maj@8) achieves the highest ProcessBench F1 Score (80.5%) and Best-of-32 Accuracy (57.1%) among all tested models, indicating strong performance as a verifier.
* The ProcessBench F1 Score generally shows a more pronounced improvement across models compared to Best-of-32 Accuracy.
    * The "Qwen2.5-Math-PRM-72B" model is also competitive, with a ProcessBench F1 Score of 78.3% and a Best-of-32 Accuracy of 56.2%.
    * Direct GenPRM-7B scores lower than the Qwen2.5-Math-PRM-7B and -72B variants on both metrics, suggesting that the "Pass@1" and "Maj@8" strategies significantly boost GenPRM-7B's verification capabilities.
* All models fall significantly short of the "Pass@32 (67.6)" and "GPT-4o (61.9)" accuracy benchmarks for Best-of-32.
* **Chart (b) - Critic Performance:**
* GenPRM-7B demonstrates superior performance as a critic, showing a continuous and substantial increase in accuracy with each refinement turn, reaching nearly 52% at Turn 3.
* In contrast, DeepSeek-R1-Distill-7B and Self-Refine show limited improvement after the first refinement turn, with their accuracy largely plateauing around 47.3% and 46.0% respectively.
* The "3.4x" annotation highlights the significant advantage of GenPRM-7B in leveraging multiple refinement turns to improve accuracy compared to DeepSeek-R1-Distill-7B.
### Interpretation
The data strongly suggests that GenPRM, particularly the 7B variant, is a highly effective model both as a verifier and as a critic in the context of the evaluated tasks.
As a **verifier** (Chart a), GenPRM-7B, especially when employing strategies like "Maj@8" (Majority voting over 8 samples), achieves the highest F1 scores on ProcessBench, indicating its strong ability to correctly identify and validate solutions. While its "Best-of-32 Accuracy" is also the highest among the tested models, it still lags behind the "GPT-4o" and "Pass@32" benchmarks, suggesting there's room for improvement in achieving very high accuracy on the Best-of-32 metric. The significant difference between "Direct GenPRM-7B" and its "Pass@1" and "Maj@8" variants underscores the importance of these verification strategies in boosting performance.
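The two selection schemes in chart (a) can be sketched in a few lines. This is an illustrative toy, not the paper's actual pipeline: `best_of_n` assumes a verifier that assigns each candidate solution a scalar score, and `majority_vote` assumes k independent verifier verdicts for one candidate (the Maj@8 idea); all names here are hypothetical.

```python
from collections import Counter

def best_of_n(solutions, score):
    """Best-of-N: return the candidate the verifier scores highest."""
    return max(solutions, key=score)

def majority_vote(verdicts):
    """Maj@k: aggregate k independent verifier verdicts by majority."""
    return Counter(verdicts).most_common(1)[0][0]

# Toy usage: three candidate answers scored by a stand-in verifier.
solutions = ["answer_a", "answer_b", "answer_c"]
toy_scores = {"answer_a": 0.2, "answer_b": 0.9, "answer_c": 0.5}
best = best_of_n(solutions, toy_scores.get)   # picks "answer_b"
verdict = majority_vote([True, True, False])  # majority says True
```

The chart's "Best-of-32" corresponds to 32 candidate solutions, and "Maj@8" to majority voting over 8 verifier samples per candidate.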
As a **critic** (Chart b), GenPRM-7B demonstrates a unique capability to iteratively improve performance through multiple refinement turns. Its accuracy consistently climbs, while other models like DeepSeek-R1-Distill-7B and Self-Refine quickly hit a ceiling. The "3.4x" annotation quantifies this advantage, showing that GenPRM-7B's ability to refine solutions leads to a much greater gain in accuracy compared to DeepSeek-R1-Distill-7B. This implies that GenPRM-7B is not just good at evaluating a single solution, but can effectively guide an iterative improvement process, making it a powerful tool for tasks requiring refinement.
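The refinement process in chart (b) can be summarized as a simple loop: the critic reviews the current solution each turn and, if it finds a flaw, the solver revises. The sketch below is a minimal abstraction under assumed interfaces; `solve`, `critic`, and `revise` are hypothetical stand-ins for the actual models.

```python
def refine(problem, solve, critic, revise, max_turns=3):
    """Critic-guided refinement: revise until the critic is satisfied
    or the turn budget (as in chart (b), up to 3 turns) runs out."""
    solution = solve(problem)  # turn 0: initial attempt
    for _ in range(max_turns):
        feedback = critic(problem, solution)
        if feedback is None:   # critic finds no error: stop early
            break
        solution = revise(problem, solution, feedback)
    return solution

# Toy run: "solving" means nudging a number up until the critic approves.
result = refine(
    0,
    solve=lambda p: p,
    critic=lambda p, s: "too small" if s < 3 else None,
    revise=lambda p, s, fb: s + 1,
)
```

A critic that keeps producing useful feedback (like GenPRM-7B in the chart) lets this loop climb for multiple turns, while a critic that stops finding errors, or gives unhelpful feedback, causes the plateau seen for the baselines.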
In summary, GenPRM-7B stands out for its robust performance in both verification and critical evaluation roles, with its iterative refinement capability being a particularly strong differentiator against other models.