## Line Chart: Average F1 Score vs. Estimated Annotation Cost
### Overview
This chart plots the average F1 score against the estimated annotation cost (in generation tokens) for three models: ActPRM, UniversalPRM, and Qwen2.5-Math-PRM-7B. A dashed vertical line is annotated with a relative cost of 17.3x.
### Components/Axes
* **X-axis:** Est. Annotation Cost (Gen. Tokens). Scale ranges from approximately 227 to 235. Marked values are 228, 230, 232, and 234.
* **Y-axis:** Average F1 Score. Scale ranges from approximately 0.68 to 0.76. Marked values are 0.68, 0.70, 0.72, 0.74, 0.75, and 0.76.
* **Legend:** Located in the bottom-right corner.
    * ActPRM (Red circle marker)
    * UniversalPRM (Blue star marker)
    * Qwen2.5-Math-PRM-7B (Dark blue star marker)
* **Dashed Vertical Line:** Located at approximately x = 230, indicating a cost threshold.
* **Text Labels:**
    * "0.750" at approximately (230, 0.75)
    * "4.8x Cost" above the ActPRM line near x=230
    * "17.3x Cost" above the dashed line near x=230
    * "0.743" at approximately (232, 0.743)
    * "0.735" at approximately (234, 0.735)
### Detailed Analysis
* **ActPRM (Red Line):** The line starts at approximately (228, 0.68) and increases sharply until reaching a peak at approximately (230, 0.750). After x=230, the line decreases slightly to approximately (232, 0.743) and then decreases further to approximately (234, 0.735).
* **UniversalPRM (Blue Line):** The line starts at approximately (228, 0.72) and increases to approximately (230, 0.74). After x=230, the line remains relatively flat at approximately 0.74.
* **Qwen2.5-Math-PRM-7B (Dark Blue Line):** The line starts at approximately (228, 0.70) and increases to approximately (230, 0.73). After x=230, the line remains relatively flat at approximately 0.73.
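
The readings above can be collected into a small table to check the peak and the net change for each model. The following is a minimal sketch: the coordinates are the approximate values transcribed from this description, not the underlying data, and the names `readings` and `summarize` are illustrative. UniversalPRM and Qwen2.5-Math-PRM-7B are described as flat after x=230, so only their two stated readings are included.

```python
# Approximate (x, F1) readings taken from the description above.
# These are visual estimates, not the chart's source data.
readings = {
    "ActPRM": [(228, 0.68), (230, 0.750), (232, 0.743), (234, 0.735)],
    "UniversalPRM": [(228, 0.72), (230, 0.74)],
    "Qwen2.5-Math-PRM-7B": [(228, 0.70), (230, 0.73)],
}


def summarize(points):
    """Return the peak reading and the F1 change from the first to the last reading."""
    peak_x, peak_f1 = max(points, key=lambda p: p[1])
    net_change = points[-1][1] - points[0][1]
    return {"peak_x": peak_x, "peak_f1": peak_f1, "net_change": round(net_change, 3)}


summary = {name: summarize(pts) for name, pts in readings.items()}

for name, stats in summary.items():
    print(name, stats)
```

With these estimates, ActPRM is the only series whose peak is followed by lower readings, which matches the description of a decline after x=230.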
### Key Observations
* ActPRM demonstrates the most significant improvement in F1 score with increasing annotation cost, peaking at x=230. However, its performance declines slightly after this point.
* UniversalPRM and Qwen2.5-Math-PRM-7B show a more gradual increase in F1 score with annotation cost, and their performance plateaus after x=230.
* The dashed vertical line at x=230 coincides with ActPRM's peak. Beyond it, ActPRM's F1 score does not merely level off but declines, from 0.750 to 0.735.
### Interpretation
The chart suggests that increasing the estimated annotation cost initially yields substantial F1 gains, particularly for ActPRM. Around x=230 on the cost axis, however, further annotation stops paying off, and ActPRM's F1 score drops slightly. One possible explanation is that the model has reached the best performance it can achieve with the available data, so additional annotation effort no longer improves accuracy. The comparatively stable scores of UniversalPRM and Qwen2.5-Math-PRM-7B suggest that they are either less sensitive to annotation cost or already near their performance limits.

The "4.8x Cost" and "17.3x Cost" labels likely express cost relative to a baseline, and the dashed line marks the cost level beyond which the benefit of further annotation diminishes. Read this way, the chart describes a trade-off between annotation cost and model performance that can guide how much annotation budget to allocate for training.
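
To relate the cost multipliers to positions on the axis, the following sketch converts a cost ratio into a horizontal distance. It assumes the cost axis is logarithmic in base 2, which the description does not state; under a linear axis this conversion would not apply. The function name `axis_shift` is illustrative.

```python
import math


def axis_shift(cost_ratio: float) -> float:
    """Distance on a base-2 logarithmic axis between two costs with the given ratio.

    Assumes a log2 cost axis; this is an assumption, not something the chart confirms.
    """
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    return math.log2(cost_ratio)


shift_4_8 = axis_shift(4.8)
shift_17_3 = axis_shift(17.3)

print(f"4.8x  corresponds to {shift_4_8:.2f} axis units")
print(f"17.3x corresponds to {shift_17_3:.2f} axis units")
```

Under that assumption, a 4.8x cost difference spans a little over two axis units and a 17.3x difference a little over four, which gives a rough way to check which pairs of points each label compares.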