## Bar Chart: CoTs without a valid label on ProcessBench
### Overview
The chart compares the percentage of "CoTs without a valid label" across four models (QwQ-32B-preview, R1-Qwen-14B, R1-Qwen-7B, R1-Qwen-1.5B) for two evaluation methods: ThinkPRM (orange) and LLM-as-a-judge (blue). The y-axis shows the percentage of total cases and the x-axis lists the models. The legend is positioned at the bottom of the chart. Because the metric counts outputs that fail to yield a valid label, a lower bar is the better result.
### Components/Axes
- **Title**: "CoTs without a valid label on ProcessBench"
- **Y-axis**: "Percentage of total (%)" (ranging from 0% to 60%)
- **X-axis**: Four model categories:
  1. QwQ-32B-preview
  2. R1-Qwen-14B
  3. R1-Qwen-7B
  4. R1-Qwen-1.5B
- **Legend**:
  - Orange: ThinkPRM
  - Blue: LLM-as-a-judge
### Detailed Analysis
- **QwQ-32B-preview**:
  - ThinkPRM: 11.5% (orange bar)
  - LLM-as-a-judge: 9.4% (blue bar)
- **R1-Qwen-14B**:
  - ThinkPRM: 2.3% (orange bar)
  - LLM-as-a-judge: 16.0% (blue bar)
- **R1-Qwen-7B**:
  - ThinkPRM: 1.2% (orange bar)
  - LLM-as-a-judge: 19.5% (blue bar)
- **R1-Qwen-1.5B**:
  - ThinkPRM: 1.9% (orange bar)
  - LLM-as-a-judge: 53.2% (blue bar)
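The per-model gap between the two methods can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the percentages listed above, and the names `invalid_pct`, `compare`, `gap_pp`, and `ratio` are hypothetical helpers, not anything defined by the chart or its source.

```python
# Percentages of CoTs without a valid label, as read from the chart.
invalid_pct = {
    "QwQ-32B-preview": {"ThinkPRM": 11.5, "LLM-as-a-judge": 9.4},
    "R1-Qwen-14B": {"ThinkPRM": 2.3, "LLM-as-a-judge": 16.0},
    "R1-Qwen-7B": {"ThinkPRM": 1.2, "LLM-as-a-judge": 19.5},
    "R1-Qwen-1.5B": {"ThinkPRM": 1.9, "LLM-as-a-judge": 53.2},
}


def compare(pcts):
    """Return, per model, the gap in percentage points and the ratio
    of the LLM-as-a-judge rate to the ThinkPRM rate."""
    rows = []
    for model, vals in pcts.items():
        think = vals["ThinkPRM"]
        judge = vals["LLM-as-a-judge"]
        rows.append({
            "model": model,
            "gap_pp": judge - think,  # positive: LLM-as-a-judge has more invalid CoTs
            "ratio": judge / think,   # above 1: LLM-as-a-judge has more invalid CoTs
        })
    return rows


summary = compare(invalid_pct)
for row in summary:
    print(f"{row['model']:<16} gap = {row['gap_pp']:+.1f} pp, ratio = {row['ratio']:.1f}x")
```

A negative gap marks a model where ThinkPRM produces more invalid CoTs than LLM-as-a-judge; with the values above this happens only for QwQ-32B-preview. For R1-Qwen-1.5B the ratio comes out to about 28.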
### Key Observations
1. **LLM-as-a-judge produces more CoTs without a valid label than ThinkPRM on three of the four models** (R1-Qwen-14B, R1-Qwen-7B, R1-Qwen-1.5B). Since a higher percentage means more unusable outputs, ThinkPRM is the better-behaved method on these models.
2. **R1-Qwen-1.5B** is a dramatic outlier: LLM-as-a-judge reaches **53.2%**, about 28x ThinkPRM's 1.9%.
3. **QwQ-32B-preview** shows the smallest gap between the two methods (11.5% vs. 9.4%) and is the only model where ThinkPRM has the higher rate.
### Interpretation
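The data suggests that **ThinkPRM yields a valid label far more reliably than LLM-as-a-judge** on the R1-Qwen models, where its invalid-label rate stays between 1.2% and 2.3%. The LLM-as-a-judge rate rises as model size falls (16.0% at 14B, 19.5% at 7B, 53.2% at 1.5B), so that more than half of the CoTs from the smallest model, R1-Qwen-1.5B, carry no valid label. This pattern is consistent with smaller models struggling to produce well-formed verification output under the LLM-as-a-judge setup, although the chart itself does not establish the cause. QwQ-32B-preview is the exception: both methods fall within about two percentage points of each other, with ThinkPRM slightly higher (11.5% vs. 9.4%). The disparity between methods highlights how strongly the evaluation strategy affects the share of usable verification outputs.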