## Line Graph: Filtering based on Process vs. Outcome
### Overview
The graph compares the accuracy of two filtering approaches ("Process-based" and "Outcome-based") as the number of beams increases from 2⁰ to 2⁴. A dashed reference line labeled "LLM-as-a-judge" is included for comparison. The y-axis shows accuracy as a percentage, ranging from 56% to 68%.
### Components/Axes
- **Title**: "Filtering based on Process vs. Outcome"
- **X-axis**: "Number of beams" (logarithmic scale: 2⁰, 2¹, 2², 2³, 2⁴)
- **Y-axis**: "Accuracy (%)" (linear scale: 56% to 68%)
- **Legend**:
  - Orange line with stars: "Process-based (ours)"
  - Orange line with circles: "Outcome-based (GenRM)"
  - Blue dashed line: "LLM-as-a-judge"
### Detailed Analysis
- **Process-based (ours)**:
  - 2⁰: ~61%
  - 2¹: ~61%
  - 2²: ~64%
  - 2³: ~66%
  - 2⁴: ~68%
  - Trend: Steady upward slope after 2¹.
- **Outcome-based (GenRM)**:
  - 2⁰: ~58%
  - 2¹: ~58%
  - 2²: ~56%
  - 2³: ~57%
  - 2⁴: ~59%
  - Trend: Slight dip at 2², then gradual recovery.
- **LLM-as-a-judge**: Constant dashed line at ~62%.
### Key Observations
1. The Process-based method surpasses the LLM-as-a-judge baseline at 2² (64% vs. 62%) and maintains higher accuracy thereafter.
2. Outcome-based accuracy fluctuates, with a notable dip at 2² (56%) before recovering.
3. The LLM-as-a-judge line (~62%) serves as a reference threshold: Process-based sits just below it at 2⁰ and 2¹ (~61%) and above it from 2² onward, while Outcome-based stays below it at every beam count.
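
The observations above can be checked with a short sketch. It assumes the approximate values listed under Detailed Analysis, which were read off the graph and are not exact data; the variable names and the helper `first_beam_above` are hypothetical and introduced only for illustration.

```python
# Approximate accuracies (%) read off the graph; not exact measurements.
beams = [2**k for k in range(5)]        # 1, 2, 4, 8, 16
process_based = [61, 61, 64, 66, 68]    # "Process-based (ours)"
outcome_based = [58, 58, 56, 57, 59]    # "Outcome-based (GenRM)"
judge_baseline = 62                     # constant "LLM-as-a-judge" line


def first_beam_above(values, baseline):
    """Return the smallest beam count whose accuracy exceeds the baseline, or None."""
    for n, acc in zip(beams, values):
        if acc > baseline:
            return n
    return None


crossing_process = first_beam_above(process_based, judge_baseline)
crossing_outcome = first_beam_above(outcome_based, judge_baseline)

# Per-beam gap between the two filtering approaches, in percentage points.
gaps = [p - o for p, o in zip(process_based, outcome_based)]

print(crossing_process)  # 4, i.e. 2² beams
print(crossing_outcome)  # None: never exceeds the baseline
print(gaps)              # [3, 3, 8, 9, 9]
```

Under these approximate readings, the gap between the two approaches grows from about 3 points at 2⁰ to about 9 points at 2⁴.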
### Interpretation
The data suggest that **process-based filtering** (using Llama-3.2-3B-Instruct) gains accuracy as the number of beams increases, rising from ~61% to ~68% and overtaking the LLM-as-a-judge baseline (~62%) from 2² onward. The Outcome-based method (GenRM) stays within ~56% to ~59% and dips at 2², which may indicate sensitivity to beam count. Because the LLM-as-a-judge baseline is drawn as a constant line, it serves only as a fixed benchmark; against it, process-based filtering is the only approach shown that improves substantially with more beams. This is consistent with the hypothesis that process-oriented evaluation (e.g., step-by-step reasoning) may be more robust than outcome-focused evaluation on complex tasks.