## Line Charts: Gemma Model Accuracy Comparison (ORM vs. PAV vs. Pass@N)
### Overview
The image contains three side-by-side line charts, labeled (a), (b), and (c), comparing three series (ORM, PAV (ours), and Pass @N) across three model sizes: Gemma-2B, Gemma-9B, and Gemma-27B. Each chart plots "Accuracy" on the y-axis against "N" on the x-axis, where N doubles at each tick from 2¹ to 2⁷. Each chart also carries annotations that quantify the gap between PAV (ours) and ORM.
### Components/Axes
* **Chart Layout:** Three subplots arranged horizontally.
* **Titles:** Each subplot is titled with the model name: "Gemma-2B" (left), "Gemma-9B" (center), "Gemma-27B" (right).
* **X-Axis:** Labeled "N". The ticks are evenly spaced at powers of two (effectively a log₂ scale): 2¹, 2², 2³, 2⁴, 2⁵, 2⁶, 2⁷.
* **Y-Axis:** Labeled "Accuracy". The labeled tick range varies per chart (approximate; a few of the values read off below fall slightly outside these ticks):
* (a) Gemma-2B: 0.1 to 0.4
* (b) Gemma-9B: 0.4 to 0.6
* (c) Gemma-27B: 0.4 to 0.6
* **Legend:** Positioned in the top-left corner of each subplot. It defines three data series:
* `ORM`: Blue dashed line with circular markers.
* `PAV (ours)`: Orange solid line with star markers.
* `Pass @N`: Gray dotted line with square markers.
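* **Annotations:** Each chart contains two black dashed double-headed arrows with text labels comparing PAV (ours) to ORM: a horizontal arrow labeled with a multiplier (e.g., "5 ×") and a vertical arrow labeled with a percentage (e.g., "10%").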
### Detailed Analysis
#### **Chart (a): Gemma-2B**
* **Trend Verification:**
* **ORM (Blue):** Slopes gently upward from left to right.
* **PAV (Orange):** Slopes upward more steeply than ORM.
* **Pass @N (Gray):** Slopes upward most steeply of all three lines.
* **Data Points (Approximate):**
* **ORM:** Starts at ~0.12 (N=2¹), rises to ~0.20 (N=2⁷).
* **PAV (ours):** Starts at ~0.15 (N=2¹), rises to ~0.28 (N=2⁷).
* **Pass @N:** Starts at ~0.15 (N=2¹), rises to ~0.45 (N=2⁷).
* **Annotations:**
* A horizontal double-headed arrow between the ORM and PAV lines at N=2⁴ is labeled "5 ×".
* A vertical double-headed arrow between the ORM and PAV lines at N=2⁷ is labeled "10%".
#### **Chart (b): Gemma-9B**
* **Trend Verification:**
* **ORM (Blue):** Rises to a peak around N=2⁴ or 2⁵, then slightly declines.
* **PAV (Orange):** Slopes steadily upward.
* **Pass @N (Gray):** Slopes steadily upward, maintaining the highest accuracy.
* **Data Points (Approximate):**
* **ORM:** Starts at ~0.38 (N=2¹), peaks at ~0.46 (N=2⁴/2⁵), ends at ~0.45 (N=2⁷).
* **PAV (ours):** Starts at ~0.40 (N=2¹), rises to ~0.54 (N=2⁷).
* **Pass @N:** Starts at ~0.40 (N=2¹), rises to ~0.65 (N=2⁷).
* **Annotations:**
* A horizontal double-headed arrow between the ORM and PAV lines at N=2³ is labeled "2 ×".
* A vertical double-headed arrow between the ORM and PAV lines at N=2⁷ is labeled "10%".
#### **Chart (c): Gemma-27B**
* **Trend Verification:**
* **ORM (Blue):** Rises to a peak around N=2⁴, then declines more noticeably.
* **PAV (Orange):** Slopes upward, peaking around N=2⁶ before a slight dip.
* **Pass @N (Gray):** Slopes steadily upward.
* **Data Points (Approximate):**
* **ORM:** Starts at ~0.42 (N=2¹), peaks at ~0.52 (N=2⁴), drops to ~0.50 (N=2⁷).
* **PAV (ours):** Starts at ~0.45 (N=2¹), peaks at ~0.58 (N=2⁶), ends at ~0.57 (N=2⁷).
* **Pass @N:** Starts at ~0.45 (N=2¹), rises to ~0.68 (N=2⁷).
* **Annotations:**
* A horizontal double-headed arrow between the ORM and PAV lines at N=2³ is labeled "1.5 ×".
* A vertical double-headed arrow between the ORM and PAV lines at N=2⁶ is labeled "8%".
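
As a rough sanity check on the vertical annotations, the approximate endpoint readings listed above can be put into a small Python table and compared. This is only a sketch: the dictionary layout and function names are illustrative choices, and the numbers are the eyeballed readings from this description, not exact data from the figure.

```python
# Approximate (N=2^1, N=2^7) accuracy readings copied from the lists above.
# These are visual estimates, not exact values from the underlying data.
READINGS = {
    "Gemma-2B":  {"ORM": (0.12, 0.20), "PAV": (0.15, 0.28), "Pass@N": (0.15, 0.45)},
    "Gemma-9B":  {"ORM": (0.38, 0.45), "PAV": (0.40, 0.54), "Pass@N": (0.40, 0.65)},
    "Gemma-27B": {"ORM": (0.42, 0.50), "PAV": (0.45, 0.57), "Pass@N": (0.45, 0.68)},
}


def final_gap_points(model: str) -> float:
    """PAV minus ORM accuracy at N=2^7, in percentage points."""
    curves = READINGS[model]
    return round((curves["PAV"][1] - curves["ORM"][1]) * 100, 1)


def final_accuracy_ratio(model: str) -> float:
    """PAV accuracy divided by ORM accuracy at N=2^7."""
    curves = READINGS[model]
    return round(curves["PAV"][1] / curves["ORM"][1], 2)


for model in READINGS:
    print(model, final_gap_points(model), final_accuracy_ratio(model))
```

On these readings the gaps at N=2⁷ come to roughly 8, 9, and 7 percentage points, against annotations of "10%", "10%", and "8%". Differences of this size are within the error of reading values off a chart, and the Gemma-27B annotation sits at N=2⁶ rather than 2⁷.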
### Key Observations
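1. **Consistent Hierarchy:** In all three charts, `Pass @N` is at or above the other two series at every N, `PAV (ours)` comes next, and `ORM` is lowest. At N=2¹, PAV and Pass @N start at about the same accuracy; the gap to Pass @N widens as N grows.
2. **Model Size Impact:** Accuracy rises with model size (2B → 9B → 27B) for every series. The jump is large from 2B to 9B (PAV at N=2⁷ goes from ~0.28 to ~0.54) and smaller from 9B to 27B (~0.54 to ~0.57). The y-axis range shifts accordingly, from 0.1-0.4 for Gemma-2B to 0.4-0.6 for both larger models.
3. **Diminishing Relative Gain:** The annotated multiplier for PAV over ORM shrinks as model size grows: "5 ×" for Gemma-2B, "2 ×" for Gemma-9B, and "1.5 ×" for Gemma-27B. Because these labels sit on horizontal arrows, they most plausibly measure a gap along the N-axis (how many fewer samples PAV needs to match ORM's accuracy) rather than a ratio of accuracies; the endpoint readings above give accuracy ratios of only about 1.1 to 1.4. Under either reading, PAV's relative advantage over ORM narrows with larger models.
4. **ORM Performance Plateau/Decline:** For the larger models (Gemma-9B and Gemma-27B), ORM's accuracy peaks around N=2⁴-2⁵ and then declines. PAV keeps rising through 2⁷ on Gemma-9B and through about 2⁶ on Gemma-27B (followed by a slight dip), and Pass @N rises throughout.
5. **Annotation Placement:** The multiplier annotations (5×, 2×, 1.5×) are placed at lower N values (2³-2⁴), while the percentage annotations (10%, 10%, 8%) are placed near the right edge (2⁷ for Gemma-2B and Gemma-9B, 2⁶ for Gemma-27B), so the two kinds of label highlight different aspects of the comparison.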
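
The multiplier annotations sit on horizontal arrows, so one plausible reading is that they measure a gap along the N-axis: how many times fewer samples PAV needs to reach ORM's best accuracy. The sketch below computes such a multiplier from two curves sampled on the same grid of N values. It rests on assumptions not confirmed by the chart itself: the interpretation of the arrows, linear interpolation in log₂ N between ticks, and the function names, which are hypothetical. The example curves are made up for illustration and are not readings from the figure.

```python
import math


def samples_to_reach(ns, accs, target):
    """Smallest N at which a curve reaches `target` accuracy.

    Interpolates linearly in log2(N) between adjacent grid points.
    Returns None if the curve never reaches the target.
    """
    for i, (n, acc) in enumerate(zip(ns, accs)):
        if acc >= target:
            if i == 0:
                return float(n)
            prev_n, prev_acc = ns[i - 1], accs[i - 1]
            # prev_acc < target <= acc here, so the denominator is positive.
            frac = (target - prev_acc) / (acc - prev_acc)
            log_n = math.log2(prev_n) + frac * (math.log2(n) - math.log2(prev_n))
            return 2.0 ** log_n
    return None


def efficiency_multiplier(ns, baseline, method):
    """How many times fewer samples `method` needs to match the baseline's best accuracy."""
    target = max(baseline)
    n_baseline = samples_to_reach(ns, baseline, target)
    n_method = samples_to_reach(ns, method, target)
    if n_method is None:
        return None
    return n_baseline / n_method


# Hypothetical curves for illustration only (not taken from the charts).
ns = [2, 4, 8, 16, 32, 64, 128]
baseline = [0.10, 0.12, 0.14, 0.16, 0.18, 0.19, 0.20]
method = [0.12, 0.16, 0.20, 0.23, 0.25, 0.27, 0.28]
print(efficiency_multiplier(ns, baseline, method))
```

Applying this to the real figure would require the per-tick accuracy values, which this description only gives at the endpoints and peaks.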
### Interpretation
The data demonstrates the comparative effectiveness of the proposed `PAV (ours)` method against the `ORM` baseline across different scales of the Gemma model. The key finding is that **PAV provides a consistent accuracy improvement over ORM, but the magnitude of this relative advantage diminishes as the base model becomes larger and more capable.**
The `Pass @N` series serves as an upper-bound benchmark: it matches or exceeds both other series at every N. PAV's curve lies between ORM and Pass @N (coinciding with Pass @N only at N=2¹), which suggests it closes part of the gap to that bound. The plateau and decline of ORM at higher N for the larger models points to a limitation or instability of that method as N grows. PAV appears to mitigate this: its curve keeps rising on Gemma-9B and shows only a slight dip at N=2⁷ on Gemma-27B.
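
To see why Pass @N acts as an upper bound, the following minimal sketch assumes the usual definition (a problem counts as solved if any of its N samples is correct) and assumes a verifier is scored by submitting its single highest-scoring sample per problem. The chart description does not spell out how ORM or PAV actually select answers, so this is an illustration of the bound and not the authors' procedure; the function names and toy data are hypothetical.

```python
def pass_at_n(correct):
    """Fraction of problems where at least one of the N samples is correct."""
    return sum(any(row) for row in correct) / len(correct)


def best_of_n(correct, scores):
    """Accuracy when the highest-scoring sample is submitted for each problem."""
    hits = 0
    for row, row_scores in zip(correct, scores):
        best = max(range(len(row)), key=row_scores.__getitem__)
        hits += row[best]
    return hits / len(correct)


# Toy data: 4 problems, 3 samples each (made up for illustration).
correct = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
scores = [
    [0.2, 0.9, 0.1],  # verifier picks the correct sample
    [0.5, 0.4, 0.3],  # no correct sample exists
    [0.3, 0.8, 0.1],  # verifier picks a wrong sample
    [0.1, 0.2, 0.7],  # verifier picks the correct sample
]
print(pass_at_n(correct), best_of_n(correct, scores))
```

Whatever scores the verifier assigns, it can only pick a correct sample when one exists, so under these assumptions its accuracy cannot exceed Pass @N.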
The annotations tell a story of **scaling efficiency**: for the smallest model (2B) the annotated multiplier is 5×, while for the largest (27B) it is a more modest 1.5×. This suggests that methods like PAV may matter most for smaller, more constrained models, while larger models get much of their accuracy from scale alone. PAV still provides a meaningful absolute gain at high N in every chart (annotated as 8% to 10%). Taken together, the charts argue for the value of PAV, especially in resource-constrained settings with smaller models.