Image d8b75b15fac4...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
INTEL_VERIFIED
## Line Chart: Test Accuracy on GSM8K with Scaled Inference Compute

### Overview
The chart illustrates the relationship between the number of samples (`n`) and test accuracy (Maj@n %) for three distinct methods: Independent Samples, Self-rewarding Correction (IFT), and Self-rewarding Correction (IFT + M-DPO). Accuracy improves as the number of samples increases, with the IFT + M-DPO method consistently outperforming the others.

### Components/Axes
- **X-axis**: `n: the number of samples` (values: 2, 4, 8, 16, 32, 64).
- **Y-axis**: `Maj@n (%)` (range: 78% to 92%).
- **Legend**: Located at the bottom-right corner, with three entries:
  - Green line: Independent Samples
  - Blue line: Self-rewarding Correction (IFT)
  - Red line: Self-rewarding Correction (IFT + M-DPO)

### Detailed Analysis
1. **Independent Samples (Green Line)**:
   - Starts at **77.5%** at `n=2`.
   - Gradually increases to **89.5%** at `n=64`.
   - Slope: Moderate upward trend.

2. **Self-rewarding Correction (IFT) (Blue Line)**:
   - Starts at **80.5%** at `n=2`.
   - Rises sharply to **91.8%** at `n=64`.
   - Slope: Steeper than Independent Samples, with a plateau near `n=32`.

3. **Self-rewarding Correction (IFT + M-DPO) (Red Line)**:
   - Starts at **81.5%** at `n=2`.
   - Peaks at **92.2%** at `n=64`.
   - Slope: Steepest initial rise, maintaining the highest accuracy across all `n`.

### Key Observations
- The **red line (IFT + M-DPO)** consistently outperforms the other methods, achieving the highest accuracy at every sample size.
- The **blue line (IFT)** surpasses the green line (Independent Samples) at `n=4` and maintains a lead thereafter.
- All methods show diminishing returns as `n` increases beyond 32, with accuracy gains becoming smaller.

### Interpretation
The data demonstrates that **self-rewarding correction mechanisms** (particularly when combined with M-DPO) significantly enhance test accuracy on GSM8K compared to independent sampling. The scaling of inference compute (increasing `n`) benefits all methods, but the synergistic effect of IFT and M-DPO yields the most robust performance. This suggests that integrating self-rewarding strategies with M-DPO could be critical for optimizing large language model evaluations. The plateau observed in the blue line at higher `n` implies potential saturation limits for IFT alone, whereas the red line’s sustained growth indicates greater scalability.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

d8b75b15fac4a99e0ed7a0f4

FOUND IN PAPERS

EXPERT: nemotron-free VERSION 1