## Bar Charts: Evaluation on Verification and Correction (Base Model: Qwen2.5-Math-7B)

### Overview
The image contains two side-by-side bar charts comparing performance metrics for three model configurations:  
1. **SFT** (Supervised Fine-Tuning)  
2. **SFT + Process-level RL** (Reinforcement Learning)  
3. **SFT + Outcome-level RL**  

Metrics are split into **Self-verification** (left chart) and **Self-correction** (right chart). All values are percentages.

---

### Components/Axes
#### Left Chart (Self-verification Metrics)
- **X-axis**:  
  - Verification Accuracy  
  - Error Recall  
  - Correct Precision  
- **Y-axis**: Value (%) from 0% to 90%  
- **Legend**:  
  - Gray: SFT  
  - Teal: SFT + Process-level RL  
  - Orange: SFT + Outcome-level RL  

#### Right Chart (Self-correction Metrics)
- **X-axis**:  
  - Incorrect to Correct  
  - Correct to Incorrect  
- **Y-axis**: Value (%) from 0% to 14%  
- **Legend**: Same color coding as left chart  

---

### Detailed Analysis
#### Self-verification Metrics (Left Chart)
1. **Verification Accuracy**  
   - SFT: 61.58%  
   - SFT + Process-level RL: 74.61%  
   - SFT + Outcome-level RL: 66.49%  

2. **Error Recall**  
   - SFT: 66.83%  
   - SFT + Process-level RL: 64.75%  
   - SFT + Outcome-level RL: 70.11%  

3. **Correct Precision**  
   - SFT: 84.94%  
   - SFT + Process-level RL: 90.28%  
   - SFT + Outcome-level RL: 87.85%  

#### Self-correction Metrics (Right Chart)
1. **Incorrect to Correct**  
   - SFT: 6.52%  
   - SFT + Process-level RL: 12.22%  
   - SFT + Outcome-level RL: 13.64%  

2. **Correct to Incorrect**  
   - SFT: 1.96%  
   - SFT + Process-level RL: 1.46%  
   - SFT + Outcome-level RL: 0.97%  

---

### Key Observations
1. **Self-verification**:  
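   - **SFT + Process-level RL** outperforms SFT in **Verification Accuracy** (+13.03 percentage points, the largest gain) and **Correct Precision** (+5.34 points), but falls below SFT in **Error Recall** (-2.08 points).  
   - **SFT + Outcome-level RL** improves on SFT in all three metrics (+4.91, +3.28, and +2.91 points respectively), though its Verification Accuracy and Correct Precision remain below those of Process-level RL.  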

2. **Self-correction**:  
   - **SFT + Outcome-level RL** achieves the highest **Incorrect to Correct** rate (+7.12 percentage points over SFT) and the lowest **Correct to Incorrect** rate (0.99 points below SFT).  
   - **SFT + Process-level RL** improves **Incorrect to Correct** by 5.70 points over SFT but trails Outcome-level RL by 1.42 points.  
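
The gains and losses quoted in these observations are plain differences in percentage points against the SFT baseline. A minimal sketch of that arithmetic follows, assuming the values transcribed from the charts are accurate; the names `RESULTS` and `delta_vs_baseline` are illustrative and do not come from the source.

```python
# Values transcribed from the two bar charts (all in percent).
RESULTS = {
    "Verification Accuracy": {
        "SFT": 61.58,
        "SFT + Process-level RL": 74.61,
        "SFT + Outcome-level RL": 66.49,
    },
    "Error Recall": {
        "SFT": 66.83,
        "SFT + Process-level RL": 64.75,
        "SFT + Outcome-level RL": 70.11,
    },
    "Correct Precision": {
        "SFT": 84.94,
        "SFT + Process-level RL": 90.28,
        "SFT + Outcome-level RL": 87.85,
    },
    "Incorrect to Correct": {
        "SFT": 6.52,
        "SFT + Process-level RL": 12.22,
        "SFT + Outcome-level RL": 13.64,
    },
    "Correct to Incorrect": {
        "SFT": 1.96,
        "SFT + Process-level RL": 1.46,
        "SFT + Outcome-level RL": 0.97,
    },
}


def delta_vs_baseline(results, baseline="SFT"):
    """Return percentage-point differences of each configuration vs. the baseline."""
    deltas = {}
    for metric, values in results.items():
        base = values[baseline]
        deltas[metric] = {
            name: round(value - base, 2)
            for name, value in values.items()
            if name != baseline
        }
    return deltas


if __name__ == "__main__":
    for metric, row in delta_vs_baseline(RESULTS).items():
        for name, diff in row.items():
            # Signed output makes regressions (negative deltas) easy to spot.
            print(f"{metric:<24} {name:<24} {diff:+.2f} pts")
```

Reporting differences in percentage points rather than relative percent keeps the comparison unambiguous, since every metric here is itself a percentage.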

---

### Interpretation
1. **Process-level RL** enhances **verification robustness**, particularly in **Correct Precision** (90.28%), suggesting it improves the model's ability to identify valid solutions.  
2. **Outcome-level RL** excels in **correction efficiency**, reducing errors (Correct to Incorrect drops to 0.97%) while maximizing successful corrections (Incorrect to Correct: 13.64%).  
3. **Trade-offs**:  
   - Process-level RL slightly reduces Error Recall (64.75% vs. SFT's 66.83%), possibly due to stricter validation.  
   - Outcome-level RL trails Process-level RL in verification accuracy (66.49% vs. 74.61%), although it still exceeds SFT (61.58%), and it delivers the strongest correction performance.  

The data suggest that **Process-level RL** is the stronger choice when verification accuracy matters most, while **Outcome-level RL** is better suited to error correction. Combining both approaches might balance these trade-offs, though the charts do not test such a combination.
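
To check which configuration leads on each metric, the short sketch below picks the best value per metric from the chart data. It assumes that a lower **Correct to Incorrect** rate is better (fewer correct answers turned wrong) and that higher is better everywhere else; the names `RESULTS`, `LOWER_IS_BETTER`, and `best_config` are hypothetical helpers, not part of the source.

```python
# Values transcribed from the two bar charts (all in percent).
RESULTS = {
    "Verification Accuracy": {"SFT": 61.58, "Process-level RL": 74.61, "Outcome-level RL": 66.49},
    "Error Recall": {"SFT": 66.83, "Process-level RL": 64.75, "Outcome-level RL": 70.11},
    "Correct Precision": {"SFT": 84.94, "Process-level RL": 90.28, "Outcome-level RL": 87.85},
    "Incorrect to Correct": {"SFT": 6.52, "Process-level RL": 12.22, "Outcome-level RL": 13.64},
    "Correct to Incorrect": {"SFT": 1.96, "Process-level RL": 1.46, "Outcome-level RL": 0.97},
}

# Assumption: only this metric counts a smaller value as an improvement.
LOWER_IS_BETTER = {"Correct to Incorrect"}


def best_config(results):
    """Map each metric to the configuration with the most favorable value."""
    best = {}
    for metric, values in results.items():
        pick = min if metric in LOWER_IS_BETTER else max
        best[metric] = pick(values, key=values.get)
    return best


if __name__ == "__main__":
    for metric, name in best_config(RESULTS).items():
        print(f"{metric}: {name} ({RESULTS[metric][name]:.2f}%)")
```

Under these assumptions, Process-level RL leads on Verification Accuracy and Correct Precision, while Outcome-level RL leads on Error Recall and both self-correction metrics.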