Image 2c84a229f1a4...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
INTEL_VERIFIED
# Self-Correction Performance Comparison on MATH Dataset

## Chart Structure
- **Title**: "Self-Correction Performance Comparison on MATH Dataset"
- **X-Axis**: Model names (categorical)
- **Y-Axis**: Accuracy (%) [40% to 85%]
- **Legend**: Located in top-left corner
  - Blue: Base LLMs
  - Red: Self-Correction
  - Yellow: Self-Correction + Our Cross-DPO

## Data Series
1. **Base LLMs** (Blue)
2. **Self-Correction** (Red)
3. **Self-Correction + Our Cross-DPO** (Yellow)

## Key Data Points & Annotations
| Model                  | Base LLMs (%) | Self-Correction (%) | Self-Correction + Cross-DPO (%) | Change vs Base LLMs |
|------------------------|---------------|---------------------|----------------------------------|---------------------|
| Llama3.1-8B-Instruct   | 51.9          | 49.8                | -                                | -2.1%               |
| DeepSeek-Math-7B       | 46.8          | 43.2                | -                                | -3.6%               |
| Qwen2.5-Math-7B        | 55.1          | 55.4                | -                                | +0.3%               |
| GPT-4o                 | 76.6          | 77.8                | -                                | +1.2%               |
| Claude-3.5-Sonnet      | 71.1          | 73.4                | -                                | +2.3%               |
| Gemini-1.5-pro         | 67.7          | 69.1                | -                                | +1.4%               |
| SuperCorrect-Qwen-7B   | 70.2          | 75.4                | 75.4                             | +5.2%               |

## Spatial Grounding
- **Legend Position**: Top-left corner
- **Bar Colors**:
  - Blue bars: Base LLMs
  - Red bars: Self-Correction
  - Yellow bar: Self-Correction + Cross-DPO (only for SuperCorrect-Qwen-7B)

## Trend Verification
1. **Llama3.1-8B-Instruct**: Base LLMs (51.9%) → Self-Correction (49.8%) [↓]
2. **DeepSeek-Math-7B**: Base LLMs (46.8%) → Self-Correction (43.2%) [↓]
3. **Qwen2.5-Math-7B**: Base LLMs (55.1%) → Self-Correction (55.4%) [↑]
4. **GPT-4o**: Base LLMs (76.6%) → Self-Correction (77.8%) [↑]
5. **Claude-3.5-Sonnet**: Base LLMs (71.1%) → Self-Correction (73.4%) [↑]
6. **Gemini-1.5-pro**: Base LLMs (67.7%) → Self-Correction (69.1%) [↑]
7. **SuperCorrect-Qwen-7B**: 
   - Base LLMs (70.2%) → Self-Correction (75.4%) [↑]
   - Self-Correction + Cross-DPO (75.4%) [→]

## Critical Observations
1. **Performance Gains**: 
   - Self-Correction improves accuracy in 5/7 models
   - Cross-DPO further enhances performance in SuperCorrect-Qwen-7B (+5.2%)
2. **Largest Improvement**: SuperCorrect-Qwen-7B shows the most significant gain (+5.2%)
3. **Performance Loss**: 
   - Llama3.1-8B-Instruct and DeepSeek-Math-7B show declines (-2.1% and -3.6%)

## Language Notes
- All text is in English
- No non-English content detected
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

2c84a229f1a44a6d957ea8e2

FOUND IN PAPERS

EXPERT: nemotron-free VERSION 1