## Heatmap: Correlation matrix of mean accuracy across datasets

### Overview
A 3x3 correlation matrix visualizing relationships between three datasets: scli5, gsm8k_sc, and prm800k_sc. Values range from -1 to 1, with darker red indicating stronger positive correlation.

### Components/Axes
- **Rows/Columns**:
  - Top row: scli5
  - Middle row: gsm8k_sc
  - Bottom row: prm800k_sc
- **Color Scale**:
  - Blue (-1.0) to Red (+1.0)
  - White (0.0) as midpoint
- **Values**:
  - Diagonal: All 1.0 (perfect self-correlation)
  - Off-diagonal:
    - scli5-gsm8k_sc: 0.72
    - scli5-prm800k_sc: 0.49
    - gsm8k_sc-prm800k_sc: 0.56

### Detailed Analysis
- **scli5**:
  - Strongest correlation with gsm8k_sc (0.72)
  - Moderate correlation with prm800k_sc (0.49)
- **gsm8k_sc**:
  - Moderate correlation with prm800k_sc (0.56)
- **prm800k_sc**:
  - Weakest correlations overall (0.49 with scli5, 0.56 with gsm8k_sc)

### Key Observations
- All datasets show positive correlations
- scli5 and gsm8k_sc share the strongest relationship
- prm800k_sc demonstrates weaker but still positive relationships

### Interpretation
The matrix shows that mean accuracy on scli5 and gsm8k_sc is the most closely related (0.72). prm800k_sc correlates less with the other two datasets (0.49 and 0.56), suggesting that it captures somewhat different characteristics or performance patterns.
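As a minimal sketch, the reported off-diagonal values can be assembled into a symmetric matrix and ranked to confirm which pairs are strongest and weakest. The helper name `build_matrix` and the dictionary layout are illustrative assumptions; only the three correlation values come from the heatmap.

```python
# Off-diagonal values as reported in the heatmap description above.
reported = {
    ("scli5", "gsm8k_sc"): 0.72,
    ("scli5", "prm800k_sc"): 0.49,
    ("gsm8k_sc", "prm800k_sc"): 0.56,
}
names = ["scli5", "gsm8k_sc", "prm800k_sc"]


def build_matrix(names, pairs):
    """Build a symmetric correlation matrix with 1.0 on the diagonal.

    Hypothetical helper, written only for this illustration.
    """
    index = {name: i for i, name in enumerate(names)}
    size = len(names)
    matrix = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for (a, b), r in pairs.items():
        matrix[index[a]][index[b]] = r
        matrix[index[b]][index[a]] = r
    return matrix


matrix = build_matrix(names, reported)

# Rank the pairs from strongest to weakest correlation.
ranked = sorted(reported.items(), key=lambda item: item[1], reverse=True)
strongest_pair = ranked[0][0]
weakest_pair = ranked[-1][0]

for (a, b), r in ranked:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Ranking the pairs this way makes it explicit that the weakest relationship in the matrix is scli5 vs prm800k_sc, not gsm8k_sc vs prm800k_sc.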

---

## Scatter Plot: SCLI5 vs GSM8K-SC (r = 0.724)

### Overview
Scatter plot comparing SCLI5 and GSM8K-SC macro averages with fitted and ideal trend lines. Points labeled with model names.

### Components/Axes
- **X-axis**: SCLI5 macro average (0.0-1.0)
- **Y-axis**: GSM8K-SC macro average (0.0-1.0)
- **Legend**:
  - Dashed red: Fitted line
  - Dotted gray: Ideal line (y=x)

### Detailed Analysis
- **Trend**:
  - Fitted line (r=0.724) shows strong positive correlation
  - Points generally cluster near the ideal line
- **Data Points**:
  - **Bottom-left cluster** (0.0-0.2 SCLI5, 0.0-0.2 GSM8K):
    - Mistral-Small-24B-Instruct-v1.0
    - Qwen3-32B
    - Qwen3-30B-A3B
  - **Middle cluster** (0.3-0.6 SCLI5, 0.2-0.4 GSM8K):
    - Qwen2.5-7B-Instruct
    - Qwen2.5-7B-Instruct-i4
    - Qwen3-14B
  - **Top-right cluster** (0.7-1.0 SCLI5, 0.4-0.8 GSM8K):
    - Qwen2.5-72B-Instruct
    - DeepSeek-V3-0324
    - Llama-3-70B-Instruct-v1.0

### Key Observations
- High-performing models (top-right) show strong alignment between SCLI5 and GSM8K-SC
- Lower-performing models cluster in the bottom-left
- Fitted line closely follows the ideal line (y=x), so a model's two macro averages tend to be similar in magnitude

### Interpretation
The strong correlation (r=0.724) suggests that performance on SCLI5 is a useful, though imperfect, predictor of performance on GSM8K-SC. The clustering of models indicates distinct performance tiers, with the top-right models scoring high on both benchmarks.
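The following sketch shows how a Pearson coefficient and a least-squares fitted line of the kind drawn in the plot could be computed from paired macro averages. The data points are hypothetical placeholders, not values read from the figure, and the function names `pearson_r` and `fit_line` are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical macro averages for six models (NOT taken from the figure).
scli5_scores = [0.10, 0.15, 0.40, 0.50, 0.80, 0.90]
gsm8k_scores = [0.05, 0.10, 0.30, 0.25, 0.60, 0.70]

r = pearson_r(scli5_scores, gsm8k_scores)
slope, intercept = fit_line(scli5_scores, gsm8k_scores)

# Mean signed offset from the ideal line y = x; negative means points sit below it.
mean_offset = sum(y - x for x, y in zip(scli5_scores, gsm8k_scores)) / len(scli5_scores)

print(f"r = {r:.3f}, fitted line: y = {slope:.3f} * x + {intercept:.3f}")
print(f"mean offset from y = x: {mean_offset:.3f}")
```

Comparing the fitted slope and intercept against the ideal line (slope 1, intercept 0) is one way to quantify how closely the two lines agree, rather than judging it by eye.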

---

## Scatter Plot: GSM8K-SC vs PRM800K-SC (r = 0.559)

### Overview
Scatter plot comparing GSM8K-SC and PRM800K-SC macro averages with fitted and ideal trend lines. Points labeled with model names.

### Components/Axes
- **X-axis**: GSM8K-SC macro average (0.0-0.6)
- **Y-axis**: PRM800K-SC macro average (0.0-0.6)
- **Legend**:
  - Dashed red: Fitted line
  - Dotted gray: Ideal line (y=x)

### Detailed Analysis
- **Trend**:
  - Fitted line (r=0.559) shows moderate positive correlation
  - Points show more dispersion than previous plot
- **Data Points**:
  - **Bottom-left cluster** (0.0-0.2 GSM8K, 0.0-0.2 PRM800K):
    - Mistral-Small-24B-Instruct-v1.0
    - Qwen3-32B
    - Qwen2.5-7B-Instruct
  - **Middle cluster** (0.2-0.4 GSM8K, 0.1-0.3 PRM800K):
    - Qwen3-14B
    - Llama-3-70B-Instruct-v1.0
  - **Top-right cluster** (0.4-0.6 GSM8K, 0.3-0.6 PRM800K):
    - DeepSeek-V3-0324
    - Llama-4-Maverick-17B-Instruct
    - Qwen2.5-72B-Instruct

### Key Observations
- Weaker correlation (r=0.559) compared to SCLI5-GSM8K relationship
- More dispersed data points indicate less consistent relationships
- High-performing models show better alignment with the fitted line

### Interpretation
The moderate correlation suggests that while there's some relationship between GSM8K-SC and PRM800K-SC performance, it's less consistent than the SCLI5-GSM8K relationship. The dispersion of points indicates that models may perform differently across these benchmarks, suggesting varying strengths in different reasoning domains.
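One way to make the difference between the two correlations concrete is to square them: in a simple linear fit, r² is the share of variance in one benchmark accounted for by the other. The sketch below uses only the two r values reported in the plot titles; the variable and function names are illustrative.

```python
# Correlation coefficients as reported in the two scatter plot titles.
r_scli5_gsm8k = 0.724
r_gsm8k_prm800k = 0.559


def variance_explained(r):
    """Coefficient of determination for a simple linear fit."""
    return r * r


r2_scli5_gsm8k = variance_explained(r_scli5_gsm8k)
r2_gsm8k_prm800k = variance_explained(r_gsm8k_prm800k)

print(f"SCLI5 vs GSM8K-SC:      r^2 = {r2_scli5_gsm8k:.3f}")
print(f"GSM8K-SC vs PRM800K-SC: r^2 = {r2_gsm8k_prm800k:.3f}")
```

Under this reading, SCLI5 accounts for roughly half of the variance in GSM8K-SC (about 0.52), while GSM8K-SC accounts for under a third of the variance in PRM800K-SC (about 0.31).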

---

## Cross-Plot Analysis
1. **Consistency**:
   - SCLI5-GSM8K shows strongest correlation (r=0.724)
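   - GSM8K-PRM800K shows the weaker of the two plotted correlations (r=0.559); the weakest pair in the heatmap, scli5-prm800k_sc (0.49), has no scatter plot here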
2. **Model Performance**:
   - Qwen2.5-72B-Instruct and DeepSeek-V3-0324 appear in the top-right cluster of both scatter plots
   - Mistral-Small-24B-Instruct-v1.0 and Qwen3-32B appear in the bottom-left cluster of both scatter plots
3. **Trend Lines**:
   - Fitted lines in both scatter plots lie close to the ideal lines (y=x)
   - The SCLI5-GSM8K fitted line has the steeper slope; slope reflects the rate of change between the two scores, while the strength of the relationship is given by r (0.724 vs 0.559)
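Slope and correlation are related but distinct: for a least-squares fit, slope = r · (s_y / s_x), where s_x and s_y are the standard deviations of the two variables. The sketch below illustrates this with hypothetical data (not taken from the plots): stretching the y values doubles the slope while leaving r unchanged. The function name `summarize` is an assumption for this example.

```python
import math


def summarize(xs, ys):
    """Return (pearson_r, least_squares_slope) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope


# Hypothetical scores, used only to illustrate the point.
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
ys = [0.20, 0.25, 0.50, 0.55, 0.80]
ys_stretched = [2 * y for y in ys]  # same pattern, twice the spread

r_base, slope_base = summarize(xs, ys)
r_stretched, slope_stretched = summarize(xs, ys_stretched)

print(f"base:      r = {r_base:.3f}, slope = {slope_base:.3f}")
print(f"stretched: r = {r_stretched:.3f}, slope = {slope_stretched:.3f}")
```

Because the slope depends on the spread of each axis, comparing slopes across plots with different axis ranges says little about which relationship is stronger; the r values are the appropriate comparison.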

## Conclusion
The correlation matrix and scatter plots reveal distinct performance patterns across the three reasoning benchmarks. The strong SCLI5-GSM8K relationship suggests shared characteristics in these benchmarks, while the weaker relationships involving PRM800K-SC indicate more divergent performance characteristics. Model performance tiers are distinguishable, and the top and bottom clusters largely overlap between the two scatter plots.