## Correlation Matrix and Scatter Plots: Model Performance Analysis
### Overview
The image contains three charts arranged horizontally. From left to right: a correlation matrix heatmap, a scatter plot comparing SCLI5 vs GSM8K-SC performance, and a scatter plot comparing GSM8K-SC vs PRM800K-SC performance. Together they analyze how the mean accuracy scores of various language models correlate across three evaluation datasets.
### Components/Axes
**Chart 1 (Left): Correlation Matrix**
* **Title:** "Correlation matrix of mean accuracy across datasets"
* **Axes Labels (Y-axis, top to bottom):** `scli5`, `gsm8k_sc`, `prm800k_sc`
* **Axes Labels (X-axis, left to right):** `scli5`, `gsm8k_sc`, `prm800k_sc`
* **Color Bar Legend (Right side):** A vertical gradient bar ranging from blue (-1.00) to red (1.00), with tick marks at -1.00, -0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75, 1.00.
**Chart 2 (Middle): Scatter Plot**
* **Title:** "SCLI5 vs GSM8K-SC (r = 0.724)"
* **X-axis:** "SCLI5 macro average" (Scale: 0.0 to 1.0)
* **Y-axis:** "GSM8K-SC macro average" (Scale: 0.0 to 1.0)
* **Legend (Top-left):** Contains two entries: "Fitted line" (red dashed line) and "Ideal line" (gray dotted line).
* **Data Points:** Blue circles, each labeled with a model name.
**Chart 3 (Right): Scatter Plot**
* **Title:** "GSM8K-SC vs PRM800K-SC (r = 0.559)"
* **X-axis:** "GSM8K-SC macro average" (Scale: 0.0 to 0.6)
* **Y-axis:** "PRM800K-SC macro average" (Scale: 0.0 to 0.6)
* **Legend (Top-left):** Contains two entries: "Fitted line" (red dashed line) and "Ideal line" (gray dotted line).
* **Data Points:** Green circles, each labeled with a model name.
### Detailed Analysis
**Chart 1: Correlation Matrix**
The heatmap displays Pearson correlation coefficients between the mean accuracy scores on three datasets.
* **Diagonal (Self-correlation):** All values are `1` (dark red), as expected.
* **Off-diagonal Values:**
* `scli5` vs `gsm8k_sc`: **0.72** (medium orange-red)
* `scli5` vs `prm800k_sc`: **0.49** (light orange)
* `gsm8k_sc` vs `prm800k_sc`: **0.56** (medium orange)
* **Interpretation:** The strongest correlation (0.72) is between SCLI5 and GSM8K-SC. The weakest correlation (0.49) is between SCLI5 and PRM800K-SC.
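A matrix of this form can be reproduced with `pandas.DataFrame.corr`. Below is a minimal sketch; the actual per-model scores behind the figure are not recoverable from the heatmap, so the input numbers are synthetic and illustrative only:

```python
import numpy as np
import pandas as pd

# Synthetic per-model mean accuracies -- illustrative only; the real
# per-model scores behind the figure are not shown in the heatmap.
rng = np.random.default_rng(0)
base = rng.uniform(0.0, 1.0, 12)                       # stand-in for SCLI5 scores
scores = pd.DataFrame({
    "scli5": base,
    "gsm8k_sc": 0.6 * base + rng.normal(0, 0.10, 12),  # loosely coupled to scli5
    "prm800k_sc": 0.4 * base + rng.normal(0, 0.15, 12),
})

# Pearson correlation matrix of mean accuracy across datasets, as in the heatmap
corr = scores.corr(method="pearson")
print(corr.round(2))
```

Rendering `corr` with a diverging colormap clipped to [-1, 1] (e.g. matplotlib's `RdBu_r` with `vmin=-1, vmax=1`) reproduces the blue-to-red gradient of the figure's color bar.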
**Chart 2: SCLI5 vs GSM8K-SC Scatter Plot**
* **Trend:** The data points show a clear positive linear trend. The red "Fitted line" slopes upward from left to right, confirming the positive correlation (r=0.724). Most points lie below the gray "Ideal line" (y=x), indicating that models generally score higher on SCLI5 than on GSM8K-SC.
* **Data Points (Approximate Coordinates - X:SCLI5, Y:GSM8K-SC):**
* `Qwen2.5-72B-Instruct`: (~0.95, ~0.58) - Highest on both axes.
* `Llama-4-Maverick-...`: (~0.90, ~0.40)
* `DeepSeek-V3-0324`: (~0.85, ~0.40)
* `Llama-3.3-70B-Ins...`: (~0.60, ~0.28)
* `Qwen2.5-7B-Instruct`: (~0.55, ~0.19)
* `Llama-4-Scout-17B...`: (~0.95, ~0.24) - Notable outlier, high SCLI5 but lower GSM8K-SC.
* `Qwen2-7B-Instruct`: (~0.60, ~0.08)
* `Qwen3-14B`: (~0.05, ~0.09)
* `Qwen3-30B-A3B`: (~0.15, ~0.05)
* `Qwen3-32B`: (~0.05, ~0.05)
* `Mistral-Small-24B...`: (~0.05, ~0.01)
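The Pearson r and the red "Fitted line" can be recomputed directly from the coordinates listed above. A sketch with NumPy (since the coordinates are eyeballed from the plot, the recomputed r will deviate from the figure's 0.724):

```python
import numpy as np

# Approximate (SCLI5, GSM8K-SC) readings from the middle panel, in the order
# the models are listed above; values are eyeballed, not exact.
x = np.array([0.95, 0.90, 0.85, 0.60, 0.55, 0.95, 0.60, 0.05, 0.15, 0.05, 0.05])
y = np.array([0.58, 0.40, 0.40, 0.28, 0.19, 0.24, 0.08, 0.09, 0.05, 0.05, 0.01])

r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, 1)   # least-squares "Fitted line"
print(f"r = {r:.3f}, fit: y = {slope:.2f}x + {intercept:.2f}")
```

A positive slope and positive r confirm the upward trend; the gray "Ideal line" is simply y = x, so any point below it is a model scoring lower on GSM8K-SC than on SCLI5.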
**Chart 3: GSM8K-SC vs PRM800K-SC Scatter Plot**
* **Trend:** The data points show a moderate positive linear trend. The red "Fitted line" slopes upward, confirming the correlation (r=0.559). The spread of points around the fitted line is wider than in the middle chart, indicating a noisier relationship. Most points are below the "Ideal line."
* **Data Points (Approximate Coordinates - X:GSM8K-SC, Y:PRM800K-SC):**
* `DeepSeek-V3-0324`: (~0.40, ~0.48) - Highest on both axes.
* `Llama-4-Maverick-...`: (~0.40, ~0.46)
* `Qwen3-235B-A22B`: (~0.08, ~0.35) - Notable outlier, very low GSM8K-SC but high PRM800K-SC.
* `Qwen3-14B`: (~0.10, ~0.26)
* `Llama-4-Scout-17B...`: (~0.25, ~0.26)
* `Llama-3.3-70B-Ins...`: (~0.28, ~0.25)
* `Qwen3-30B-A3B`: (~0.05, ~0.19)
* `Qwen2.5-7B-Instruct`: (~0.19, ~0.14)
* `Qwen2.5-72B-Instruct`: (~0.58, ~0.15) - Notable outlier, highest GSM8K-SC but relatively low PRM800K-SC.
* `Qwen3-32B`: (~0.08, ~0.08)
* `Qwen2-7B-Instruct`: (~0.10, ~0.06)
* `Mistral-Small-24B...`: (~0.02, ~0.02)
### Key Observations
1. **Strongest Link:** Performance on SCLI5 and GSM8K-SC is most strongly correlated (r=0.724).
2. **General Underperformance:** In both scatter plots, most models fall below the "Ideal line" (y = x), i.e., they score lower on the y-axis dataset (GSM8K-SC in the middle chart, PRM800K-SC in the right chart) than on the corresponding x-axis dataset.
3. **Significant Outliers:**
* `Llama-4-Scout-17B...` in the middle chart: High SCLI5 score but disproportionately lower GSM8K-SC score.
* `Qwen3-235B-A22B` in the right chart: Very low GSM8K-SC score but a high PRM800K-SC score.
* `Qwen2.5-72B-Instruct` in the right chart: The highest GSM8K-SC score but a relatively low PRM800K-SC score, breaking the general trend.
4. **Model Clustering:** Lower-performing models (e.g., `Mistral-Small-24B...`, `Qwen3-32B`) cluster near the origin (0,0) in both scatter plots.
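The outlier calls above can be made quantitative by standardizing the residuals from the fitted line. A sketch for the right panel, again using the approximate coordinates read off the plot; note that with eyeballed inputs and a 1.5-sigma cutoff (an arbitrary choice), only the most extreme deviation, `Qwen2.5-72B-Instruct`, may be flagged:

```python
import numpy as np

# Approximate (GSM8K-SC, PRM800K-SC) readings from the right panel,
# eyeballed from the plot; model order matches the list above.
models = ["DeepSeek-V3-0324", "Llama-4-Maverick-...", "Qwen3-235B-A22B",
          "Qwen3-14B", "Llama-4-Scout-17B...", "Llama-3.3-70B-Ins...",
          "Qwen3-30B-A3B", "Qwen2.5-7B-Instruct", "Qwen2.5-72B-Instruct",
          "Qwen3-32B", "Qwen2-7B-Instruct", "Mistral-Small-24B..."]
x = np.array([0.40, 0.40, 0.08, 0.10, 0.25, 0.28, 0.05, 0.19, 0.58, 0.08, 0.10, 0.02])
y = np.array([0.48, 0.46, 0.35, 0.26, 0.26, 0.25, 0.19, 0.14, 0.15, 0.08, 0.06, 0.02])

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)           # vertical distance to fitted line
z = (resid - resid.mean()) / resid.std()      # standardized residuals

# Flag points more than 1.5 standard deviations off the fit
outliers = [m for m, flag in zip(models, np.abs(z) > 1.5) if flag]
print(outliers)
```

Loosening the threshold (e.g. to about 1.3 sigma) would also pick up `Qwen3-235B-A22B`, matching the visual reading of the chart.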
### Interpretation
The data suggests that the evaluation datasets (SCLI5, GSM8K-SC, PRM800K-SC) measure related but distinct capabilities of language models. The strong correlation between SCLI5 and GSM8K-SC indicates these two benchmarks may be testing similar underlying skills (plausibly mathematical or logical reasoning, since GSM8K is a grade-school math word-problem benchmark). The weaker correlations with PRM800K-SC imply it assesses a different dimension of model performance.
The consistent pattern of models scoring lower on the second dataset in each pair could indicate that GSM8K-SC and PRM800K-SC are more difficult than SCLI5 and GSM8K-SC, respectively, for this set of models. The notable outliers are crucial: they represent models with specialized strengths or weaknesses. For example, `Qwen3-235B-A22B`'s performance profile suggests it may be uniquely optimized for the tasks in PRM800K-SC while lacking in GSM8K-SC skills. Conversely, `Qwen2.5-72B-Instruct` excels at GSM8K-SC but does not transfer that advantage to PRM800K-SC to the same degree as other top models. This analysis highlights that model evaluation is multi-faceted, and a single aggregate score can mask significant performance variations across different task types.