## Scatter Plot: Language Model Performance Delta Across Datasets
### Overview
This image is a scatter plot comparing the performance delta (Δ, in percent) of various large language models (LLMs) across seven benchmark datasets. Each data point shows one model's performance on one dataset relative to a baseline (Δ=0). Marker shapes distinguish three model types: Open LLMs, Closed LLMs, and Reasoning LLMs.
### Components/Axes
* **Y-Axis:** Labeled "Δ (%)". The scale ranges from -15 to 25, with major tick marks every 5 units. A horizontal dashed green line at Δ=0 is labeled "Baseline (Δ=0)".
* **X-Axis:** Labeled "Dataset". It lists seven categorical datasets: `HotpotQA`, `CS-QA`, `GPQA`, `AQUA`, `GSM8K`, `MATH`, and `HumanEval`. Vertical dashed lines separate each dataset category.
* **Legend (Positioned on the right side):**
* **Models (by color):**
* `LLaMA3.1-8B` (Teal circle)
* `LLaMA3.1-70B` (Light green circle)
* `Qwen2.5-7B` (Light blue circle)
* `Qwen2.5-72B` (Red circle)
* `Claude3.5` (Dark blue circle)
* `GPT-3.5` (Orange pentagon)
* `GPT-4o` (Yellow-green pentagon)
* `QWQ-32B` (Pink diamond)
* `DeepSeek-V3` (Purple diamond)
* **Model Type (by shape):**
* `Open LLM` (Circle)
* `Closed LLM` (Pentagon)
* `Reasoning LLM` (Diamond)
* **Baseline:** `Baseline (Δ=0)` (Green dashed line)
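To make this layout concrete, here is a minimal matplotlib sketch of the chart's structure. The three plotted points are approximate HotpotQA readings taken from the analysis below; colors, marker shapes, and line styles follow the legend, while figure size, separator styling, and exact legend placement are assumptions.

```python
import matplotlib.pyplot as plt

# Marker shape encodes model type, per the legend.
MARKERS = {"Open LLM": "o", "Closed LLM": "p", "Reasoning LLM": "D"}
DATASETS = ["HotpotQA", "CS-QA", "GPQA", "AQUA", "GSM8K", "MATH", "HumanEval"]

fig, ax = plt.subplots(figsize=(10, 4))

# Three example points: approximate HotpotQA readings (see the analysis below).
ax.scatter(0, -13, marker=MARKERS["Open LLM"], color="teal", label="LLaMA3.1-8B")
ax.scatter(0, 5, marker=MARKERS["Closed LLM"], color="orange", label="GPT-3.5")
ax.scatter(0, 17, marker=MARKERS["Reasoning LLM"], color="purple", label="DeepSeek-V3")

# Baseline at Δ=0 and dashed vertical separators between dataset categories.
ax.axhline(0, color="green", linestyle="--", label="Baseline (Δ=0)")
for x in range(len(DATASETS) - 1):
    ax.axvline(x + 0.5, color="gray", linestyle="--", linewidth=0.5)

ax.set_xticks(range(len(DATASETS)))
ax.set_xticklabels(DATASETS)
ax.set_xlabel("Dataset")
ax.set_ylabel("Δ (%)")
ax.set_ylim(-15, 25)
ax.set_yticks(range(-15, 26, 5))
ax.legend(loc="center left", bbox_to_anchor=(1.02, 0.5))  # legend on the right
fig.tight_layout()
plt.show()
```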
### Detailed Analysis
The performance deltas below are approximate values read against the chart's gridlines.
**1. HotpotQA:**
* `LLaMA3.1-8B` (Teal, Circle): ~ -13%
* `Qwen2.5-7B` (Light blue, Circle): ~ -2%
* `LLaMA3.1-70B` (Light green, Circle): ~ +3%
* `Qwen2.5-72B` (Red, Circle): ~ +4%
* `Claude3.5` (Dark blue, Circle): ~ +1%
* `GPT-3.5` (Orange, Pentagon): ~ +5%
* `GPT-4o` (Yellow-green, Pentagon): ~ +3%
* `QWQ-32B` (Pink, Diamond): ~ -1%
* `DeepSeek-V3` (Purple, Diamond): ~ +17%
**2. CS-QA:**
* `LLaMA3.1-8B` (Teal, Circle): ~ -8%
* `Qwen2.5-7B` (Light blue, Circle): ~ -9%
* `LLaMA3.1-70B` (Light green, Circle): ~ -6%
* `Qwen2.5-72B` (Red, Circle): ~ 0%
* `Claude3.5` (Dark blue, Circle): ~ +7%
* `GPT-3.5` (Orange, Pentagon): ~ +8%
* `GPT-4o` (Yellow-green, Pentagon): ~ +6%
* `QWQ-32B` (Pink, Diamond): ~ -8%
* `DeepSeek-V3` (Purple, Diamond): ~ -4%
**3. GPQA:**
* `LLaMA3.1-8B` (Teal, Circle): ~ +9%
* `Qwen2.5-7B` (Light blue, Circle): ~ +20%
* `LLaMA3.1-70B` (Light green, Circle): ~ +20%
* `Qwen2.5-72B` (Red, Circle): ~ +16%
* `Claude3.5` (Dark blue, Circle): ~ +9%
* `GPT-3.5` (Orange, Pentagon): ~ +14%
* `GPT-4o` (Yellow-green, Pentagon): ~ +16%
* `QWQ-32B` (Pink, Diamond): ~ +11%
* `DeepSeek-V3` (Purple, Diamond): ~ +12%
**4. AQUA:**
* `LLaMA3.1-8B` (Teal, Circle): ~ -5%
* `Qwen2.5-7B` (Light blue, Circle): ~ -2%
* `LLaMA3.1-70B` (Light green, Circle): ~ +1%
* `Qwen2.5-72B` (Red, Circle): ~ +3%
* `Claude3.5` (Dark blue, Circle): ~ +1%
* `GPT-3.5` (Orange, Pentagon): ~ +16%
* `GPT-4o` (Yellow-green, Pentagon): ~ +15%
* `QWQ-32B` (Pink, Diamond): ~ +1%
* `DeepSeek-V3` (Purple, Diamond): ~ +5%
**5. GSM8K:**
* `LLaMA3.1-8B` (Teal, Circle): ~ -1%
* `Qwen2.5-7B` (Light blue, Circle): ~ +9%
* `LLaMA3.1-70B` (Light green, Circle): ~ +5%
* `Qwen2.5-72B` (Red, Circle): ~ +9%
* `Claude3.5` (Dark blue, Circle): ~ +2%
* `GPT-3.5` (Orange, Pentagon): ~ +22%
* `GPT-4o` (Yellow-green, Pentagon): ~ +11%
* `QWQ-32B` (Pink, Diamond): ~ +10%
* `DeepSeek-V3` (Purple, Diamond): ~ +4%
**6. MATH:**
* `LLaMA3.1-8B` (Teal, Circle): ~ +9%
* `Qwen2.5-7B` (Light blue, Circle): ~ +4%
* `LLaMA3.1-70B` (Light green, Circle): ~ +4%
* `Qwen2.5-72B` (Red, Circle): ~ +1%
* `Claude3.5` (Dark blue, Circle): ~ +2%
* `GPT-3.5` (Orange, Pentagon): ~ +9%
* `GPT-4o` (Yellow-green, Pentagon): ~ +10%
* `QWQ-32B` (Pink, Diamond): ~ -4%
* `DeepSeek-V3` (Purple, Diamond): ~ -4%
**7. HumanEval:**
* `LLaMA3.1-8B` (Teal, Circle): ~ -5%
* `Qwen2.5-7B` (Light blue, Circle): ~ -5%
* `LLaMA3.1-70B` (Light green, Circle): ~ +7%
* `Qwen2.5-72B` (Red, Circle): ~ -10%
* `Claude3.5` (Dark blue, Circle): ~ -1%
* `GPT-3.5` (Orange, Pentagon): ~ +1%
* `GPT-4o` (Yellow-green, Pentagon): ~ +4%
* `QWQ-32B` (Pink, Diamond): ~ +10%
* `DeepSeek-V3` (Purple, Diamond): ~ -4%
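For convenience, the approximate readings above can be collected into a single table and summarized; below is a minimal sketch using pandas (the values are chart readings, not official benchmark scores):

```python
import pandas as pd

MODELS = ["LLaMA3.1-8B", "Qwen2.5-7B", "LLaMA3.1-70B", "Qwen2.5-72B",
          "Claude3.5", "GPT-3.5", "GPT-4o", "QWQ-32B", "DeepSeek-V3"]

# Approximate Δ (%) per dataset, read from the chart grid
# (one value per model, in MODELS order).
DELTAS = {
    "HotpotQA":  [-13, -2,  3,   4,  1,  5,  3, -1, 17],
    "CS-QA":     [ -8, -9, -6,   0,  7,  8,  6, -8, -4],
    "GPQA":      [  9, 20, 20,  16,  9, 14, 16, 11, 12],
    "AQUA":      [ -5, -2,  1,   3,  1, 16, 15,  1,  5],
    "GSM8K":     [ -1,  9,  5,   9,  2, 22, 11, 10,  4],
    "MATH":      [  9,  4,  4,   1,  2,  9, 10, -4, -4],
    "HumanEval": [ -5, -5,  7, -10, -1,  1,  4, 10, -4],
}

df = pd.DataFrame(DELTAS, index=MODELS)

# Spread (max - min) per dataset: where do models diverge the most?
print((df.max() - df.min()).sort_values(ascending=False))

# Mean delta per model: which models sit above the baseline on average?
print(df.mean(axis=1).sort_values(ascending=False))
```

On these readings, `HotpotQA` shows the widest spread (~30 points) and `GPQA` the narrowest (~11), and the two closed models average the highest deltas overall.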
### Key Observations
1. **High Variance:** Performance deltas vary dramatically across both models and datasets. No single model dominates all benchmarks.
2. **Dataset Sensitivity:** Models diverge most sharply on `HotpotQA` (a spread of roughly 30 points) and `GSM8K` (roughly 23 points), suggesting these datasets differentiate model capabilities most strongly. `GPQA` is the exception in the other direction: every model lands above the baseline there (+9% to +20%). (See the summary computation above.)
3. **Top Performers:** `GPT-3.5` (Orange pentagon) achieves the single highest delta on the chart (~+22% on GSM8K). `Qwen2.5-7B` and `LLaMA3.1-70B` also show strong peaks (~+20% on GPQA).
4. **Notable Underperformance:** `LLaMA3.1-8B` (Teal circle) has the lowest delta (~-13% on HotpotQA). `Qwen2.5-72B` (Red circle) shows a significant drop on HumanEval (~-10%).
5. **Reasoning LLMs (Diamonds):** `DeepSeek-V3` (Purple) shows high variance, with a strong positive outlier on HotpotQA (~+17%) but negative performance on MATH and HumanEval. `QWQ-32B` (Pink) is generally closer to the baseline.
6. **Open vs. Closed:** Closed LLMs (Pentagons: `GPT-3.5`, `GPT-4o`) post positive deltas on all seven datasets, but are not universally superior; open or reasoning models take the top spot on HotpotQA, GPQA, and HumanEval.
### Interpretation
This chart visualizes the **non-uniform progress and specialization** in the current LLM landscape. The data suggests:
* **Benchmark Sensitivity:** A model's "capability" is not a single number but a profile across tasks. Strengths in mathematical reasoning (GSM8K, MATH) do not guarantee strength in question answering (HotpotQA, CS-QA) or code generation (HumanEval).
* **The "No Free Lunch" Theorem in AI:** The absence of a model that excels in all categories indicates that architectural choices, training data, and optimization targets create trade-offs. For example, a model fine-tuned for math may see regressions on other tasks.
* **The Baseline is Key:** The Δ=0 baseline is critical for interpretation. It likely represents the performance of a reference model (e.g., an earlier version or a standard baseline). Points above the line indicate improvement over this reference; points below indicate regression. The chart therefore measures **relative advancement**, not absolute accuracy.
* **Strategic Implications:** For a practitioner, this chart argues for **model selection based on the specific task**. Choosing a model means consulting its performance profile on benchmarks analogous to the intended application, rather than relying on aggregate scores or reputation (a minimal selection sketch follows this list). The high variance, especially among open models, highlights the rapid and divergent evolution of the field.
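As an illustration of task-based selection, here is a tiny sketch using two of the chart's benchmarks. The values are the approximate readings above, and picking the "best" model purely by Δ is a simplification that ignores cost, latency, and licensing:

```python
# Approximate chart readings for two benchmarks (subset of the data above).
deltas = {
    "GSM8K":     {"GPT-3.5": 22, "GPT-4o": 11, "QWQ-32B": 10, "Qwen2.5-7B": 9},
    "HumanEval": {"QWQ-32B": 10, "LLaMA3.1-70B": 7, "GPT-4o": 4, "GPT-3.5": 1},
}

# Task-specific selection: the best model by Δ differs per benchmark.
for task, scores in deltas.items():
    best = max(scores, key=scores.get)
    print(f"{task}: {best} (Δ ≈ +{scores[best]}%)")
```

Running this prints `GPT-3.5` for GSM8K but `QWQ-32B` for HumanEval, illustrating why a single aggregate ranking can mislead.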