## Bar Chart Composite: AI Model Benchmark Performance
### Overview
The image displays a composite of eight bar charts, organized into four thematic categories, comparing the performance of six different AI models across various standardized benchmarks. The models compared are: **Nanbeige4-3B** (highlighted in teal), **Qwen3-4B-2507**, **Qwen3-8B**, **Qwen3-14B**, **Qwen3-32B**, and **Qwen3-30B-A3B-2507** (all in shades of gray). The charts are grouped under the headings: Mathematical Reasoning, Scientific Reasoning, Tool Use & Coding, and Human Preference Alignment.
### Components/Axes
* **Chart Structure:** Eight individual bar charts arranged in a 2x4 grid (a minimal plotting sketch of one panel follows this section).
* **Categories (Top Headers):**
* Top Left: **Mathematical Reasoning**
* Top Right: **Scientific Reasoning**
* Bottom Left: **Tool Use & Coding**
* Bottom Right: **Human Preference Alignment**
* **Sub-Charts (Benchmark Titles):**
* Under Mathematical Reasoning: **AIME 2024**, **AIME 2025**
* Under Scientific Reasoning: **GPQA-Diamond**, **SuperGPQA**
* Under Tool Use & Coding: **BFCL-v4**, **Fullstack Bench**
* Under Human Preference Alignment: **ArenaHard-V2**, **Multi-Challenge**
* **X-Axis (All Charts):** Lists the six model names. The labels are rotated approximately 45 degrees for readability.
* **Y-Axis (All Charts):** Represents the benchmark score. The y-axes carry no visible tick labels; instead, each bar's score is printed directly above it.
* **Legend/Color Key:**
* **Teal Bar:** Nanbeige4-3B
* **Gray Bars (from left to right):** Qwen3-4B-2507, Qwen3-8B, Qwen3-14B, Qwen3-32B, Qwen3-30B-A3B-2507.
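The layout described above can be approximated with standard plotting tools. Below is a minimal matplotlib sketch of a single panel (AIME 2024, using the values listed in the Detailed Analysis that follows); the figure size, exact shades, and fonts are assumptions, since the description only specifies rotated x-labels, unticked y-axes, printed score labels, and the teal highlight.

```python
import matplotlib.pyplot as plt

# Illustrative reproduction of one panel (AIME 2024); layout details
# beyond those described (size, shades, fonts) are assumptions.
models = ["Nanbeige4-3B", "Qwen3-4B-2507", "Qwen3-8B",
          "Qwen3-14B", "Qwen3-32B", "Qwen3-30B-A3B-2507"]
scores = [90.4, 83.3, 76.0, 79.3, 81.4, 89.2]
colors = ["teal"] + ["gray"] * 5  # Nanbeige4-3B highlighted in teal

fig, ax = plt.subplots(figsize=(4, 3))
bars = ax.bar(models, scores, color=colors)
ax.set_title("AIME 2024")
ax.set_xticks(range(len(models)))
ax.set_xticklabels(models, rotation=45, ha="right")  # rotated labels, per the description
ax.set_yticks([])  # no explicit y-axis ticks, per the description
for bar, score in zip(bars, scores):
    # Print each score directly above its bar, as in the figure.
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
            f"{score}", ha="center", va="bottom")
fig.tight_layout()
plt.show()
```

Repeating this over a `plt.subplots(2, 4)` grid, with the four category headers each spanning a pair of panels, would approximate the full composite.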
### Detailed Analysis
#### **Mathematical Reasoning**
1. **AIME 2024:**
* Nanbeige4-3B: **90.4** (Highest)
* Qwen3-4B-2507: **83.3**
* Qwen3-8B: **76.0**
* Qwen3-14B: **79.3**
* Qwen3-32B: **81.4**
* Qwen3-30B-A3B-2507: **89.2** (Second highest)
* *Trend:* Nanbeige4-3B leads, followed closely by Qwen3-30B-A3B-2507. Performance dips for the intermediate Qwen models (8B, 14B, 32B), all of which score below Qwen3-4B-2507.
2. **AIME 2025:**
* Nanbeige4-3B: **85.6** (Highest)
* Qwen3-4B-2507: **81.3**
* Qwen3-8B: **67.3**
* Qwen3-14B: **70.4**
* Qwen3-32B: **72.9**
* Qwen3-30B-A3B-2507: **85.0** (Very close second)
* *Trend:* Similar pattern to AIME 2024. Nanbeige4-3B and Qwen3-30B-A3B-2507 are nearly tied at the top, with a significant drop for the 8B, 14B, and 32B models.
#### **Scientific Reasoning**
1. **GPQA-Diamond:**
* Nanbeige4-3B: **82.2** (Highest)
* Qwen3-4B-2507: **67.2**
* Qwen3-8B: **62.0**
* Qwen3-14B: **64.0**
* Qwen3-32B: **68.7**
* Qwen3-30B-A3B-2507: **73.4**
* *Trend:* Clear lead for Nanbeige4-3B. Within the Qwen series, Qwen3-4B-2507 outscores the 8B and 14B models, with scores rising again through 32B and 30B-A3B-2507; all remain below Nanbeige4-3B.
2. **SuperGPQA:**
* Nanbeige4-3B: **53.2**
* Qwen3-4B-2507: **46.7**
* Qwen3-8B: **39.1**
* Qwen3-14B: **46.8**
* Qwen3-32B: **54.1** (Slightly higher than Nanbeige4-3B)
* Qwen3-30B-A3B-2507: **56.8** (Highest)
* *Trend:* Qwen3-30B-A3B-2507 leads clearly here, and Qwen3-32B also edges past Nanbeige4-3B. Qwen3-8B scores notably worse than the rest of the series.
#### **Tool Use & Coding**
1. **BFCL-v4:**
* Nanbeige4-3B: **53.8** (Highest)
* Qwen3-4B-2507: **44.9**
* Qwen3-8B: **42.2**
* Qwen3-14B: **45.4**
* Qwen3-32B: **47.9**
* Qwen3-30B-A3B-2507: **48.6**
* *Trend:* Nanbeige4-3B holds a clear lead. Apart from a dip at Qwen3-8B, Qwen scores rise with scale but all remain below Nanbeige4-3B.
2. **Fullstack Bench:**
* Nanbeige4-3B: **48.0**
* Qwen3-4B-2507: **47.1**
* Qwen3-8B: **51.5**
* Qwen3-14B: **55.7**
* Qwen3-32B: **58.2** (Highest)
* Qwen3-30B-A3B-2507: **54.4**
* *Trend:* This benchmark shows a different pattern. Nanbeige4-3B is not the leader. Performance generally increases with Qwen model size, peaking at Qwen3-32B.
#### **Human Preference Alignment**
1. **ArenaHard-V2:**
* Nanbeige4-3B: **60.0** (Tied for Highest)
* Qwen3-4B-2507: **40.5**
* Qwen3-8B: **26.4** (Lowest across all charts)
* Qwen3-14B: **39.9**
* Qwen3-32B: **48.4**
* Qwen3-30B-A3B-2507: **60.0** (Tied for Highest)
* *Trend:* Nanbeige4-3B and Qwen3-30B-A3B-2507 are tied at the top. There is a very sharp drop for the Qwen3-8B model.
2. **Multi-Challenge:**
* Nanbeige4-3B: **41.2**
* Qwen3-4B-2507: **41.8** (Slightly higher than Nanbeige4-3B)
* Qwen3-8B: **35.8**
* Qwen3-14B: **36.4**
* Qwen3-32B: **39.2**
* Qwen3-30B-A3B-2507: **49.4** (Highest)
* *Trend:* Qwen3-30B-A3B-2507 is the clear leader. Nanbeige4-3B is competitive but narrowly outperformed by Qwen3-4B-2507 (all printed scores are consolidated in the sketch below).
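For cross-checking the observations that follow, here is an illustrative Python transcription of every score printed above, with a small helper that reports each benchmark's leader(s) and the global minimum. The dictionary and helper are not part of the figure; they simply make the tallies easy to verify.

```python
# Scores transcribed from the bars above; structure is illustrative.
scores = {
    "AIME 2024":       {"Nanbeige4-3B": 90.4, "Qwen3-4B-2507": 83.3, "Qwen3-8B": 76.0,
                        "Qwen3-14B": 79.3, "Qwen3-32B": 81.4, "Qwen3-30B-A3B-2507": 89.2},
    "AIME 2025":       {"Nanbeige4-3B": 85.6, "Qwen3-4B-2507": 81.3, "Qwen3-8B": 67.3,
                        "Qwen3-14B": 70.4, "Qwen3-32B": 72.9, "Qwen3-30B-A3B-2507": 85.0},
    "GPQA-Diamond":    {"Nanbeige4-3B": 82.2, "Qwen3-4B-2507": 67.2, "Qwen3-8B": 62.0,
                        "Qwen3-14B": 64.0, "Qwen3-32B": 68.7, "Qwen3-30B-A3B-2507": 73.4},
    "SuperGPQA":       {"Nanbeige4-3B": 53.2, "Qwen3-4B-2507": 46.7, "Qwen3-8B": 39.1,
                        "Qwen3-14B": 46.8, "Qwen3-32B": 54.1, "Qwen3-30B-A3B-2507": 56.8},
    "BFCL-v4":         {"Nanbeige4-3B": 53.8, "Qwen3-4B-2507": 44.9, "Qwen3-8B": 42.2,
                        "Qwen3-14B": 45.4, "Qwen3-32B": 47.9, "Qwen3-30B-A3B-2507": 48.6},
    "Fullstack Bench": {"Nanbeige4-3B": 48.0, "Qwen3-4B-2507": 47.1, "Qwen3-8B": 51.5,
                        "Qwen3-14B": 55.7, "Qwen3-32B": 58.2, "Qwen3-30B-A3B-2507": 54.4},
    "ArenaHard-V2":    {"Nanbeige4-3B": 60.0, "Qwen3-4B-2507": 40.5, "Qwen3-8B": 26.4,
                        "Qwen3-14B": 39.9, "Qwen3-32B": 48.4, "Qwen3-30B-A3B-2507": 60.0},
    "Multi-Challenge": {"Nanbeige4-3B": 41.2, "Qwen3-4B-2507": 41.8, "Qwen3-8B": 35.8,
                        "Qwen3-14B": 36.4, "Qwen3-32B": 39.2, "Qwen3-30B-A3B-2507": 49.4},
}

# Report the leader (or tied leaders) for each benchmark.
for bench, by_model in scores.items():
    top = max(by_model.values())
    leaders = [m for m, s in by_model.items() if s == top]
    print(f"{bench}: {' / '.join(leaders)} ({top})")

# Find the single lowest score anywhere in the composite.
low_score, low_model, low_bench = min(
    (s, m, b) for b, d in scores.items() for m, s in d.items())
print(f"Global minimum: {low_model} on {low_bench} at {low_score}")
```

Running it confirms the per-benchmark leaders and the 26.4 global minimum cited in the observations below.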
### Key Observations
1. **Nanbeige4-3B Dominance:** The Nanbeige4-3B model (teal) achieves the highest score, outright or tied, in 5 of the 8 benchmarks presented (AIME 2024, AIME 2025, GPQA-Diamond, BFCL-v4, and a tie with Qwen3-30B-A3B-2507 on ArenaHard-V2).
2. **Strongest Competitor:** The **Qwen3-30B-A3B-2507** model is the most consistent high performer among the Qwen series, often coming in a close second or even surpassing Nanbeige4-3B (as in SuperGPQA, Fullstack Bench, and Multi-Challenge).
3. **Performance vs. Scale (Qwen Series):** Within the Qwen models, performance does not scale monotonically with parameter count: Qwen3-4B-2507 often outscores the 8B and 14B models, and the 30B-A3B-2507 variant frequently outperforms the standard 32B model.
4. **Notable Low Point:** The **Qwen3-8B** model shows a significant performance dip, particularly in the ArenaHard-V2 benchmark where it scores only 26.4, the lowest value in the entire composite.
5. **Benchmark Variability:** No single model dominates every category. The relative performance shifts between mathematical, scientific, coding, and alignment-focused tasks.
### Interpretation
This composite chart provides a comparative snapshot of AI model capabilities across a diverse evaluation suite. The data suggests that **Nanbeige4-3B** is a highly capable and well-rounded model, excelling particularly in reasoning-heavy tasks such as the AIME and GPQA benchmarks. Its strength on ArenaHard-V2 also indicates good alignment with human preferences.
The **Qwen3 series** scales unevenly: greater size helps on some benchmarks (most cleanly on Fullstack Bench), but the 2507-suffixed variants punch above their size elsewhere. The **Qwen3-30B-A3B-2507** variant appears to be a particularly efficient or well-tuned design, as it frequently matches or beats the standard 32B model and competes directly with Nanbeige4-3B. Its victory on the **Multi-Challenge** benchmark suggests strong handling of diverse, complex tasks.
The poor showing of **Qwen3-8B** on ArenaHard-V2 is an outlier that may indicate a specific weakness in that model's alignment or a mismatch with that particular benchmark's evaluation criteria. Overall, the charts illustrate that model performance is highly task-dependent, and choosing the "best" model requires considering the specific application domain (e.g., math vs. coding vs. general assistant tasks). The visualization effectively communicates these nuanced comparisons through clear, direct score labeling and consistent color coding.