Image eddd3f4ecea0...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview
INTEL_VERIFIED
# Technical Document Extraction: DeepSeekMoE 16B vs. DeepSeek 7B Performance Comparison

This document provides a comprehensive extraction of data from a grid of 18 line charts comparing the performance of two AI models over the course of training.

## 1. General Metadata and Global Parameters

*   **Image Type:** A 6x3 grid of performance line charts.
*   **X-Axis (All Charts):** `# Training Tokens (B)`
    *   **Range:** 0 to 2000 Billion tokens.
    *   **Markers:** 0, 500, 1000, 1500, 2000.
*   **Y-Axis (All Charts):** `Performance`
    *   **Scale:** Linear, varies per benchmark (see individual chart details).
*   **Legend (All Charts):**
    *   **DeepSeekMoE 16B:** Represented by an **Orange** line.
    *   **DeepSeek 7B (Dense):** Represented by a **Blue** line.
    *   **Spatial Placement:** Bottom-right quadrant of each individual plot.

---

## 2. Individual Chart Data Extraction

The charts are processed row-by-row, from left to right.

### Row 1: General Reasoning & Knowledge
| Benchmark Title | Y-Axis Range | Trend Description | Comparative Analysis |
| :--- | :--- | :--- | :--- |
| **HellaSwag (Acc.)** | 0.3 - 0.7 | Both lines show logarithmic growth, plateauing after 1000B tokens. | **Orange** consistently maintains a slight lead (~0.02-0.03) over **Blue** until the very end where they converge near 0.75. |
| **PIQA (Acc.)** | 0.60 - 0.80 | Rapid initial growth followed by high-variance fluctuations. | **Orange** stays slightly above **Blue** for the majority of training, peaking near 0.80. |
| **ARC-easy (Acc.)** | 0.4 - 0.7 | Steady upward trend with significant noise. | Both models perform similarly, oscillating around each other. Both finish near 0.68. |

### Row 2: Academic & Reasoning
| Benchmark Title | Y-Axis Range | Trend Description | Comparative Analysis |
| :--- | :--- | :--- | :--- |
| **ARC-challenge (Acc.)** | 0.2 - 0.5 | Linear-logarithmic growth. | **Orange** and **Blue** are tightly coupled. **Orange** shows slightly higher peaks in the mid-training phase. |
| **RACE-middle (Acc.)** | 0.4 - 0.6 | Rapid rise in first 250B, then slow steady climb. | **Blue** and **Orange** are nearly identical, finishing at approximately 0.64. |
| **RACE-high (Acc.)** | 0.30 - 0.45 | Steady upward slope. | **Blue** starts slightly higher; **Orange** catches up by 1000B. Both finish near 0.47. |

### Row 3: Reading Comprehension & Coding
| Benchmark Title | Y-Axis Range | Trend Description | Comparative Analysis |
| :--- | :--- | :--- | :--- |
| **DROP (EM)** | 0.1 - 0.3 | Strong upward trend with high variance. | **Blue** leads slightly in the first half; **Orange** matches or exceeds it in the final 500B tokens. |
| **GSM8K (EM)** | 0.00 - 0.20 | Jittery upward trend; growth accelerates after 1500B. | **Orange** shows a distinct advantage over **Blue** after 1500B tokens, reaching 0.20 while Blue is at ~0.17. |
| **HumanEval (Pass@1)** | 0.0 - 0.3 | Extremely high variance (spiky) upward trend. | **Orange** generally maintains higher peaks than **Blue** throughout the training duration. |

### Row 4: Knowledge & QA
| Benchmark Title | Y-Axis Range | Trend Description | Comparative Analysis |
| :--- | :--- | :--- | :--- |
| **MBPP (Pass@1)** | 0.0 - 0.4 | High variance, steady climb. | **Orange** leads significantly between 500B-1500B tokens. Both converge at ~0.38 at 2000B. |
| **TriviaQA (EM)** | 0.0 - 0.6 | Smooth logarithmic curve. | **Orange** shows a clear and consistent lead over **Blue** from 250B tokens onward. |
| **NaturalQuestions (EM)** | 0.00 - 0.25 | Smooth logarithmic curve. | **Orange** significantly outperforms **Blue** throughout the entire training process. |

### Row 5: Multi-task & Linguistic (Chinese/English)
| Benchmark Title | Y-Axis Range | Trend Description | Comparative Analysis |
| :--- | :--- | :--- | :--- |
| **MMLU (Acc.)** | 0.25 - 0.45 | Late-stage acceleration. | **Blue** (Dense) significantly outperforms **Orange** (MoE) after 1000B tokens. |
| **WinoGrande (Acc.)** | 0.50 - 0.70 | Rapid initial rise, then plateau with noise. | Both models are neck-and-neck, finishing around 0.69. |
| **CLUEWSC (EM)** | 0.0 - 0.6 | Rapid rise to 0.6, then flat. | Both models perform almost identically, plateauing early. |

### Row 6: Chinese Benchmarks
| Benchmark Title | Y-Axis Range | Trend Description | Comparative Analysis |
| :--- | :--- | :--- | :--- |
| **CEval (Acc.)** | 0.25 - 0.45 | Steady climb with high variance. | **Blue** (Dense) maintains a consistent lead over **Orange** (MoE) after 750B tokens. |
| **CMMLU (Acc.)** | 0.25 - 0.45 | Steady climb. | **Blue** (Dense) shows a widening lead over **Orange** (MoE) as training progresses. |
| **CHID (Acc.)** | 0.5 - 0.9 | Very rapid initial rise, then near-total plateau. | Both models are identical, plateauing at ~0.88. |

---

## 3. Summary of Trends

*   **DeepSeekMoE 16B (Orange) Strengths:** Shows superior performance in Reading Comprehension and Question Answering tasks like *TriviaQA*, *NaturalQuestions*, and *GSM8K*.
*   **DeepSeek 7B Dense (Blue) Strengths:** Shows a clear advantage in multi-task knowledge benchmarks, specifically *MMLU*, *CEval*, and *CMMLU*.
*   **Parity:** On many standard reasoning benchmarks (*HellaSwag*, *ARC*, *RACE*, *WinoGrande*), the two architectures perform nearly identically despite the difference in total parameters vs. active parameters.
*   **Convergence:** Most benchmarks show that performance has not yet fully plateaued at 2000B tokens, suggesting further gains are possible with more training.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

eddd3f4ecea0b94f3fcc53ad

FOUND IN PAPERS

EXPERT: gemini-3-flash-free VERSION 1