# Technical Data Extraction: Mixture-of-Depths (MoD) Performance Analysis

This document contains a detailed extraction of data from a multi-panel technical visualization comparing "Baseline" transformer models against "Mixture-of-Depths" (MoD) variants across parameters, FLOPs, and training time.

---

## 1. Main Chart: Loss vs. Parameters (Left Panel)

**Axis Labels:**
*   **Y-Axis:** Loss (Linear scale: 3.0 to 3.5)
*   **X-Axis:** Parameters (Logarithmic scale: 50M, 100M, 300M, 1B)

**Legend (Top Center):**
*   **Dark Purple Circle:** Baseline
*   **Teal Circle:** MoD (12.5% capacity, every 2)
*   **Indigo Circle:** MoD (50% capacity, every 2)
*   **Yellow Circle:** MoD (50% capacity, random routing, every 2)

**Data Series Trends and Key Points:**
*   **Baseline (Dark Purple):** Slopes downward to a minimum loss of ~3.14 at ~250M parameters, then curves upward toward 1B parameters.
*   **MoD 12.5% (Teal):** Slopes downward more aggressively than the baseline. It achieves a lower minimum loss (~3.10) at ~300M-400M parameters.
*   **MoD 50% (Indigo):** Follows a similar trajectory to the 12.5% MoD but stays slightly higher in loss across the parameter range.
*   **MoD 50% Random (Yellow):** Consistently the worst performer. Slopes downward but remains significantly higher in loss (min ~3.20) compared to learned routing.

**Specific Callouts (Numbered Markers):**
*   **① Baseline:** Points to the baseline curve at ~250M parameters.
*   **② Baseline (larger):** Points to the baseline curve at ~500M parameters.
*   **③ MoD:** Points to the teal curve at ~200M parameters.
*   **④ MoD:** Points to the teal curve at ~500M parameters.

---

## 2. Training Efficiency Charts (Center Panels)

These charts use a shared legend for four specific model configurations:
*   **① Solid Dark Purple:** Baseline (Small)
*   **② Dashed Dark Purple:** Baseline (Large)
*   **③ Solid Teal:** MoD (Small)
*   **④ Dashed Teal:** MoD (Large)

### Top Center: Loss vs. FLOPs
*   **Y-Axis:** Loss (3.0 to 3.6)
*   **X-Axis:** FLOPs (×10^18; 0 to 6)
*   **Trend:** All models show decreasing loss as FLOPs increase. The MoD models (③, ④) achieve lower loss for the same FLOP budget compared to their respective baselines (①, ②).

### Bottom Center: Loss vs. Wall-clock
*   **Y-Axis:** Loss (3.0 to 3.6)
*   **X-Axis:** Wall-clock (0 to 4000)
*   **Trend:** Similar to the FLOPs chart, MoD variants (③, ④) reach lower loss levels faster in terms of wall-clock time than the baselines.

---

## 3. Hardware & Compute Metrics (Right Panels)

### Top Right: FLOPs per Step
*   **Y-Axis:** FLOPs/step (×10^14; 0 to 12)
*   **X-Axis Categories:** Baselines (①, ②), MoD (③, ④)
*   **Data Points:**
    *   **① Baseline:** ~5.8 × 10^14
    *   **② Baseline:** ~11.2 × 10^14
    *   **③ MoD:** ~3.8 × 10^14
    *   **④ MoD:** ~7.2 × 10^14
*   **Observation:** MoD models require roughly a third fewer FLOPs per step than their baseline counterparts (③ vs. ①, ④ vs. ②).
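
The per-step savings can be checked directly from the approximate bar heights listed above. The following is a minimal Python sketch, assuming the small/small (① vs. ③) and large/large (② vs. ④) pairing; the dictionary keys and the `relative_reduction` helper are hypothetical names introduced for illustration, and the inputs are chart readings rather than exact values.

```python
# Approximate FLOPs/step values read from the chart, in units of 1e14.
# Keys are hypothetical labels for configurations 1-4.
flops_per_step = {
    "baseline_small": 5.8,   # configuration 1
    "baseline_large": 11.2,  # configuration 2
    "mod_small": 3.8,        # configuration 3
    "mod_large": 7.2,        # configuration 4
}


def relative_reduction(baseline: float, variant: float) -> float:
    """Fraction of FLOPs/step saved by the variant relative to the baseline."""
    return 1.0 - variant / baseline


small_saving = relative_reduction(
    flops_per_step["baseline_small"], flops_per_step["mod_small"]
)
large_saving = relative_reduction(
    flops_per_step["baseline_large"], flops_per_step["mod_large"]
)

print(f"Small pair saving: {small_saving:.1%}")
print(f"Large pair saving: {large_saving:.1%}")
```

Because the inputs are read off a bar chart, the resulting percentages should be treated as estimates.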

### Bottom Right: Throughput (Steps per Second)
*   **Y-Axis:** Steps/s (TPUv5) (0 to 5)
*   **X-Axis Categories:** Baselines (①, ②), MoD (③, ④)
*   **Data Points:**
    *   **① Baseline:** ~2.7 steps/s
    *   **② Baseline:** ~1.5 steps/s
    *   **③ MoD:** ~4.5 steps/s
    *   **④ MoD:** ~2.1 steps/s
*   **Observation:** MoD models achieve higher throughput on TPUv5 hardware than their baseline counterparts: ③ runs about 1.7× as many steps per second as ①, and ④ about 1.4× as many as ②.
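
The throughput comparison can be reproduced from the chart readings, and multiplying FLOPs/step by steps/s gives a rough FLOPs-per-second figure for each configuration. This is a minimal Python sketch under the same small/small and large/large pairing assumption; the `configs` layout and both helper functions are hypothetical, and all inputs are approximate values read from the two right-hand panels.

```python
# Approximate chart readings; keys are hypothetical labels for configurations 1-4.
configs = {
    "baseline_small": {"flops_per_step": 5.8e14, "steps_per_s": 2.7},
    "baseline_large": {"flops_per_step": 11.2e14, "steps_per_s": 1.5},
    "mod_small": {"flops_per_step": 3.8e14, "steps_per_s": 4.5},
    "mod_large": {"flops_per_step": 7.2e14, "steps_per_s": 2.1},
}


def speedup(baseline: str, variant: str) -> float:
    """Ratio of variant throughput to baseline throughput (steps/s)."""
    return configs[variant]["steps_per_s"] / configs[baseline]["steps_per_s"]


def flops_per_second(name: str) -> float:
    """FLOPs processed per second: FLOPs/step times steps/s."""
    cfg = configs[name]
    return cfg["flops_per_step"] * cfg["steps_per_s"]


small_speedup = speedup("baseline_small", "mod_small")
large_speedup = speedup("baseline_large", "mod_large")

print(f"Small pair speedup: {small_speedup:.2f}x")
print(f"Large pair speedup: {large_speedup:.2f}x")
for name in configs:
    print(f"{name}: {flops_per_second(name):.3e} FLOPs/s")
```

On these approximate readings, the FLOPs-per-second product lands in a similar range for all four configurations, which is consistent with the throughput gain coming mainly from the reduced FLOPs per step; the chart values are too coarse to support a stronger conclusion.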

---

## 4. Summary Table of Numbered Configurations

| ID | Type | Color/Style | Relative Size | Efficiency Note |
| :--- | :--- | :--- | :--- | :--- |
| **①** | Baseline | Dark Purple (Solid) | Small | Standard compute baseline. |
| **②** | Baseline | Dark Purple (Dashed) | Large | Highest FLOPs/step, lowest throughput. |
| **③** | MoD | Teal (Solid) | Small | Lowest FLOPs/step, highest throughput. |
| **④** | MoD | Teal (Dashed) | Large | Fewer FLOPs/step (~7.2 vs. ~11.2 × 10^14) and higher throughput (~2.1 vs. ~1.5 steps/s) than Baseline ②. |