Image 7270b5ff30db...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview

# Technical Data Extraction: Expert Load Heatmaps

## Overview
This document contains a series of 12 heatmaps organized in pairs, comparing "Aux-Loss-Based" and "Aux-Loss-Free" routing mechanisms across six layers (Layer 1 through Layer 6) of a Mixture-of-Experts (MoE) neural network. The charts visualize the **Relative Expert Load** across three distinct datasets.

## Metadata and Legend
*   **Language:** English (en)
*   **Metric:** Relative Expert Load
*   **Scale:** 0 to 10 (Linear gradient)
    *   **0 (Light Yellow):** Low load/utilization.
    *   **10 (Dark Red/Maroon):** High load/utilization.
*   **X-Axis (Experts):** 64 discrete experts, labeled 1 through 64.
*   **Y-Axis (Datasets):**
    1.  Wikipedia (en)
    2.  Github
    3.  DM Mathematics
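
The description does not define the metric itself. As a minimal sketch under one common reading, the relative load of an expert can be taken as the number of token slots routed to it divided by the number it would receive under perfectly even routing, so 1.0 means balanced and 10 means ten times the balanced share. The function name, the 1-based indexing, and the toy routing trace below are assumptions for illustration, not values read from the figure.

```python
from collections import Counter


def relative_expert_load(assignments, num_experts):
    """Return per-expert load relative to a perfectly balanced router.

    assignments: iterable of 1-based expert indices, one per routed token slot.
    This definition is an assumption; the source does not state one.
    """
    assignments = list(assignments)
    if not assignments:
        raise ValueError("no routed tokens")
    counts = Counter(assignments)
    balanced = len(assignments) / num_experts  # slots per expert if uniform
    return [counts.get(e, 0) / balanced for e in range(1, num_experts + 1)]


# Toy trace: 8 token slots over 4 experts; expert 2 receives half of them.
loads = relative_expert_load([2, 2, 2, 2, 1, 3, 4, 1], num_experts=4)
print(loads)  # [1.0, 2.0, 0.5, 0.5]
```

Under this definition the values in each dataset row average to 1.0, which is why a uniformly pale row corresponds to balanced routing.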

---

## Comparative Analysis by Layer

### Layer 1
*   **Aux-Loss-Based Layer 1:** Shows a highly uniform, low-intensity distribution (mostly light yellow). Expert 29 (Github) and Expert 54 (DM Mathematics) show slightly higher activation.
*   **Aux-Loss-Free Layer 1:** Shows significantly higher variance and "hotspots."
    *   **DM Mathematics:** High load on Experts 6, 8, and 9.
    *   **Github:** High load on Experts 29 and 43.
    *   **Wikipedia:** Generally uniform but slightly higher intensity than the loss-based counterpart.

### Layer 2
*   **Aux-Loss-Based Layer 2:** Very uniform distribution. The only significant outlier is **DM Mathematics** at Expert 23 (Dark Red).
*   **Aux-Loss-Free Layer 2:** Increased specialization.
    *   **Github:** High load on Experts 19, 49, and 60.
    *   **DM Mathematics:** High load on Experts 4, 43, 46, and 64.

### Layer 3
*   **Aux-Loss-Based Layer 3:** Extremely uniform across Wikipedia and Github. **DM Mathematics** shows a single high-load point at Expert 57.
*   **Aux-Loss-Free Layer 3:**
    *   **Github:** High load on Experts 43, 45, 50, 59, and 64.
    *   **DM Mathematics:** High load on Experts 46, 47, and 57.

### Layer 4
*   **Aux-Loss-Based Layer 4:** Uniform distribution with very minor intensity increases for DM Mathematics around Expert 41.
*   **Aux-Loss-Free Layer 4:**
    *   **Github:** Significant hotspot at Expert 30.
    *   **DM Mathematics:** High load at Experts 1, 41, 42, 47, and 64.

### Layer 5
*   **Aux-Loss-Based Layer 5:** Uniform distribution. Minor intensity for DM Mathematics at Expert 52.
*   **Aux-Loss-Free Layer 5:**
    *   **Github:** High load at Experts 20, 40, 52, and 62.
    *   **DM Mathematics:** High load at Experts 1, 2, 11, 36, 40, and 46.

### Layer 6
*   **Aux-Loss-Based Layer 6:** Uniform distribution. Minor intensity for DM Mathematics at Expert 6 and Expert 52.
*   **Aux-Loss-Free Layer 6:**
    *   **Github:** High load at Expert 1.
    *   **DM Mathematics:** High load at Experts 2, 19, 25, 30, and 61.
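
The per-layer hotspot lists above could be produced mechanically from the underlying load values, had they been available. The sketch below assumes a row of relative loads for one dataset in one layer and flags experts above a cutoff; the cutoff of 5.0 and the sample row are hypothetical and are not read from the figure.

```python
def find_hotspots(loads, threshold=5.0):
    """Return 1-based expert indices whose relative load exceeds threshold.

    The threshold is an illustrative assumption; the source names hotspots
    by visual inspection of color intensity.
    """
    return [i for i, value in enumerate(loads, start=1) if value > threshold]


# Hypothetical 8-expert row (illustrative values only).
row = [0.8, 1.1, 0.4, 7.5, 0.9, 0.2, 9.3, 1.0]
hot = find_hotspots(row)
print(hot)  # [4, 7]
```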

---

## Key Trends and Observations

1.  **Routing Mechanism Contrast:**
    *   **Aux-Loss-Based:** These layers exhibit a "load balancing" effect. The distribution is consistently pale, indicating that the auxiliary loss is successfully forcing the model to spread the computational load across all 64 experts relatively evenly.
    *   **Aux-Loss-Free:** These layers exhibit "Expert Specialization." Without the auxiliary loss constraint, specific experts become highly specialized for specific datasets (Github or DM Mathematics), resulting in dark red hotspots (high load) and many experts with near-zero load.

2.  **Dataset Specificity:**
    *   **Wikipedia (en):** Consistently shows the most uniform load, with the lowest peaks, across all layers and both routing types. It appears to be the "baseline" data that is processed by a wide variety of experts.
    *   **DM Mathematics & Github:** These datasets consistently trigger high-load hotspots in the Aux-Loss-Free models, suggesting they require more specialized computation that the router naturally gravitates toward when unconstrained.

3.  **Expert Utilization:**
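    *   In the **Aux-Loss-Free** configuration, the "hot" expert indices change from layer to layer. For example, Expert 1 carries high DM Mathematics load in Layers 4 and 5, while Expert 1 in Layer 6 is a Github hotspot. Because each MoE layer typically has its own set of experts, a shared index does not identify the same expert across layers, so this pattern points to per-layer specialization rather than a single expert changing roles.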
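
To make the routing contrast concrete, here is a minimal sketch of one widely used auxiliary load-balancing loss, the Switch-Transformer-style form `alpha * N * sum_i(f_i * P_i)`, where `f_i` is the fraction of tokens dispatched to expert `i` and `P_i` is the mean router probability for expert `i`. The description does not state which loss the pictured model uses, so the formulation, the top-1 routing, and all names below are assumptions.

```python
def load_balance_aux_loss(router_probs, top1, num_experts, alpha=0.01):
    """Switch-Transformer-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs: per-token lists of router probabilities (length num_experts)
    top1: per-token index (0-based) of the expert each token was sent to
    Assumed formulation; the source does not specify the loss.
    """
    n_tokens = len(top1)
    f = [0.0] * num_experts  # fraction of tokens dispatched to each expert
    p = [0.0] * num_experts  # mean router probability per expert
    for probs, choice in zip(router_probs, top1):
        f[choice] += 1.0 / n_tokens
        for i, q in enumerate(probs):
            p[i] += q / n_tokens
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Two tokens, two experts: even routing versus everything sent to expert 0.
balanced = load_balance_aux_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], 2, alpha=1.0)
skewed = load_balance_aux_loss([[0.9, 0.1], [0.9, 0.1]], [0, 0], 2, alpha=1.0)
print(balanced, skewed)
```

The loss is smallest when both dispatch fractions and router probabilities are uniform, so minimizing it pushes toward the pale, even rows seen in the Aux-Loss-Based panels; removing it lets the router concentrate load where the data favors it.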

EXPERT: nemotron-free VERSION 3

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

# Technical Document Extraction: Heatmap Analysis of Expert Load Distribution

## **Axes and Labels**
- **X-Axis**: "Relative Expert Load" with numerical markers: `0, 2, 4, 6, 8, 10`.
- **Y-Axis**: Datasets listed in order:
  - `Wikipedia (en)`
  - `GitHub`
  - `DM Mathematics`
- **Legend**: Located at the bottom of the image, representing a color scale from `0` (light yellow) to `10` (dark red).

---

## **Layer Structure**
The image contains **12 heatmaps** organized into **6 pairs** of layers:
1. **Aux-Loss-Based Layer 1**
   - **Aux-Loss-Free Layer 1**
2. **Aux-Loss-Based Layer 2**
   - **Aux-Loss-Free Layer 2**
3. **Aux-Loss-Based Layer 3**
   - **Aux-Loss-Free Layer 3**
4. **Aux-Loss-Based Layer 4**
   - **Aux-Loss-Free Layer 4**
5. **Aux-Loss-Based Layer 5**
   - **Aux-Loss-Free Layer 5**
6. **Aux-Loss-Based Layer 6**
   - **Aux-Loss-Free Layer 6**

Each heatmap uses a **color gradient** to represent the relative expert load, with darker red indicating higher load and lighter yellow indicating lower load.

---

## **Key Trends and Data Points**
### **Aux-Loss-Based Layers**
- **Layer 1**:
  - Dark red clusters observed in `Wikipedia (en)` and `DM Mathematics` at `x ≈ 6-8`.
  - Light yellow dominates `GitHub` across the x-axis.
- **Layer 2**:
  - Dark red peaks in `Wikipedia (en)` at `x ≈ 4-6` and `DM Mathematics` at `x ≈ 10`.
  - `GitHub` shows moderate load (`x ≈ 2-4`).
- **Layer 3**:
  - Dark red in `Wikipedia (en)` at `x ≈ 8-10` and `DM Mathematics` at `x ≈ 6`.
  - `GitHub` has sparse dark red (`x ≈ 2`).
- **Layer 4**:
  - Dark red in `Wikipedia (en)` at `x ≈ 4-6` and `DM Mathematics` at `x ≈ 10`.
  - `GitHub` shows minimal load (`x ≈ 0-2`).
- **Layer 5**:
  - Dark red in `Wikipedia (en)` at `x ≈ 6-8` and `DM Mathematics` at `x ≈ 4`.
  - `GitHub` has moderate load (`x ≈ 2-4`).
- **Layer 6**:
  - Dark red in `Wikipedia (en)` at `x ≈ 2-4` and `DM Mathematics` at `x ≈ 10`.
  - `GitHub` shows sparse load (`x ≈ 0-2`).

### **Aux-Loss-Free Layers**
- **Layer 1**:
  - Dark red in `Wikipedia (en)` at `x ≈ 6-8` and `DM Mathematics` at `x ≈ 4`.
  - `GitHub` has minimal load (`x ≈ 0-2`).
- **Layer 2**:
  - Dark red in `Wikipedia (en)` at `x ≈ 4-6` and `DM Mathematics` at `x ≈ 10`.
  - `GitHub` shows moderate load (`x ≈ 2-4`).
- **Layer 3**:
  - Dark red in `Wikipedia (en)` at `x ≈ 8-10` and `DM Mathematics` at `x ≈ 6`.
  - `GitHub` has sparse load (`x ≈ 2`).
- **Layer 4**:
  - Dark red in `Wikipedia (en)` at `x ≈ 4-6` and `DM Mathematics` at `x ≈ 10`.
  - `GitHub` shows minimal load (`x ≈ 0-2`).
- **Layer 5**:
  - Dark red in `Wikipedia (en)` at `x ≈ 6-8` and `DM Mathematics` at `x ≈ 4`.
  - `GitHub` has moderate load (`x ≈ 2-4`).
- **Layer 6**:
  - Dark red in `Wikipedia (en)` at `x ≈ 2-4` and `DM Mathematics` at `x ≈ 10`.
  - `GitHub` shows sparse load (`x ≈ 0-2`).

---

## **Legend and Color Mapping**
- **Color Scale**:
  - `0` (light yellow) to `10` (dark red).
  - Intermediate values (e.g., `2`, `4`, `6`, `8`) correspond to progressively darker shades of red.
- **Spatial Grounding**: The legend is positioned at the **bottom** of the image, spanning the full width of the heatmaps.

---

## **Component Isolation**
1. **Header**: Layer titles (e.g., "Aux-Loss-Based Layer 1").
2. **Main Chart**: Heatmaps for each layer, with datasets on the y-axis and relative expert load on the x-axis.
3. **Footer**: Legend with color scale.

---

## **Cross-Reference Verification**
- **Legend Colors**: All dark red cells in the heatmaps correspond to values near `10`, while light yellow cells align with `0`.
- **Data Consistency**: No discrepancies observed between legend colors and heatmap intensities.

---

## **Summary**
The heatmaps illustrate the distribution of **relative expert load** across three datasets (`Wikipedia (en)`, `GitHub`, `DM Mathematics`) for six layers of two types: **Aux-Loss-Based** and **Aux-Loss-Free**. Dark red regions indicate higher expert load, while light yellow regions indicate lower load. The patterns suggest that `Wikipedia (en)` and `DM Mathematics` exhibit more pronounced load variations compared to `GitHub`, which generally shows lower or more uniform load across layers.
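
A statement about which dataset shows "more pronounced load variations" can be checked numerically rather than by eye. One simple option, sketched here under the assumption that each dataset row is available as a list of relative loads, is the coefficient of variation (population standard deviation divided by the mean). The two rows are hypothetical and are not taken from the figure.

```python
from statistics import mean, pstdev


def load_variation(loads):
    """Coefficient of variation of a row of relative loads (0.0 means even)."""
    m = mean(loads)
    return pstdev(loads) / m if m else 0.0


# Hypothetical rows: one perfectly even, one dominated by a single expert.
rows = {
    "even": [1.0, 1.0, 1.0, 1.0],
    "spiky": [0.2, 0.2, 0.2, 3.4],
}
for name, row in rows.items():
    print(name, round(load_variation(row), 3))
```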

TECHNICAL ASSET FINGERPRINT

7270b5ff30db6197173b8eec
