Image 80857b977a2c...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview
INTEL_VERIFIED
# Technical Analysis: Mixture-of-Depths Architecture and Routing Comparisons

This document provides a detailed technical extraction of the provided image, which illustrates the architecture of a "Mixture-of-Depths" (MoD) neural network and compares its routing decisions against "Vanilla Transformer" and "Early-Exit" models.

---

## 1. Architectural Diagram (Left Side)

The left portion of the image depicts the internal flow of two consecutive tokens, $x_i$ and $x_{i+1}$, through a transformer-like layer with conditional routing.

### Components and Flow
*   **Input:** Tokens are labeled as $x_i$ and $x_{i+1}$ at the bottom.
*   **Route Block (Light Green):** Each input passes through a "Route" module. Above this module is a small bar chart representing routing weights.
    *   **Token $x_i$:** Assigned a weight $w=0.41$.
    *   **Token $x_{i+1}$:** Assigned a weight $w=0.65$.
*   **Conditional Path (The Router):**
    *   The router acts as a switch. 
    *   For **$x_i$**, the path through the computational blocks (Normalize, Self-attention, Normalize, MLP) is **faded/greyed out**, indicating this token is "routed around" the blocks. It follows a bypass connection directly to the output.
    *   For **$x_{i+1}$**, the path through the computational blocks is **fully colored**, indicating this token "uses the block."
*   **Computational Blocks:**
    *   **Normalize (Red/Pink):** Two instances per layer.
    *   **Self-attention (Yellow):** Positioned after the first normalization.
    *   **MLP (Purple):** Positioned after the second normalization.
*   **Residual Connections and Operations:**
    *   **$\oplus$ (Addition):** Represents residual skip connections.
    *   **$\otimes$ (Multiplication):** The output of the MLP is multiplied by the routing weight (e.g., $w=0.65$ for $x_{i+1}$) before being added back to the residual stream.
*   **Vertical Progression:** An arrow pointing up labeled "layers" with vertical dots indicates that this structure repeats for subsequent layers.

---

## 2. Routing Decisions Heatmaps (Right Side)

The right side contains three heatmaps comparing how different architectures process sequences across layers.

### Legend (Bottom Right)
*   **Purple Square:** "Use block" (Computation is performed).
*   **Orange Square:** "Route around block" (Computation is skipped/bypassed).

### Main Heatmap: Mixture-of-Depths
*   **Title:** Mixture-of-Depths / Routing Decisions
*   **Y-Axis:** Layer (Vertical)
*   **X-Axis:** Sequence (Horizontal)
*   **Visual Trend:** The map shows a stochastic, "checkerboard-like" pattern of purple and orange. 
*   **Data Observation:** In any given layer (row), only a subset of the sequence (columns) is processed (purple), while the rest are bypassed (orange). This indicates a fixed compute budget per layer distributed dynamically across the sequence.

### Comparison Heatmap: Vanilla Transformer
*   **Title:** Vanilla Transformer
*   **Y-Axis:** Layer
*   **X-Axis:** Sequence
*   **Visual Trend:** The entire heatmap is solid **Purple**.
*   **Data Observation:** Every token in the sequence is processed by every block in every layer. There is no skipping.

### Comparison Heatmap: Early-Exit
*   **Title:** Early-Exit
*   **Y-Axis:** Layer
*   **X-Axis:** Sequence
*   **Visual Trend:** The bottom portion of the map is solid purple, while the top-right portion contains orange blocks.
*   **Data Observation:** Tokens are processed in early layers. As the sequence progresses through deeper layers (moving up the Y-axis), some tokens "exit" early and bypass the remaining computation (turning orange). Unlike MoD, once a token exits, it typically stays exited for the remaining layers.

---

## 3. Summary of Textual Labels

| Category | Extracted Text |
| :--- | :--- |
| **Main Titles** | Mixture-of-Depths, Routing Decisions |
| **Sub-Titles** | Vanilla Transformer, Early-Exit |
| **Axis Labels** | Layer, Sequence, layers |
| **Block Labels** | Route, Normalize, Self-attention, MLP |
| **Variables** | $x_i$, $x_{i+1}$, $w=0.41$, $w=0.65$ |
| **Legend** | Use block, Route around block |
| **Symbols** | $\oplus$ (Addition), $\otimes$ (Multiplication), $\vdots$ (Continuation) |
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

80857b977a2c93b91c9fd18d

FOUND IN PAPERS

EXPERT: gemini-3-flash-free VERSION 1