Image a594a505900d...

EXPERT: gemini-3-flash-free VERSION 1

RUNTIME: nugit/gemini/gemini-3-flash-preview
INTEL_VERIFIED
# Technical Document Extraction: DeepSeek Architecture Overview

This document provides a comprehensive technical breakdown of the provided architectural diagram, which illustrates the components of a Transformer Block, specifically detailing the **DeepSeekMoE** and **Multi-Head Latent Attention (MLA)** mechanisms.

---

## 1. High-Level Structure: Transformer Block $\times L$
The left side of the image depicts the standard macro-architecture of a Transformer layer, which is repeated $L$ times.

### Components and Flow:
*   **Input Path:** Data enters from the bottom.
*   **Sub-layer 1 (Attention):**
    *   **RMSNorm:** The input first passes through Root Mean Square Normalization.
    *   **Attention:** The normalized signal enters the Attention block (detailed in the MLA section).
    *   **Residual Connection:** The original input is added to the output of the Attention block via an element-wise addition operator ($\oplus$).
*   **Sub-layer 2 (Feed-Forward):**
    *   **RMSNorm:** The output of the first residual addition passes through a second normalization layer.
    *   **Feed-Forward Network:** The signal enters the FFN block (detailed in the DeepSeekMoE section).
    *   **Residual Connection:** The input to this sub-layer is added to the FFN output via an element-wise addition operator ($\oplus$).
*   **Output:** The final processed signal exits at the top.

---

## 2. Component Detail: DeepSeekMoE
This section expands on the "Feed-Forward Network" block. It describes a Mixture-of-Experts (MoE) architecture.

### Legend [Top Right]:
*   **Light Blue Box:** Routed Expert
*   **Light Green Box:** Shared Expert

### Data Flow and Components:
1.  **Input Hidden $\mathbf{u}_t$:** Represented as a vector of neurons.
2.  **Routing Mechanism:**
    *   The input is sent to a **Router**.
    *   The Router generates a distribution, and a **Top-$K_r$** selection is made (visualized by a bar chart).
    *   Dashed yellow lines indicate the selection of specific "Routed Experts."
3.  **Expert Processing:**
    *   **Shared Experts (Green):** Labeled $1$ through $N_s$. The input $\mathbf{u}_t$ is sent to all shared experts.
    *   **Routed Experts (Blue):** Labeled $1, 2, 3, 4, \dots, N_r-1, N_r$. Only the experts selected by the Top-$K_r$ gating receive the input.
4.  **Aggregation:**
    *   Outputs from all active experts (Shared and Routed) are multiplied by their respective gating weights ($\otimes$).
    *   The weighted outputs are summed ($\oplus$).
5.  **Output Hidden $\mathbf{h}'_t$:** The final aggregated vector.

---

## 3. Component Detail: Multi-Head Latent Attention (MLA)
This section expands on the "Attention" block, focusing on low-rank latent projections to reduce KV cache overhead.

### Legend [Top Right of MLA Box]:
*   **Diagonal Striped Pattern:** Cached During Inference

### Data Flow and Components:
1.  **Input Hidden $\mathbf{h}_t$:** The base input vector.
2.  **Latent Projections:**
    *   **Query Path:** $\mathbf{h}_t$ is projected into a **Latent $\mathbf{c}_t^Q$**.
    *   **Key-Value Path:** $\mathbf{h}_t$ is projected into a **Latent $\mathbf{c}_t^{KV}$**. Note that $\mathbf{c}_t^{KV}$ is marked as **Cached During Inference**.
3.  **Feature Extraction:**
    *   **From $\mathbf{c}_t^Q$:** Produces $\{\mathbf{q}_{t,i}^C\}$ (content query) and $\{\mathbf{q}_{t,i}^R\}$ (positional query via *apply RoPE*). These are concatenated to form $\{[\mathbf{q}_{t,i}^C; \mathbf{q}_{t,i}^R]\}$.
    *   **From $\mathbf{c}_t^{KV}$:** Produces $\{\mathbf{k}_{t,i}^C\}$ (content key) and $\{\mathbf{v}_{t,i}^C\}$ (content value).
    *   **Positional Key:** A separate path from $\mathbf{h}_t$ applies RoPE to generate $\mathbf{k}_t^R$ (marked as **Cached During Inference**).
    *   **Concatenation:** $\{\mathbf{k}_{t,i}^C\}$ and $\mathbf{k}_t^R$ are concatenated to form $\{[\mathbf{k}_{t,i}^C; \mathbf{k}_t^R]\}$.
4.  **Attention Mechanism:**
    *   The processed Query, Key, and Value sets are fed into the **Multi-Head Attention** block.
5.  **Output Hidden $\mathbf{u}_t$:** The final output vector of the attention mechanism.

---

## 4. Textual Transcriptions

### Labels and Variables:
*   **Transformer Block $\times L$**
*   **RMSNorm**
*   **Attention**
*   **Feed-Forward Network**
*   **DeepSeekMoE**
*   **Output Hidden $\mathbf{h}'_t$**
*   **Shared Expert ($N_s$)**
*   **Routed Expert ($N_r$)**
*   **Router**
*   **Top-$K_r$**
*   **Input Hidden $\mathbf{u}_t$**
*   **Multi-Head Latent Attention (MLA)**
*   **Cached During Inference**
*   **Multi-Head Attention**
*   **Latent $\mathbf{c}_t^Q$**
*   **Latent $\mathbf{c}_t^{KV}$**
*   **apply RoPE**
*   **concatenate**
*   **Input Hidden $\mathbf{h}_t$**

### Mathematical Notations:
*   $\{\mathbf{q}_{t,i}^C\}$ : Content Query
*   $\{\mathbf{q}_{t,i}^R\}$ : RoPE Query
*   $\{[\mathbf{q}_{t,i}^C; \mathbf{q}_{t,i}^R]\}$ : Concatenated Query
*   $\mathbf{k}_t^R$ : RoPE Key
*   $\{\mathbf{k}_{t,i}^C\}$ : Content Key
*   $\{[\mathbf{k}_{t,i}^C; \mathbf{k}_t^R]\}$ : Concatenated Key
*   $\{\mathbf{v}_{t,i}^C\}$ : Content Value
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

a594a505900d2d1d5545e14f

FOUND IN PAPERS

EXPERT: gemini-3-flash-free VERSION 1