# Technical Document Extraction: Transformer Decoder Architecture Diagram
## 1. Overview
This image is a technical flow diagram illustrating the architecture of a decoder-only Transformer, the structure used by GPT-style generative language models. It details the sequence of operations from input embedding to the final probability distribution output.
## 2. Component Segmentation
### Region A: Input Processing (Left)
* **Sine Wave Icon:** Located at the far left, representing the periodic nature of positional signals.
* **Positional Encoding (Green Block):** Provides information about the relative or absolute position of tokens in the sequence.
* **Embedding (Light Blue Block):** Converts input tokens into dense vectors.
* **Summation Operator (Circle with +):** Combines the output of the Embedding and Positional Encoding blocks.
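The input stage above can be sketched in code. This is a minimal NumPy illustration, not code from the diagram itself: it uses the classic sinusoidal positional encoding (the sine-wave icon suggests this variant), and a random matrix stands in for the learned Embedding block.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sine/cosine positional encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The diagram's summation operator: element-wise add of the two blocks' outputs.
seq_len, d_model = 8, 16
token_embeddings = np.random.randn(seq_len, d_model)  # stand-in for the Embedding block
x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Because the encoding is added rather than concatenated, it must share the embedding dimension `d_model`.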
### Region B: Core Processing Block (Center - Dashed Box)
This region is enclosed in a dashed rectangle labeled **"$N$ multi-head attention sub-blocks"**, indicating that the internal sequence repeats $N$ times.
1. **Residual Connection 1:** A path that bypasses the first sub-layer (Norm and Masked Multi-Head Attention), connecting the input of the block to the first internal summation operator.
2. **Norm (Blue Block):** Layer normalization applied to the block input before attention (a pre-norm arrangement).
3. **Masked Multi-Head Attention (Purple Block):** The primary attention mechanism, "masked" to prevent the model from attending to future tokens.
4. **Summation Operator (Circle with +):** Adds the residual connection to the output of the Masked Multi-Head Attention.
5. **Residual Connection 2:** A path that bypasses the second sub-layer (Norm and Feed-Forward), connecting the output of the first summation to the final summation of the block.
6. **Norm (Blue Block):** Second layer normalization.
7. **Feed-Forward (Orange Block):** A position-wise fully connected network, applied identically and independently at every sequence position.
8. **Summation Operator (Circle with +):** Adds the second residual connection to the output of the Feed-Forward block.
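One repetition of this sub-block structure can be sketched as follows. This is a simplified NumPy sketch, not the diagram's exact implementation: attention is single-head with identity Q/K/V projections (the diagram shows multi-head with learned projections), and the feed-forward network uses ReLU as an assumed activation. What it does preserve is the diagram's pre-norm ordering (Norm, then sub-layer, then residual add) and the causal mask.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def masked_self_attention(x):
    """Single-head causal self-attention (multi-head and learned Q/K/V omitted)."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Causal ("masked") attention: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ x

def feed_forward(x, w1, w2):
    """Position-wise two-layer MLP (ReLU assumed)."""
    return np.maximum(0.0, x @ w1) @ w2

def decoder_block(x, w1, w2):
    # Sub-layer 1: Norm -> Masked Attention -> residual add (summation operator).
    x = x + masked_self_attention(layer_norm(x))
    # Sub-layer 2: Norm -> Feed-Forward -> residual add (summation operator).
    x = x + feed_forward(layer_norm(x), w1, w2)
    return x
```

A useful sanity check of the mask: perturbing the last token must leave the outputs at all earlier positions unchanged.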
### Region C: Output Head (Right)
* **Norm (Blue Block):** A final layer normalization applied after the $N$ blocks.
* **Linear (Magenta Block):** A fully connected layer that projects the vector to the vocabulary size.
* **Softmax (Red Block):** Converts the linear output into a probability distribution over the vocabulary.
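The three output-head blocks compose directly. A minimal sketch, where `w_out` is an assumed learned projection matrix of shape `(d_model, vocab_size)`:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def output_head(x, w_out):
    """Region C: final Norm, Linear projection to vocabulary size, Softmax."""
    logits = layer_norm(x) @ w_out    # (seq_len, vocab_size)
    return softmax(logits)
```

Softmax guarantees that each position's row is a valid probability distribution: non-negative entries summing to 1.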
---
## 3. Data Flow and Logic
The diagram follows a linear left-to-right flow with internal skip paths (residual connections).
| Step | Component | Action/Trend |
| :--- | :--- | :--- |
| 1 | **Input** | Embedding and Positional Encoding are summed together. |
| 2 | **Entry to $N$ Blocks** | The combined vector enters the repeating sub-block structure. |
| 3 | **Attention Sub-layer** | Data is normalized, processed via Masked Multi-Head Attention, and then added back to the original input (Residual). |
| 4 | **Feed-Forward Sub-layer** | The result is normalized, processed via a Feed-Forward network, and added back to the sub-layer input (Residual). |
| 5 | **Repetition** | Steps 3 and 4 repeat $N$ times. |
| 6 | **Final Output** | The data undergoes a final Normalization, a Linear transformation, and a Softmax activation to produce the final result. |
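The steps in the table can be written compactly. Writing $x_0$ for the summed embedding, $\mathrm{LN}$ for Norm, and using our own symbol names (the diagram labels only the blocks), each of the $N$ blocks computes, in the diagram's pre-norm order:

$$
y_l = x_l + \mathrm{MaskedMHA}\big(\mathrm{LN}(x_l)\big), \qquad
x_{l+1} = y_l + \mathrm{FF}\big(\mathrm{LN}(y_l)\big), \qquad l = 0, \dots, N-1,
$$

and the output head then produces

$$
P = \mathrm{softmax}\big(\mathrm{LN}(x_N)\, W_{\text{out}}\big),
$$

where $W_{\text{out}}$ is the Linear block's projection to vocabulary size.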
---
## 4. Textual Transcriptions
### Labels and Blocks
* **Positional Encoding** (Green)
* **Embedding** (Light Blue)
* **Norm** (Blue - appears 3 times)
* **Masked Multi-Head Attention** (Purple)
* **Feed-Forward** (Orange)
* **Linear** (Magenta)
* **Softmax** (Red)
### Annotations
* **$N$ multi-head attention sub-blocks**: Text located at the top center of the dashed bounding box.
### Symbols
* **$\oplus$ (Summation)**: Represented by a circle containing a plus sign, appearing 3 times in the main flow.
* **$\sim$ (Sine Wave)**: Icon representing the mathematical basis for positional encoding.
* **$\rightarrow$ (Arrows)**: Indicate the directional flow of data.