## Diagram: Transformer Information Flow and Edge Types
### Overview
This image contains three side-by-side diagrams illustrating how information flows through the layers of a Transformer-style neural network. The diagrams demonstrate how two types of operations—Attention edges and Feed-Forward Network (FFN) edges—extract, move, and process information across the tokens of a sequence to produce an output prediction.
### Components and Notation
* **Nodes (Blue Circles):** Represent hidden states or token representations at specific positions and layers.
* Notation: $x_t^l$ where $x$ is the state, subscript $t$ (or $t-1$, $t-3$, $s$) indicates the sequence position (time step), and superscript $l$ (or $l+1$, $l+2$) indicates the network layer.
* **Input/Output Boxes (Blue Rectangles):** Located at the very bottom (inputs) and very top (outputs) of each panel, containing the text tokens.
* **Green Arrows:** Represent "Attention edges," which move information between different token positions across layers.
* **Red Arrows:** Represent "FFN edges" (Feed-Forward Network), which process information within the same token position across layers.
* **Text Annotations:** Accompanying each arrow, detailing the operation:
    * **$e$**: Edge notation (e.g., $\boldsymbol{e}_{t-1, t}^{l, k}$). Subscripts give the source and target positions; superscripts give the layer and the attention head index (or, for FFN edges such as $\boldsymbol{e}_t^{l+1, m}$, the index $m$ of the FFN unit).
* **qk**: Represents the Query-Key matching mechanism (what the attention head is looking for).
* **q** / **k**: Explicitly separated Query and Key (seen in Panel 3).
* **o**: Represents the Output or value retrieved and moved by the edge. Text in light blue indicates specific token values.
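The `qk` / `o` split in the annotations can be made concrete with a toy NumPy sketch of a single attention edge. All matrices below are random stand-ins, not real model weights, and the single-source case sidesteps the softmax over multiple positions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Hypothetical weight matrices for one attention head h at layer l.
W_Q, W_K, W_V, W_O = (rng.standard_normal((d, d)) * 0.3 for _ in range(4))

x_src = rng.standard_normal(d)  # x_s^l  (source position)
x_tgt = rng.standard_normal(d)  # x_t^l  (target position)

# "qk": how strongly the target's query matches the source's key.
qk_score = (W_Q @ x_tgt) @ (W_K @ x_src)

# "o": the content moved along the edge e_{s,t}^{l,h} — the source's
# value, projected through the head's output matrix.
o = W_O @ (W_V @ x_src)

# The edge's contribution to x_t^{l+1} is the attention weight times o;
# with a single source position the softmax weight is 1.
contribution = 1.0 * o
```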
---
### Detailed Analysis
The image is divided into three distinct vertical panels. Information flows from the bottom (inputs) to the top (outputs).
#### Panel 1 (Left): FFN as Key-Value Memory
* **Spatial Layout:** Bottom inputs are "capital", "China", "is". The target output at the top is "Beijing".
* **Layer $l$ (Bottom Nodes):** Three nodes corresponding to the inputs: $x_{t-3}^l$ (capital), $x_{t-1}^l$ (China), and $x_t^l$ (is).
* **Attention to Layer $l+1$:**
* A green arrow points from $x_{t-3}^l$ to $x_t^{l+1}$.
* Label: Attention edge $\boldsymbol{e}_{t-3, t}^{l, h}$
* **qk**: relation
* **o**: capital
* A green arrow points from $x_{t-1}^l$ to $x_t^{l+1}$.
* Label: Attention edge $\boldsymbol{e}_{t-1, t}^{l, k}$
* **qk**: topic
* **o**: China
* **FFN to Layer $l+2$:**
* A red arrow points straight up from $x_t^{l+1}$ to $x_t^{l+2}$.
* Label: FFN edge $\boldsymbol{e}_t^{l+1, m}$
* **qk**: (China, capital)
* **o**: Beijing
* **Output:** The node $x_t^{l+2}$ leads to the final output box: "Beijing".
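The Panel 1 mechanism—attention gathers "capital" and "China" at position $t$, then the FFN edge acts as a key-value lookup—can be sketched as a toy in NumPy. The one-hot "vocabulary" directions, keys, and values below are invented stand-ins, not actual model weights:

```python
import numpy as np

d, n_mem = 4, 3  # toy hidden size and number of FFN "memories"

# Hypothetical one-hot directions for tokens in the residual stream.
vocab = {t: np.eye(d)[i] for i, t in enumerate(["capital", "China", "Beijing", "is"])}

# Keys (rows of the FFN input matrix): memory m fires when its key matches x.
# Memory 0's key looks for the combined (China, capital) pattern.
keys = np.zeros((n_mem, d))
keys[0] = vocab["capital"] + vocab["China"]

# Values (columns of the FFN output matrix): what each active memory writes back.
values = np.zeros((d, n_mem))
values[:, 0] = vocab["Beijing"]

def ffn(x):
    # ReLU(keys @ x) gates each memory; values of active memories are summed.
    return values @ np.maximum(keys @ x, 0.0)

# After attention has moved "capital" and "China" to position t:
x_t = vocab["capital"] + vocab["China"]
out = ffn(x_t)
best = max(vocab, key=lambda t: out @ vocab[t])  # -> "Beijing"
```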
#### Panel 2 (Middle): Attention for Knowledge Retrieval
* **Spatial Layout:** Bottom inputs are "capital", "China", "is". The target output at the top is "Beijing".
* **Layer $l$ (Bottom Nodes):** Three nodes: $x_{t-3}^l$, $x_{t-1}^l$, $x_t^l$.
* **Attention to Layer $l+1$:**
* A green arrow points from $x_{t-3}^l$ to $x_t^{l+1}$.
* Label: Attention edge $\boldsymbol{e}_{t-3, t}^{l, h}$
* **qk**: relation
* **o**: Paris, Beijing
* **Attention to Layer $l+3$:**
* A node $x_{t-1}^{l+2}$ exists in the middle-left. A green arrow points from $x_{t-1}^{l+2}$ to $x_t^{l+3}$.
* Label: Attention edge $\boldsymbol{e}_{t-1, t}^{l+2, k}$
* **qk**: country
* **o**: panda, Beijing
* **Output:** The node $x_t^{l+3}$ leads to the final output box: "Beijing".
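Panel 2's alternative—the fact lives in the "China" token's representation, and an attention edge retrieves it—can be sketched with invented one-hot concept directions and rank-one QK/OV matrices for a single hypothetical head:

```python
import numpy as np

# Hypothetical one-hot concept directions in a 3-d toy residual stream.
idx = {c: i for i, c in enumerate(["country", "panda", "Beijing"])}
e = np.eye(3)

# The "China" token at layer l+2 encodes the "country" attribute plus
# associated facts ("panda", "Beijing"), matching the diagram's o label.
x_china = e[idx["country"]] + e[idx["panda"]] + e[idx["Beijing"]]

# QK circuit: a query about "country" matches the "country" key component.
W_QK = np.outer(e[idx["country"]], e[idx["country"]])
# OV circuit: only the stored fact components are copied forward.
W_OV = np.outer(e[idx["panda"]], e[idx["panda"]]) + np.outer(e[idx["Beijing"]], e[idx["Beijing"]])

x_t = e[idx["country"]]        # current position asks about a country
score = x_t @ W_QK @ x_china   # qk: positive match -> head attends to "China"
moved = W_OV @ x_china         # o: "panda" and "Beijing" are routed forward
```

Later layers (or the unembedding) would then select "Beijing" over "panda" for this prompt; the sketch only shows the routing step.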
#### Panel 3 (Right): Induction Head Mechanism
* **Spatial Layout:** Bottom inputs are abstract tokens: "[a]", "[b]", "[a]". The target output at the top is "[b]".
* **Layer $l$ (Bottom Nodes):** Three nodes: $x_{s-1}^l$ (above first [a]), $x_s^l$ (above [b]), $x_t^l$ (above second [a]).
* **Attention to Layer $l+1$:**
* A green arrow points from $x_{s-1}^l$ to $x_s^{l+1}$.
* Label: Attention edge $\boldsymbol{e}_{s-1, s}^{l, h}$
* **q**: previous position
* **k**: current position
* **o**: [a]
* **Intermediate Step:** Node $x_s^{l+1}$ connects vertically to $x_s^{l+2}$ (no explicit edge label, implying a pass-through or standard FFN processing without cross-token movement).
* **Attention to Layer $l+3$:**
* A green arrow points from $x_s^{l+2}$ to $x_t^{l+3}$.
* Label: Attention edge $\boldsymbol{e}_{s, t}^{l+2, k}$
* **qk**: [a]
* **o**: [b]
* **Output:** The node $x_t^{l+3}$ leads to the final output box: "[b]".
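Panel 3's two-step mechanism can be sketched as plain Python over a token list—a behavioral caricature of the two attention edges, not actual attention arithmetic:

```python
def induction_predict(tokens):
    """Toy induction mechanism over a token list.

    Step 1 (previous-token head, layer l): attach to each position the
    token that preceded it.
    Step 2 (induction head, layer l+2): from the last position, attend to
    positions whose *previous* token equals the current token, and copy
    the token found there.
    """
    prev = [None] + tokens[:-1]               # step 1: previous-token info
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # step 2: search the past
        if prev[i] == current:                # k matches q = current token
            return tokens[i]                  # o: copy the token that followed
    return None

# [a] [b] ... [a]  -> predicts [b]
print(induction_predict(["a", "b", "c", "a"]))  # -> "b"
```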
---
### Key Observations
1. **Color Coding:** Green is strictly used for cross-positional information routing (Attention), while red is used for same-position processing (FFN).
2. **Query-Key-Value Paradigm:** The annotations explicitly break down operations into what the node is looking for (`qk` or `q`/`k`) and what information is actually retrieved and passed forward (`o`).
3. **Abstract vs. Concrete:** Panels 1 and 2 use concrete linguistic examples ("China", "capital", "Beijing") to demonstrate factual recall. Panel 3 uses abstract variables (`[a]`, `[b]`) to demonstrate a structural pattern-matching mechanism.
### Interpretation
This diagram is a highly technical illustration from the field of **Mechanistic Interpretability** of Large Language Models (LLMs). It visualizes how different components of a Transformer model contribute to next-token prediction.
* **Panel 1 demonstrates FFNs as Factual Memories:** It shows that Attention heads gather context from previous tokens (moving "capital" and "China" to the current token position "is"). Once that context is gathered at position $t$, the Feed-Forward Network (the red arrow) acts as a key-value lookup. The "key" is the combined concept of `(China, capital)`, and the FFN "value" output is the factual answer `Beijing`.
* **Panel 2 demonstrates Attention as Factual Routing:** Alternatively, factual knowledge might be stored in the representations of earlier tokens. Here, the attention mechanism looks back at previous tokens to find overlapping concepts (e.g., looking for "country" and finding "panda, Beijing" at the "China" token), routing the correct factual answer forward to the current prediction node.
* **Panel 3 demonstrates an "Induction Head":** This is a well-documented phenomenon in LLMs used for in-context learning. The sequence is `[a] [b] ... [a]`. The model needs to predict what comes after the second `[a]`.
    * The first attention edge (bottom) is a "previous-token" head: its query targets the preceding position, so it copies the identity of the first `[a]` (position $s-1$) into the residual stream at position $s$, the position of `[b]`.
    * When the model encounters the second `[a]` (at position $t$), the second attention edge (top) searches the past for a position whose stored previous-token is `[a]` (the `qk: [a]` match). It finds position $s$, retrieves the token there (`[b]`, the `o` output), and copies `[b]` to the current position to make the prediction. This explains how LLMs learn to continue repeating patterns within a prompt.