## Heatmap Series: Qwen3 Model Layer Activation Patterns
### Overview
The image displays three horizontally arranged heatmaps, each visualizing activation patterns across layers and components for a different base-variant size of the Qwen3 language model. The heatmaps use a color gradient to represent numerical values, likely indicating activation intensity, importance, or some normalized metric.
### Components/Axes
**Titles (Top of each heatmap):**
1. Left: `Qwen3-1.7B-Base`
2. Center: `Qwen3-4B-Base`
3. Right: `Qwen3-8B-Base`
**Y-Axis (Vertical, Left side of each heatmap):**
* **Label:** `Layer`
* **Scale:** Represents the model's layer index, starting from 0 at the bottom.
* Qwen3-1.7B-Base: Layers 0 to 27.
* Qwen3-4B-Base: Layers 0 to 34.
* Qwen3-8B-Base: Layers 0 to 34.
**X-Axis (Horizontal, Bottom of each heatmap):**
* **Labels (Identical for all three heatmaps, from left to right):**
1. `mlp.down_proj`
2. `mlp.gate_proj`
3. `mlp.up_proj`
4. `self_attn.k_proj`
5. `self_attn.q_proj`
6. `self_attn.v_proj`
7. `self_attn.o_proj`
8. `post_attention_layernorm`
9. `input_layernorm`
10. `lm_head`
**Color Legend (Far right of the image):**
* A vertical color bar.
* **Scale:** Ranges from `0` (bottom, light green) to `1` (top, dark blue).
* **Interpretation:** The color of each cell in the heatmaps corresponds to a value on this scale. Darker blue indicates a value closer to 1, while lighter green indicates a value closer to 0.
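The legend's `0`-to-`1` range suggests the underlying metric was normalized before plotting. As a minimal sketch of one plausible scheme (min-max scaling is an assumption here; the figure does not state how the values were normalized):

```python
import numpy as np

def min_max_normalize(values: np.ndarray) -> np.ndarray:
    """Rescale raw metric values into the [0, 1] range of the color bar.

    Hypothetical reconstruction: min-max scaling is one plausible choice,
    not necessarily the one used to produce the figure.
    """
    vmin, vmax = float(values.min()), float(values.max())
    if vmax == vmin:  # constant input: map everything to 0
        return np.zeros_like(values, dtype=float)
    return (values - vmin) / (vmax - vmin)
```

Under this scheme, the darkest-blue cell corresponds to the maximum of the raw metric and the lightest-green cell to its minimum.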
### Detailed Analysis
**General Pattern Across All Models:**
* **MLP gradient:** For all three models, the leftmost columns (the MLP projections: `mlp.down_proj`, `mlp.gate_proj`, `mlp.up_proj`) show a strong vertical gradient: darkest blue (high values) in the lowest layers, gradually lightening (lower values) toward the highest layers.
* The middle columns (self-attention projections: `self_attn.k_proj`, `self_attn.q_proj`, `self_attn.v_proj`, `self_attn.o_proj`) show a more complex, patchy pattern with moderate values concentrated in the lower-to-middle layers.
* The rightmost columns (`post_attention_layernorm`, `input_layernorm`, `lm_head`) are uniformly very light green (values near 0) across all layers in all models.
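If the underlying (layers × components) matrix were available, these qualitative trends could be checked programmatically. A sketch, assuming a NumPy array with row 0 as layer 0 and columns ordered as on the x-axis (the `0.1` cutoff is illustrative, not taken from the figure):

```python
import numpy as np

COMPONENTS = [
    "mlp.down_proj", "mlp.gate_proj", "mlp.up_proj",
    "self_attn.k_proj", "self_attn.q_proj", "self_attn.v_proj",
    "self_attn.o_proj", "post_attention_layernorm",
    "input_layernorm", "lm_head",
]

def check_trends(matrix: np.ndarray) -> dict:
    """Check the qualitative patterns described above.

    `matrix` has shape (num_layers, len(COMPONENTS)); row 0 is layer 0.
    """
    mlp = matrix[:, :3]         # the three MLP projection columns
    norms_head = matrix[:, 7:]  # layernorm and lm_head columns
    half = matrix.shape[0] // 2
    return {
        # MLP columns: lower half of the network darker (higher) on average
        "mlp_gradient": bool(mlp[:half].mean() > mlp[half:].mean()),
        # rightmost columns near zero at every layer
        "norms_near_zero": bool(norms_head.max() < 0.1),
    }
```

On a matrix matching the figure's description, both checks would return `True`.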
**Model-Specific Details:**
1. **Qwen3-1.7B-Base (28 layers):**
* The highest values (darkest blue) are concentrated in the `mlp.down_proj` and `mlp.gate_proj` columns within layers 0-10.
* The `self_attn.q_proj` column shows a notable cluster of moderate-to-high values in layers 0-15.
* Activation values diminish significantly above layer 20 for most components.
2. **Qwen3-4B-Base (35 layers):**
* The pattern is similar to the 1.7B model but extended over more layers.
* The high-value region for MLP projections (`mlp.down_proj`, `mlp.gate_proj`) extends slightly higher, up to around layer 15.
* The self-attention components show a more dispersed pattern of moderate values in the lower half of the network.
3. **Qwen3-8B-Base (35 layers):**
* The high-value region for MLP projections is the most extensive, with dark blue cells persisting up to layer 20 in the `mlp.down_proj` column.
* The self-attention components, particularly `self_attn.q_proj`, show a broader distribution of moderate values across the lower 25 layers compared to the smaller models.
* The overall contrast between high-value (blue) and low-value (green) regions appears slightly more pronounced.
### Key Observations
1. **Component Hierarchy:** MLP projection layers (`down_proj`, `gate_proj`, `up_proj`) consistently exhibit the highest values, especially in the lower network layers, across all model sizes.
2. **Layer Progression:** There is a clear depth gradient: activation/importance is highest in the initial processing layers and decreases toward the final layers.
3. **Normalization Invariance:** The layernorm components (`post_attention_layernorm`, `input_layernorm`) and the language modeling head (`lm_head`) show negligible values (near 0) throughout, suggesting they are not the focus of this particular metric.
4. **Scaling Effect:** As model size increases (1.7B -> 4B -> 8B), the region of high activation in the MLP layers extends to a higher layer index, indicating that larger models may distribute these computations deeper into the network.
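The scaling effect could be quantified by finding, for each model, the deepest layer whose MLP cell still exceeds some cutoff. A hypothetical helper (the threshold is illustrative; the figure provides no raw numbers):

```python
import numpy as np

def deepest_high_layer(column: np.ndarray, threshold: float = 0.5) -> int:
    """Return the highest layer index whose cell value exceeds `threshold`.

    One possible way to quantify how far the dark (high-value) region of an
    MLP column extends into the network; returns -1 if no layer qualifies.
    """
    idx = np.flatnonzero(column > threshold)
    return int(idx.max()) if idx.size else -1
```

Applied to the `mlp.down_proj` column of each model, this index would be expected to grow from the 1.7B to the 8B model, mirroring the observation above.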
### Interpretation
This visualization likely represents the **relative importance or activation magnitude** of different weight matrices within the Qwen3 transformer architecture. The data suggests a fundamental architectural insight:
* **Core Computational Load:** The dense MLP layers (`down_proj`, `gate_proj`, `up_proj`) are the primary sites of high-magnitude processing, particularly in the early stages of the network where raw input is being transformed into a more useful representation.
* **Attention's Role:** Self-attention mechanisms show significant but more distributed activity, indicating their role in integrating information across the sequence, which may be less concentrated in specific layers compared to the MLP transformations.
* **Scaling Law Manifestation:** The extension of high MLP activation into deeper layers in larger models could be a visual correlate of scaling laws, where increased model capacity allows for more sustained, complex processing throughout the network depth.
* **Architectural Constants:** The near-zero values for normalization layers and the LM head are expected, as these components typically perform scaling and final projection rather than being sites of high-magnitude feature transformation.
**In essence, the heatmaps provide a "fingerprint" of computational focus, showing that the Qwen3 models, regardless of size, rely heavily on early-layer MLP processing, with the spatial extent of this focus growing with model scale.**