## Heatmap Comparison: Benign vs. Jailbreak Activation Patterns Across Language Models
### Overview
The image displays a 3×3 grid of heatmaps comparing the internal activation patterns of three large language models (LLMs) under two conditions: "Benign" (standard prompts) and "Jailbreak" (adversarial prompts designed to bypass safety filters). The third column shows the "Difference" between the two conditions for each model. Each heatmap plots "Layer" (y-axis) against "Token Position" (x-axis), with color intensity representing a numerical value (likely activation magnitude or a related metric).
### Components/Axes
* **Grid Structure:** 3 rows (Models) × 3 columns (Conditions).
* **Row Labels (Left Side):**
* Row 1: **GPT-JT-6B**
* Row 2: **LLaMA-3.1-8B**
* Row 3: **Mistral-7B**
* **Column Headers (Top):**
* Column 1: **Benign**
* Column 2: **Jailbreak**
* Column 3: **Difference**
* **Axes (Per Heatmap):**
    * **Y-axis:** Label: **Layer**. Scale: 0 at top, increasing downward. Ticks: 0, 4, 8, 12, 16, 20, 24, 28 for LLaMA-3.1-8B and Mistral-7B; the GPT-JT-6B scale ends at 24.
* **X-axis:** Label: **Token Position**. Scale: 0 at left, increasing rightward. Ticks: 0, 64, 128, 192, 256, 320, 384, 448.
* **Color Bars (Legends):** Located to the right of each individual heatmap.
* **GPT-JT-6B (Benign & Jailbreak):** Scale from ~4 (dark purple) to ~8 (bright yellow).
* **GPT-JT-6B (Difference):** Scale from -2 (dark blue) to +2 (dark red). Zero is white/light gray.
* **LLaMA-3.1-8B (Benign & Jailbreak):** Scale from 0 (dark purple) to 6 (bright yellow).
* **LLaMA-3.1-8B (Difference):** Scale from -0.2 (dark blue) to +0.2 (dark red). Zero is white.
* **Mistral-7B (Benign & Jailbreak):** Scale from -1 (dark purple) to 5 (bright yellow).
* **Mistral-7B (Difference):** Scale from -2.0 (dark blue) to +2.0 (dark red). Zero is white.
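A layout like the one specified above could be reconstructed in matplotlib. The sketch below is hypothetical: the activation arrays, layer counts, and value ranges are synthetic placeholders, not the original data, and the colormap names are assumptions about the figure's palette (a sequential map for Benign/Jailbreak, a diverging map with symmetric limits for Difference so that zero renders white).

```python
# Hypothetical sketch of the 3x3 grid; all data below is synthetic.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Model -> number of layers shown on its y-axis (per the tick description).
models = {"GPT-JT-6B": 24, "LLaMA-3.1-8B": 28, "Mistral-7B": 28}

fig, axes = plt.subplots(3, 3, figsize=(12, 9), constrained_layout=True)
for i, (model, layers) in enumerate(models.items()):
    tokens = 448
    # Placeholder activations: a depth-wise gradient plus mild noise.
    benign = np.linspace(0, 5, layers)[:, None] + rng.normal(0, 0.1, (layers, tokens))
    jailbreak = benign + rng.normal(0, 0.05, (layers, tokens))
    diff = jailbreak - benign

    panels = [("Benign", benign, "viridis"),
              ("Jailbreak", jailbreak, "viridis"),
              ("Difference", diff, "coolwarm")]
    for j, (name, data, cmap) in enumerate(panels):
        ax = axes[i, j]
        if name == "Difference":
            # Symmetric limits so zero maps to white on the diverging colormap.
            vmax = float(np.abs(data).max())
            im = ax.imshow(data, aspect="auto", cmap=cmap, vmin=-vmax, vmax=vmax)
        else:
            im = ax.imshow(data, aspect="auto", cmap=cmap)
        fig.colorbar(im, ax=ax)  # per-heatmap color bar, as in the figure
        if i == 0:
            ax.set_title(name)
        if j == 0:
            ax.set_ylabel(f"{model}\nLayer")
        if i == 2:
            ax.set_xlabel("Token Position")

fig.savefig("activation_grid.png")
```

Giving each heatmap its own color bar, as the figure does, preserves per-model contrast at the cost of direct cross-model comparison; a shared scale per column would make the opposite trade-off.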
### Detailed Analysis
**1. GPT-JT-6B (Top Row)**
* **Benign Heatmap:** Shows a strong, consistent gradient. Values are lowest (dark purple, ~4) at Layer 0 across all token positions. Values increase steadily with layer depth, reaching the highest values (bright yellow, ~8) in the deepest layers (20-24). The pattern is uniform across token positions.
* **Jailbreak Heatmap:** Shows a similar but muted pattern. The gradient from low (purple) to high (green/yellow) with depth is present, but the overall intensity is lower. The deepest layers reach a medium green (~6-7), not the bright yellow seen in the Benign condition.
* **Difference Heatmap (Jailbreak - Benign):** Dominated by blue tones, indicating the Jailbreak condition has *lower* values than Benign across nearly all layers and token positions. The strongest negative difference (deepest blue, ~-2) occurs in the middle-to-deep layers (approx. 8-20). The difference is less pronounced in the very first and very last layers.
**2. LLaMA-3.1-8B (Middle Row)**
* **Benign Heatmap:** Shows a clear vertical gradient. Values are lowest (dark purple, 0) at Layer 0. They increase with depth, but the increase is not perfectly uniform. The highest values (yellow, ~6) appear in a band around layers 16-24. The pattern is largely consistent across token positions.
* **Jailbreak Heatmap:** Visually very similar to the Benign heatmap. The same vertical gradient and band of high activation in deep layers are present.
* **Difference Heatmap (Jailbreak - Benign):** Reveals subtle but systematic differences. The pattern is horizontally banded:
* **Early Layers (0-8):** Predominantly blue (negative difference, Jailbreak < Benign), with the strongest negative values (~-0.2) around layers 4-8.
* **Middle Layers (8-16):** A mix, with a notable band of red (positive difference, Jailbreak > Benign) around layers 10-14.
* **Deep Layers (16-28):** Strongly red (positive difference), with the highest values (~+0.2) concentrated in the deepest layers (24-28). This indicates Jailbreak activations are *higher* than Benign in the model's final layers.
**3. Mistral-7B (Bottom Row)**
* **Benign Heatmap:** Shows a smooth vertical gradient. Lowest values (dark purple, ~-1) at Layer 0, increasing to highest values (yellow, ~5) in the deepest layers (24-28). The pattern is uniform across token positions.
* **Jailbreak Heatmap:** Visually almost identical to the Benign heatmap. The same gradient and intensity are observed.
* **Difference Heatmap (Jailbreak - Benign):** Appears almost entirely white/very light orange, indicating near-zero difference across the entire layer-token space. The color bar ranges from -2 to +2, but the heatmap values are clustered very close to 0. There is a very faint, diffuse positive (light orange) tint in some middle-to-deep layers, but the magnitude is negligible compared to the scale.
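The three qualitative signatures just described (global suppression, layered redistribution, near-zero change) could be separated numerically from the Difference maps themselves. The sketch below is illustrative only: the arrays are synthetic stand-ins for the three patterns, and the classification thresholds are assumptions, not values taken from the figure.

```python
import numpy as np

def classify_difference(diff, eps=0.05):
    """Crude heuristic: label a (layers x tokens) difference map based on
    the signs of its per-layer means. Thresholds are illustrative."""
    layer_means = diff.mean(axis=1)        # average over token positions
    neg = (layer_means < -eps).mean()      # fraction of clearly negative layers
    pos = (layer_means > eps).mean()       # fraction of clearly positive layers
    if neg > 0.5 and pos < 0.1:
        return "suppression"
    if pos > 0.5 and neg < 0.1:
        return "enhancement"
    if neg > 0.2 and pos > 0.2:
        return "redistribution"
    return "no-change"

rng = np.random.default_rng(1)
layers, tokens = 28, 448
noise = lambda: rng.normal(0, 0.01, (layers, tokens))

# Synthetic stand-ins for the three observed patterns.
gptjt = -1.0 * np.ones((layers, tokens)) + noise()         # globally negative
llama = np.linspace(-0.2, 0.2, layers)[:, None] + noise()  # early neg, late pos
mistral = noise()                                          # near zero

print(classify_difference(gptjt))    # suppression
print(classify_difference(llama))    # redistribution
print(classify_difference(mistral))  # no-change
```

Averaging over token positions before thresholding exploits the horizontal uniformity of the maps; for a figure with token-dependent structure, a 2D criterion would be needed.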
### Key Observations
1. **Model-Specific Response to Jailbreak:** The three models exhibit fundamentally different internal activation responses to jailbreak prompts.
* **GPT-JT-6B:** Shows a global *suppression* of activations (blue Difference map).
* **LLaMA-3.1-8B:** Shows a *redistribution* of activations—suppressed in early layers, enhanced in deep layers (banded blue/red Difference map).
* **Mistral-7B:** Shows *minimal change* in activation patterns (near-white Difference map).
2. **Layer-Wise Sensitivity:** For LLaMA-3.1-8B, the most significant positive changes (Jailbreak > Benign) occur in the final layers, suggesting these layers are most affected by the adversarial prompt.
3. **Token Position Invariance:** Across all models and conditions, the activation patterns are remarkably consistent along the horizontal (Token Position) axis. The primary variation is vertical (Layer-wise).
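The token-position invariance noted in observation 3 could be checked quantitatively by comparing variation along the two axes. A minimal sketch on a synthetic map (again, not the original data), where a depth-wise gradient dominates mild token-level noise:

```python
import numpy as np

rng = np.random.default_rng(2)
layers, tokens = 28, 448

# Synthetic map: strong layer-wise gradient, mild token-position noise,
# mimicking the patterns described above.
act = np.linspace(0, 5, layers)[:, None] + rng.normal(0, 0.1, (layers, tokens))

# Spread of per-layer means = variation along the Layer axis;
# mean of within-layer stds = variation along the Token Position axis.
layer_spread = act.mean(axis=1).std()
token_spread = act.std(axis=1).mean()

print(f"layer-wise spread: {layer_spread:.3f}")   # dominated by the gradient
print(f"token-wise spread: {token_spread:.3f}")   # near the noise level
```

A large ratio of the first quantity to the second is what "the primary variation is vertical" looks like numerically.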
### Interpretation
This visualization provides a "fingerprint" of how different LLM architectures process adversarial inputs at an internal, layer-by-layer level.
* **GPT-JT-6B's** uniform suppression suggests the jailbreak prompt may cause a general dampening of the model's standard processing pathway, potentially indicating a form of internal conflict or confusion.
* **LLaMA-3.1-8B's** layered response is particularly insightful. The suppression in early layers might reflect an attempt to filter or ignore the adversarial instruction, while the heightened activation in deep layers could indicate the model ultimately engaging with and processing the harmful content more intensely than a benign prompt. This aligns with theories that later layers handle more abstract, task-specific execution.
* **Mistral-7B's** near-identical maps suggest its internal representations are highly robust or invariant to the specific jailbreak technique used here. Its processing pathway does not significantly deviate from the benign case, which could imply stronger inherent safety alignment or a different failure mode not captured by this metric.
**Conclusion:** The "Difference" heatmap is a powerful diagnostic tool. It reveals that jailbreaking is not a monolithic phenomenon; its internal mechanistic impact varies dramatically across model families. LLaMA-3.1-8B shows the most structured and interpretable shift, while Mistral-7B appears most resistant *under these specific conditions*. This analysis moves beyond simply asking "did the jailbreak work?" to asking "how did the model's internal state change when it was attempted?"