## Heatmap Grid: Attention Pattern Analysis Across Language Models
### Overview
The image displays a 3x3 grid of heatmaps analyzing attention patterns in three Large Language Models (LLMs). Each row corresponds to a specific model and layer; the first two columns show the "Benign" and "Jailbreak" conditions, and the third shows the computed difference between them. The heatmaps visualize attention scores between query tokens and key tokens.
### Components/Axes
**Global Structure:**
- **Rows (Models & Layers):**
1. Top Row: `GPT-JT-6B - Layer 7`
2. Middle Row: `LLaMA-3.1-8B - Layer 4`
3. Bottom Row: `Mistral-7B - Layer 18`
- **Columns (Conditions):**
1. Left Column: `Benign`
2. Middle Column: `Jailbreak`
3. Right Column: `Difference`
**Axes (Identical for all 9 heatmaps):**
- **X-axis (Bottom):** `Key Token`. Scale ranges from 0 to 448, with major tick marks at 0, 64, 128, 192, 256, 320, 384, 448.
- **Y-axis (Left):** `Query Token`. Scale ranges from 0 to 448, with major tick marks at 0, 64, 128, 192, 256, 320, 384, 448.
**Color Bars (Legends):**
- **For "Benign" and "Jailbreak" columns (Left & Middle):** A vertical color bar is positioned to the right of each heatmap. The scale represents attention scores, ranging from approximately **-10 (dark purple/blue)** to **0 (yellow)**. The gradient transitions from dark purple/blue (low attention) through teal and green to yellow (maximal attention). An upper bound of 0 is consistent with log-scale attention weights, where a weight of 1 maps to log 1 = 0.
- **For the "Difference" column (Right):** A vertical color bar is positioned to the right of each heatmap. The scale represents the change in attention score (Jailbreak minus Benign), ranging from approximately **-4 (dark blue)** to **+4 (dark red)**, with **0 (white/light gray)** at the center. The gradient transitions from dark blue (decrease) through light blue/white to dark red (increase).
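The layout described above can be sketched programmatically. This is a minimal reconstruction, not the figure's actual source: the `plot_attention_grid` function, the dict-based data layout, and the `viridis`/`coolwarm` colormaps and lower-left axis origin are all assumptions inferred from the visual description.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; rendering only, no display needed
import matplotlib.pyplot as plt

def plot_attention_grid(benign, jailbreak, row_labels, n_tokens=448):
    """benign / jailbreak: dicts mapping a row label (model + layer) to an
    (n_tokens, n_tokens) array of attention scores (assumed log-scale)."""
    fig, axes = plt.subplots(len(row_labels), 3, figsize=(12, 4 * len(row_labels)))
    ticks = np.arange(0, n_tokens + 1, 64)  # major ticks at multiples of 64
    for i, label in enumerate(row_labels):
        diff = jailbreak[label] - benign[label]  # Jailbreak minus Benign
        panels = [
            (benign[label], "Benign", "viridis", -10, 0),
            (jailbreak[label], "Jailbreak", "viridis", -10, 0),
            (diff, "Difference", "coolwarm", -4, 4),
        ]
        for j, (data, title, cmap, vmin, vmax) in enumerate(panels):
            ax = axes[i, j]
            # origin="lower" is an assumption about the figure's orientation
            im = ax.imshow(data, cmap=cmap, vmin=vmin, vmax=vmax,
                           origin="lower", extent=(0, n_tokens, 0, n_tokens))
            ax.set_title(f"{label}: {title}", fontsize=8)
            ax.set_xlabel("Key Token")
            ax.set_ylabel("Query Token")
            ax.set_xticks(ticks)
            ax.set_yticks(ticks)
            fig.colorbar(im, ax=ax, fraction=0.046)
    fig.tight_layout()
    return fig
```

Each row's difference panel is derived from the two condition panels rather than supplied separately, mirroring how the third column is described as a computed quantity.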
### Detailed Analysis
**Row 1: GPT-JT-6B - Layer 7**
- **Benign (Top-Left):** The heatmap shows a strong triangular pattern. High attention scores (yellow/green, ~0 to -2) are concentrated along the main diagonal (where Query Token index ≈ Key Token index) and in the upper-left triangle (where Key Token index < Query Token index, i.e., keys preceding the query). The lower-right triangle (where Key Token index > Query Token index, i.e., future keys) is dominated by very low attention scores (dark purple, ~-8 to -10).
- **Jailbreak (Top-Middle):** The pattern is visually similar to the Benign condition, maintaining the same triangular structure. The intensity of the high-attention region (yellow/green) appears slightly more pronounced or extended along the diagonal compared to Benign.
- **Difference (Top-Right):** This heatmap is predominantly red, indicating a positive difference (Jailbreak > Benign) across most of the upper-left triangle and diagonal. The strongest increases (dark red, ~+4) are concentrated in the region where both Query and Key Token indices are low (approximately 0-128). The lower-right triangle shows minimal change (white/light gray, ~0).
**Row 2: LLaMA-3.1-8B - Layer 4**
- **Benign (Middle-Left):** Exhibits a similar triangular attention pattern to GPT-JT-6B. High attention (yellow/green) is in the upper-left triangle and along the diagonal. Low attention (dark purple) fills the lower-right triangle.
- **Jailbreak (Middle-Middle):** Again, the pattern is structurally identical to its Benign counterpart. The high-attention region appears slightly brighter or more extensive.
- **Difference (Middle-Right):** This map is also largely red, showing a widespread increase in attention scores under the Jailbreak condition. The increase is most significant (dark red) in the upper-left quadrant (low token indices). The lower-right triangle shows near-zero change.
**Row 3: Mistral-7B - Layer 18**
- **Benign (Bottom-Left):** The pattern differs notably. While still triangular, the high-attention region (yellow/green) is much broader and extends further into the matrix. The gradient from high to low attention is smoother. The lowest attention scores (dark blue/purple) are confined to the very bottom-right corner.
- **Jailbreak (Bottom-Middle):** The pattern changes significantly. The high-attention region (yellow) becomes more concentrated along the main diagonal and the top edge (low Key Token indices). A larger portion of the matrix, especially the central and lower-left areas, shifts to moderate attention scores (teal/green, ~-4 to -6).
- **Difference (Bottom-Right):** This heatmap shows a complex pattern. A large, central region (spanning roughly Query 128-384 and Key 0-256) is blue, indicating a *decrease* in attention under Jailbreak. The top-left corner and a strip along the bottom edge (high Query Token indices) show red, indicating an *increase*. The diagonal shows mixed or minimal change.
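The regional effects described above could be quantified by averaging the difference map over the stated index windows. A minimal sketch, assuming the arrays are indexed `[query, key]`; the `regional_mean_change` name and the specific window boundaries (taken from the visual description) are illustrative, not from the original analysis:

```python
import numpy as np

def regional_mean_change(diff, query_range, key_range):
    """Mean of the Jailbreak-minus-Benign map over a (query, key) window.
    diff is assumed to be indexed [query, key]."""
    q0, q1 = query_range
    k0, k1 = key_range
    return float(diff[q0:q1, k0:k1].mean())

# e.g., the large central decrease described for Mistral-7B:
# regional_mean_change(diff, query_range=(128, 384), key_range=(0, 256))
```

A negative value over Mistral's central window and positive values over the early-token windows of the other two models would reproduce the qualitative picture numerically.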
### Key Observations
1. **Consistent Triangular Structure:** All "Benign" and "Jailbreak" heatmaps display a causal attention pattern, where tokens primarily attend to themselves and previous tokens (upper-left triangle), with little to no attention to future tokens (lower-right triangle).
2. **Model-Specific Baseline:** GPT-JT-6B and LLaMA-3.1-8B show very similar baseline ("Benign") attention distributions in the selected layers. Mistral-7B's baseline attention in Layer 18 is more diffuse.
3. **Jailbreak Impact - General Increase:** For GPT-JT-6B and LLaMA-3.1-8B, the "Jailbreak" condition leads to a general, widespread *increase* in attention scores within the causally allowed region (upper-left triangle), most pronounced for early tokens.
4. **Jailbreak Impact - Mistral's Redistribution:** Mistral-7B shows a different response. The Jailbreak condition causes a *redistribution* of attention, not just a uniform increase. Attention decreases in a large central region and increases in specific areas (early tokens and late query tokens).
5. **Spatial Focus of Change:** In all "Difference" maps, the most significant changes (whether increase or decrease) are concentrated in the regions corresponding to lower token indices (top-left of the matrices).
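The triangular structure in observation 1 is what a causal attention mask produces by construction: positions where the key follows the query are masked out before the softmax, so their (log-scale) attention is driven toward negative infinity, typically clipped to a display floor such as the ~-10 seen in the color bars. A small sketch of this mechanism, with `causal_log_attention` as a hypothetical name:

```python
import numpy as np

def causal_log_attention(scores, floor=-10.0):
    """Apply a causal mask, then a log-softmax over the key axis.
    Future positions (key > query) receive log-probability -inf, clipped
    here to `floor`, reproducing the dark lower-right triangle."""
    n = scores.shape[-1]
    allowed = np.tril(np.ones((n, n), dtype=bool))  # key <= query
    masked = np.where(allowed, scores, -np.inf)
    # numerically stable log-softmax along the key axis
    m = masked.max(axis=-1, keepdims=True)
    shifted = masked - m
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return np.maximum(log_probs, floor)
```

Because the mask is applied identically in both conditions, the difference maps are near zero in the masked triangle, exactly as observed.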
### Interpretation
This visualization provides a technical, layer-specific view of how "jailbreaking" prompts alter the internal attention mechanisms of different LLMs.
* **What the data suggests:** The jailbreak technique appears to modify the model's focus. For GPT-JT and LLaMA, it generally amplifies attention within the standard causal window, potentially making the model more sensitive to or reliant on the context provided by earlier tokens in the sequence when processing a jailbreak prompt. For Mistral, the effect is more nuanced, suggesting a strategic reallocation of attention resources—perhaps suppressing certain internal relationships while enhancing others to bypass safety training.
* **How elements relate:** The "Difference" column is the critical analytical output, directly isolating the effect of the jailbreak from the model's baseline behavior. The consistency of the triangular structure confirms the underlying causal attention mask is unchanged; the jailbreak alters the *strength* of attention, not its fundamental direction.
* **Notable anomalies/outliers:** The stark contrast between the response of Mistral-7B (Layer 18) and the other two models is the primary anomaly. This could be due to differences in model architecture, the specific layer analyzed (18 vs. 4/7), or the jailbreak's effectiveness/mechanism on that model. The concentration of change in low-index tokens across all models is also notable, suggesting the initial context of the prompt is a critical battleground during jailbreak attempts.
* **Peircean investigative reading:** The heatmaps are an *index* of the model's internal state, pointing directly to a physical change (attention score) caused by the jailbreak stimulus. They are also a *symbol* representing a complex computational process. The pattern suggests that successful jailbreaking may not require a complete overhaul of the model's processing, but rather a subtle, targeted modulation of its existing attention patterns, particularly in early processing stages. This has implications for detection and defense strategies, which could focus on monitoring for these characteristic attention shifts.
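The detection idea raised above could start from something as simple as a deviation statistic over the early-token region where the changes concentrate. This is a hypothetical heuristic, not an established defense; the function name, the 128-token window, and the use of mean absolute deviation are all illustrative choices:

```python
import numpy as np

def attention_shift_score(attn, baseline, early=128):
    """Mean absolute deviation of an observed attention map from a benign
    baseline, restricted to the early-token block [0:early, 0:early] where
    the jailbreak-induced changes were observed to concentrate."""
    window = np.abs(attn[:early, :early] - baseline[:early, :early])
    return float(window.mean())

# A monitor could flag prompts whose score exceeds a threshold calibrated
# on benign traffic for the same model and layer.
```

Per-model, per-layer calibration would be essential: the Mistral-7B results show that the direction and location of the shift differ across models, so a single global threshold would likely not transfer.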