## [Scatter Plot Comparison]: Top 10 Safety Heads on Undiff Attn. vs. Scaling Cont.
### Overview
The image displays two side-by-side scatter plots comparing the locations (by Layer and Head) of the top 10 "safety heads" for two different large language models (Llama-2-7b-chat and Vicuna-7b-v1.5) under two different experimental conditions. The left plot is titled "Top 10 Safety Heads on Undiff Attn." and the right plot is titled "Top 10 Safety Heads on Scaling Cont." Each plot uses a color scale to represent a metric called "Generalized Ships."
### Components/Axes
**Common Elements for Both Plots:**
* **X-axis:** Label: "Layer". Scale: 0 to 30, with major ticks every 2 units.
* **Y-axis:** Label: "Head". Scale: 0 to 30, with major ticks every 2 units.
* **Legend:** Located in the top-right corner of each plot.
    * Purple Circle (●): "Llama-2-7b-chat"
    * Yellow X (✕): "Vicuna-7b-v1.5"
* **Color Bar:** Located to the right of each plot, labeled "Generalized Ships". The scale and range differ between plots.
**Left Plot Specifics:**
* **Title:** "Top 10 Safety Heads on Undiff Attn."
* **Color Bar Scale:** Ranges from 0 (dark purple) to 70 (bright yellow). Ticks at 0, 10, 20, 30, 40, 50, 60, 70.
**Right Plot Specifics:**
* **Title:** "Top 10 Safety Heads on Scaling Cont."
* **Color Bar Scale:** Ranges from 0 (dark purple) to ~22 (bright yellow). Ticks at 0, 5, 10, 15, 20.
### Detailed Analysis
**Left Plot: "Undiff Attn."**
* **Llama-2-7b-chat (Purple Circles):** Nine points are listed. Eight fall in early layers (1-4) with heads ranging from ~1 to ~29, and one outlier sits at layer ~28.
    * (Layer ~1, Head ~1), Color: Dark purple (~5)
    * (Layer ~2, Head ~15), Color: Dark purple (~5)
    * (Layer ~2, Head ~26), Color: Dark purple (~5)
    * (Layer ~2, Head ~29), Color: Dark purple (~5)
    * (Layer ~3, Head ~2), Color: Dark purple (~5)
    * (Layer ~3, Head ~6), Color: Dark purple (~5)
    * (Layer ~3, Head ~8), Color: Dark purple (~5)
    * (Layer ~4, Head ~7), Color: Dark purple (~5)
    * (Layer ~28, Head ~26), Color: Dark purple (~5)
* **Vicuna-7b-v1.5 (Yellow X's):** Eight points are listed, all in layers 1-6, with heads mostly below 10 and one outlier at head ~26.
    * (Layer ~1, Head ~8), Color: Yellow-green (~60)
    * (Layer ~2, Head ~1), Color: Blue-green (~30)
    * (Layer ~3, Head ~7), Color: Blue-green (~30)
    * (Layer ~4, Head ~2), Color: Blue-green (~30)
    * (Layer ~6, Head ~0), Color: Blue-green (~30)
    * (Layer ~6, Head ~2), Color: Blue-green (~30)
    * (Layer ~6, Head ~6), Color: Blue-green (~30)
    * (Layer ~3, Head ~26), Color: Blue-green (~30) [Note: This point appears to overlap with a Llama circle, most plausibly the one listed at (Layer ~2, Head ~26), since all coordinates are approximate.]
**Right Plot: "Scaling Cont."**
* **Llama-2-7b-chat (Purple Circles):** Seven of the ten points sit in the very early layers (0-1) and the other three in layers 13-14, with heads spanning a wide range.
    * (Layer ~0, Head ~13), Color: Teal (~12)
    * (Layer ~0, Head ~21), Color: Teal (~12)
    * (Layer ~0, Head ~25), Color: Blue (~8)
    * (Layer ~1, Head ~8), Color: Teal (~12)
    * (Layer ~1, Head ~15), Color: Yellow (~20)
    * (Layer ~1, Head ~22), Color: Teal (~12)
    * (Layer ~1, Head ~27), Color: Blue (~8)
    * (Layer ~13, Head ~1), Color: Blue (~8)
    * (Layer ~13, Head ~4), Color: Teal (~12)
    * (Layer ~14, Head ~23), Color: Blue (~8)
* **Vicuna-7b-v1.5 (Yellow X's):** Points are scattered, with a cluster around layers 4-5 and single points at layers 16 and 21.
    * (Layer ~4, Head ~15), Color: Teal (~12)
    * (Layer ~5, Head ~15), Color: Teal (~12)
    * (Layer ~16, Head ~0), Color: Teal (~12)
    * (Layer ~21, Head ~10), Color: Teal (~12)
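The approximate readings listed above can be gathered into a small data structure to check the per-model layer ranges. This is a minimal sketch: the coordinates are the rough visual estimates transcribed in this section, not exact values from the figure's underlying data, and the names `POINTS` and `layer_summary` are hypothetical.

```python
from collections import Counter

# Approximate (layer, head) readings transcribed from the lists above.
# These are visual estimates, not exact values from the source data.
POINTS = {
    ("Undiff Attn.", "Llama-2-7b-chat"): [
        (1, 1), (2, 15), (2, 26), (2, 29), (3, 2),
        (3, 6), (3, 8), (4, 7), (28, 26),
    ],
    ("Undiff Attn.", "Vicuna-7b-v1.5"): [
        (1, 8), (2, 1), (3, 7), (4, 2), (6, 0), (6, 2), (6, 6), (3, 26),
    ],
    ("Scaling Cont.", "Llama-2-7b-chat"): [
        (0, 13), (0, 21), (0, 25), (1, 8), (1, 15),
        (1, 22), (1, 27), (13, 1), (13, 4), (14, 23),
    ],
    ("Scaling Cont.", "Vicuna-7b-v1.5"): [
        (4, 15), (5, 15), (16, 0), (21, 10),
    ],
}


def layer_summary(points):
    """Return the point count, layer range, and points-per-layer tally."""
    layers = [layer for layer, _head in points]
    return {
        "count": len(points),
        "min_layer": min(layers),
        "max_layer": max(layers),
        "per_layer": dict(Counter(layers)),
    }


if __name__ == "__main__":
    for (condition, model), pts in POINTS.items():
        print(condition, model, layer_summary(pts))
```

Summaries like these make it easy to check the prose against the listed points, for example how many of a model's heads fall in the first two layers under each condition.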
### Key Observations
1. **Condition-Dependent Distribution:** The spatial distribution of top safety heads changes dramatically between the "Undiff Attn." and "Scaling Cont." conditions for both models.
2. **Model-Specific Patterns:**
    * Under "Undiff Attn.", Llama's top heads are mostly in very early layers (1-4) with one late-layer outlier (28), while Vicuna's listed heads are in layers 1-6.
    * Under "Scaling Cont.", seven of Llama's ten heads are in the first two layers (0-1) and the other three are in layers 13-14, while Vicuna's are more dispersed (layers 4, 5, 16, 21).
3. **"Generalized Ships" Metric:** The color bar extends much higher for the "Undiff Attn." condition (up to 70) than for "Scaling Cont." (up to ~22). This suggests the metric is sensitive to the experimental condition.
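4. **Overlap:** In the left plot, a Vicuna point at (Layer ~3, Head ~26) appears to overlap with a Llama point (listed above at Layer ~2, Head ~26). Because the coordinates are approximate, this suggests that both models rank a head in nearly the same position highly under that condition.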
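Observation 1 can be checked roughly by measuring how many (layer, head) pairs a model shares across the two conditions. The sketch below uses Jaccard similarity on the approximate coordinates transcribed in this description. The variable names and the `jaccard` helper are hypothetical, and because the readings are visual estimates, an exact-match comparison is sensitive to reading error.

```python
# Approximate (layer, head) readings per model and condition, as listed in
# the Detailed Analysis. These are visual estimates, not exact data.
LLAMA_UNDIFF = [(1, 1), (2, 15), (2, 26), (2, 29), (3, 2),
                (3, 6), (3, 8), (4, 7), (28, 26)]
LLAMA_SCALING = [(0, 13), (0, 21), (0, 25), (1, 8), (1, 15),
                 (1, 22), (1, 27), (13, 1), (13, 4), (14, 23)]
VICUNA_UNDIFF = [(1, 8), (2, 1), (3, 7), (4, 2),
                 (6, 0), (6, 2), (6, 6), (3, 26)]
VICUNA_SCALING = [(4, 15), (5, 15), (16, 0), (21, 10)]


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of heads."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    print("Llama-2-7b-chat:", jaccard(LLAMA_UNDIFF, LLAMA_SCALING))
    print("Vicuna-7b-v1.5:", jaccard(VICUNA_UNDIFF, VICUNA_SCALING))
```

With the readings as transcribed, neither model shares any exact (layer, head) pair between the two conditions, which is consistent with the condition-dependent distribution noted in observation 1.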
### Interpretation
This visualization is likely from research on mechanistic interpretability or safety in LLMs. "Safety Heads" probably refers to specific attention heads within the model that are crucial for safe or aligned behavior. "Undiff Attn." (most likely Undifferentiated Attention) and "Scaling Cont." (most likely Scaling Contribution) appear to be two different head-ablation methods used to identify these heads.
The data suggests that:
* **The location of influential "safety" mechanisms is not fixed** but depends heavily on the evaluation method ("Undiff Attn." vs. "Scaling Cont.").
* **Llama-2-7b-chat and Vicuna-7b-v1.5 place their top-ranked safety heads differently, even though they share the same architecture (Vicuna-7b-v1.5 is fine-tuned from Llama 2).** Under "Scaling Cont.", most of Llama's top heads sit in layers 0-1, while Vicuna's are more scattered.
* The "Generalized Ships" metric, whose meaning is not defined in the image, appears to be a stronger signal under the "Undiff Attn." condition. Its higher values there might indicate a more pronounced or easily detectable effect.
**In summary, the image suggests that the identification of "safety-critical" components in LLMs is highly contingent on the analytical lens applied, and that the two models rely on different attention heads for safety-related behavior.**