Image c3ea5651e214...

EXPERT: gemini-2.0-flash VERSION 1

RUNTIME: nugit/gemini/gemini-2.0-flash
## Heatmaps: GPT-2 Name-Copying Heads and (Suppression) Name-Copying Heads

### Overview
The image presents two heatmaps side by side, each showing a score for every head and layer in a GPT-2 model. The left heatmap ("Name-Copying heads") plots the Name Copying score; the right heatmap ("(Suppression) Name-Copying heads") plots the (Suppression) Name Copying score. Both use the same color gradient, ranging from dark purple (0.0) to bright yellow (1.0). Each heatmap overlays markers for heads classified by 'Interp. in the Wild': Name-Mover Heads and Backup Name-Mover Heads on the left, and their (Negative) counterparts on the right.

### Components/Axes

**Left Heatmap (GPT-2: Name-Copying heads):**

*   **Title:** GPT-2: Name-Copying heads
*   **Y-axis:** Head, with labels from 0 to 11.
*   **X-axis:** Layer, with labels from 0 to 11.
*   **Color Bar (Right Side):** Name Copying score, ranging from 0.0 to 1.0 in increments of 0.2.
*   **Legend (Top-Left):**
    *   'Interp. in the Wild' classifications
    *   'X' represents Name-Mover Heads
    *   Gray dot represents Backup Name-Mover Heads

**Right Heatmap (GPT-2: (Suppression) Name-Copying heads):**

*   **Title:** GPT-2: (Suppression) Name-Copying heads
*   **Y-axis:** Head, with labels from 0 to 11.
*   **X-axis:** Layer, with labels from 0 to 11.
*   **Color Bar (Right Side):** (Suppression) Name Copying score, ranging from 0.0 to 1.0 in increments of 0.2.
*   **Legend (Top-Left):**
    *   'Interp. in the Wild' classifications
    *   'X' represents (Negative) Name-Mover Heads
    *   Gray dot represents Backup (Negative) Name-Mover Heads

### Detailed Analysis

**Left Heatmap (GPT-2: Name-Copying heads):**

*   **Head 0:**
    *   Layer 9: Yellow, approximately 0.8-1.0.
    *   Layer 10: Yellow, approximately 0.8-1.0.
    *   Layer 11: Green, approximately 0.6-0.8.
*   **Head 1:**
    *   Layer 10: Green, approximately 0.6-0.8.
    *   Layer 11: Green, approximately 0.6-0.8.
*   **Head 2:**
    *   Layer 9: Yellow, approximately 0.8-1.0.
*   **Head 3:**
    *   Layer 4: Blue, approximately 0.2-0.4.
*   **Head 4:**
    *   Layer 1: Blue, approximately 0.2-0.4.
*   **Head 6:**
    *   Layer 0: Blue, approximately 0.2-0.4.
*   **Head 9:**
    *   Layer 6: Yellow, approximately 0.8-1.0.
    *   Layer 7: Yellow, approximately 0.8-1.0.
*   **Head 10:**
    *   Layer 8: Yellow, approximately 0.8-1.0.
    *   Layer 9: Green, approximately 0.6-0.8.
*   **Head 11:**
    *   Layer 9: Green, approximately 0.6-0.8.

**Name-Mover Heads (X markers):**

*   Head 0, Layer 10
*   Head 6, Layer 9
*   Head 9, Layer 6

**Backup Name-Mover Heads (Gray dots):**

*   Head 0, Layer 9
*   Head 1, Layer 10
*   Head 6, Layer 10
*   Head 9, Layer 7

**Right Heatmap (GPT-2: (Suppression) Name-Copying heads):**

*   **Head 4:**
    *   Layer 6: Blue, approximately 0.2-0.4.
*   **Head 6:**
    *   Layer 0: Blue, approximately 0.2-0.4.
*   **Head 9:**
    *   Layer 6: Green, approximately 0.4-0.6.
*   **Head 11:**
    *   Layer 5: Blue, approximately 0.0-0.2.

**(Negative) Name-Mover Heads (X markers):**

*   Head 9, Layer 10
*   Head 11, Layer 11

**Backup (Negative) Name-Mover Heads (Gray dots):**

*   Head 6, Layer 9

### Key Observations

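*   The left heatmap contains more high-scoring cells (yellow and green) than the right heatmap, which is predominantly dark purple and blue, with most values close to 0.
*   In the left heatmap, every listed cell scoring roughly 0.6 or higher lies in layers 6 through 11; the listed cells in layers 0 through 4 score roughly 0.2-0.4.
*   Of the seven marked heads in the left heatmap (three Name-Mover, four Backup Name-Mover), five coincide with listed cells scoring roughly 0.6 or higher; the two markers on Head 6 (Layers 9 and 10) have no listed score.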
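
The overlap between marked heads and high-scoring cells can be tallied directly from the values listed under Detailed Analysis. The sketch below is illustrative only: the score bands are the approximate ranges transcribed from the left heatmap, not exact values read from the underlying data, and the names `left_scores`, `split_by_score`, and the 0.6 threshold are assumptions introduced here.

```python
# Approximate score bands for the left heatmap, transcribed from the
# Detailed Analysis list. Keys are (head, layer); values are the
# estimated (low, high) bounds. These are visual estimates, not data.
left_scores = {
    (0, 9): (0.8, 1.0), (0, 10): (0.8, 1.0), (0, 11): (0.6, 0.8),
    (1, 10): (0.6, 0.8), (1, 11): (0.6, 0.8),
    (2, 9): (0.8, 1.0),
    (3, 4): (0.2, 0.4),
    (4, 1): (0.2, 0.4),
    (6, 0): (0.2, 0.4),
    (9, 6): (0.8, 1.0), (9, 7): (0.8, 1.0),
    (10, 8): (0.8, 1.0), (10, 9): (0.6, 0.8),
    (11, 9): (0.6, 0.8),
}

# Marker positions as listed for the left heatmap.
name_movers = [(0, 10), (6, 9), (9, 6)]
backup_name_movers = [(0, 9), (1, 10), (6, 10), (9, 7)]


def split_by_score(marked, scores, threshold=0.6):
    """Partition marked cells by whether their listed lower bound
    reaches `threshold`. Cells with no listed score go in `rest`."""
    high = [c for c in marked if c in scores and scores[c][0] >= threshold]
    rest = [c for c in marked if c not in high]
    return high, rest


marked = name_movers + backup_name_movers
high, rest = split_by_score(marked, left_scores)
print(f"{len(high)} of {len(marked)} marked heads have a listed score >= 0.6")
print("marked heads without a listed high score:", rest)
```

Cells missing from `left_scores` are treated as "not high" because the description gives no value for them; whether they are actually near zero in the figure cannot be determined from the text alone.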

### Interpretation

The heatmaps show how strongly each head at each layer of a GPT-2 model scores on two related measures. The left heatmap reports the Name Copying score; the right heatmap reports the (Suppression) Name Copying score. The figure's titles and legends present the right panel as a separate score with its own (Negative) Name-Mover classifications, so it reads as a measure of name suppression rather than as the left panel re-measured after an intervention. The lower values on the right therefore indicate that few heads score highly on suppression; they do not show that a suppression technique reduced name copying. In both panels, the markers show where the 'Interp. in the Wild' classifications fall relative to the scores. High scores are confined to a small number of head-layer cells, which suggests that only certain heads carry these behaviors.