Image b8cafae884b7...

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free
INTEL_VERIFIED
## Bar Chart: Llama-2-7b-chat-hf Attack Success Rates by Ablated Head Numbers

### Overview
The chart compares attack success rates (ASR) across three datasets (`maliciousinstruct`, `jailbreakbench`, `advbench`) for a Llama-2-7b-chat-hf model when specific attention heads are ablated. The x-axis represents ablated head numbers (0–5), and the y-axis shows ASR (0–0.40). Each dataset is represented by a distinct color: yellow (`maliciousinstruct`), dark green (`jailbreakbench`), and dark gray (`advbench`).

### Components/Axes
- **X-axis**: "Ablated Head Numbers" (0–5), with head 0 having no visible data.
- **Y-axis**: "Attack Success Rate (ASR)" (0–0.40), scaled in increments of 0.05.
- **Legend**: Located in the top-right corner, mapping colors to datasets:
  - Yellow: `maliciousinstruct`
  - Dark green: `jailbreakbench`
  - Dark gray: `advbench`
- **Bars**: Grouped by ablated head number, with three bars per group (one per dataset).

### Detailed Analysis
- **Head 0**: All ASR values are near 0 (no visible bars).
- **Head 1**:
  - `maliciousinstruct`: ~0.21
  - `jailbreakbench`: ~0.31
  - `advbench`: ~0.33
- **Head 2**:
  - `maliciousinstruct`: ~0.23
  - `jailbreakbench`: ~0.30
  - `advbench`: ~0.35
- **Head 3**:
  - `maliciousinstruct`: ~0.18
  - `jailbreakbench`: ~0.19
  - `advbench`: ~0.25
- **Head 4**:
  - `maliciousinstruct`: ~0.16
  - `jailbreakbench`: ~0.17
  - `advbench`: ~0.23
- **Head 5**:
  - `maliciousinstruct`: ~0.17
  - `jailbreakbench`: ~0.26
  - `advbench`: ~0.22

### Key Observations
1. **Head 0**: No attack success observed for any dataset.
2. **Peak Performance**:
   - `advbench` achieves the highest ASR at head 2 (~0.35).
   - `jailbreakbench` peaks at head 1 (~0.31).
3. **Declining Trends**:
   - All datasets show reduced ASR after head 2, with sharper declines in heads 3–5.
   - `maliciousinstruct` consistently has the lowest ASR across all heads.
4. **Head 5 Anomaly**: `jailbreakbench` shows a slight recovery (~0.26) compared to heads 3–4.

### Interpretation
The data suggests that ablated heads 1 and 2 are critical for attack success, particularly for `advbench` and `jailbreakbench`. The sharp decline in ASR after head 2 indicates these heads may encode key information for adversarial robustness. `maliciousinstruct`’s lower ASR across all heads implies it is less sensitive to head ablation. The partial recovery in `jailbreakbench` at head 5 could indicate redundancy or alternative pathways in later layers. This analysis highlights the importance of specific attention heads in maintaining model security against targeted attacks.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

b8cafae884b7a9eb40461914

FOUND IN PAPERS

EXPERT: nemotron-free VERSION 1