## Heatmap Pair: AUROC for Projections a^T t
### Overview
The image displays two side-by-side heatmaps comparing the Area Under the Receiver Operating Characteristic curve (AUROC) for different combinations of training and test sets. The overall title is "AUROC for Projections a^T t". The left heatmap shows results when nothing is "Projected out," while the right heatmap shows results when "t_G and t_P" are projected out. A shared color bar on the far right indicates the AUROC scale, ranging from 0.0 (dark red) to 1.0 (bright yellow).
### Components/Axes
* **Main Title:** "AUROC for Projections a^T t"
* **Subplot Titles:**
* Left: "Projected out: None"
* Right: "Projected out: t_G and t_P"
* **Y-Axis (Both Heatmaps):** Labeled "Test Set". Categories from top to bottom:
1. `cities`
2. `neg_cities`
3. `facts`
4. `neg_facts`
5. `facts_conj`
6. `facts_disj`
* **X-Axis (Both Heatmaps):** Labeled "Train Set". Categories from left to right (cumulative; each `+` entry adds data to the previous training set):
1. `cities`
2. `+ neg_cities`
3. `+ cities_conj`
4. `+ cities_disj`
* **Color Bar (Legend):** Positioned vertically on the right edge of the image. Scale from 0.0 (bottom, dark red) to 1.0 (top, bright yellow). Ticks at 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
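As a rough illustration, the layout described above (two heatmaps sharing a y-axis and a single colorbar) could be reproduced with matplotlib along the following lines. The AUROC matrices here are random placeholders, not the figure's actual values, and the colormap choice is an approximation of the dark-red-to-yellow scale.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

test_sets = ["cities", "neg_cities", "facts", "neg_facts", "facts_conj", "facts_disj"]
train_sets = ["cities", "+ neg_cities", "+ cities_conj", "+ cities_disj"]

# Placeholder 6 x 4 AUROC matrices, one per projection condition.
rng = np.random.default_rng(0)
left, right = rng.random((6, 4)), rng.random((6, 4))

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
titles = ["Projected out: None", "Projected out: t_G and t_P"]
for ax, data, title in zip(axes, (left, right), titles):
    im = ax.imshow(data, cmap="hot", vmin=0.0, vmax=1.0)  # 0.0 dark, 1.0 bright
    ax.set_xticks(range(len(train_sets)))
    ax.set_xticklabels(train_sets, rotation=45, ha="right")
    ax.set_xlabel("Train Set")
    ax.set_title(title)
axes[0].set_yticks(range(len(test_sets)))
axes[0].set_yticklabels(test_sets)
axes[0].set_ylabel("Test Set")
fig.suptitle("AUROC for Projections $a^T t$")
fig.colorbar(im, ax=axes, ticks=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0])  # shared color bar
```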
### Detailed Analysis
**Left Heatmap: Projected out: None**
This matrix shows generally high AUROC scores, especially for the `cities` and `neg_cities` test sets when trained on related sets.
| Test Set \ Train Set | `cities` | `+ neg_cities` | `+ cities_conj` | `+ cities_disj` |
| :--- | :--- | :--- | :--- | :--- |
| **`cities`** | 1.00 | 1.00 | 0.99 | 0.98 |
| **`neg_cities`** | 0.11 | 1.00 | 0.99 | 0.98 |
| **`facts`** | 0.85 | 0.95 | 0.94 | 0.94 |
| **`neg_facts`** | 0.44 | 0.81 | 0.69 | 0.71 |
| **`facts_conj`** | 0.56 | 0.73 | 0.70 | 0.71 |
| **`facts_disj`** | 0.51 | 0.59 | 0.58 | 0.59 |
**Right Heatmap: Projected out: t_G and t_P**
This matrix shows a significant reduction in AUROC scores across most categories, particularly for the fact-related test sets (`facts`, `neg_facts`, `facts_conj`, `facts_disj`), which now show values in the 0.31-0.55 range (orange/red). The `cities` and `neg_cities` test sets retain relatively high scores.
| Test Set \ Train Set | `cities` | `+ neg_cities` | `+ cities_conj` | `+ cities_disj` |
| :--- | :--- | :--- | :--- | :--- |
| **`cities`** | 1.00 | 0.99 | 0.95 | 0.94 |
| **`neg_cities`** | 0.13 | 0.99 | 0.95 | 0.94 |
| **`facts`** | 0.41 | 0.31 | 0.41 | 0.39 |
| **`neg_facts`** | 0.55 | 0.50 | 0.47 | 0.49 |
| **`facts_conj`** | 0.38 | 0.43 | 0.53 | 0.55 |
| **`facts_disj`** | 0.39 | 0.41 | 0.49 | 0.51 |
### Key Observations
1. **Projection Impact:** The most striking change is the dramatic decrease in AUROC for every fact-based test set (`facts`, `neg_facts`, `facts_conj`, `facts_disj`) once `t_G` and `t_P` are projected out: their scores drop from the 0.44-0.95 range to the 0.31-0.55 range.
2. **Robustness of `cities`/`neg_cities`:** The `cities` and `neg_cities` test sets maintain very high AUROC (≥0.94) in both conditions, except for the specific case where `neg_cities` is tested against a model trained only on `cities` (AUROC ~0.11-0.13). This indicates the model's core ability to distinguish city-related concepts is largely unaffected by projecting out `t_G` and `t_P`.
3. **Training Set Augmentation:** In the left heatmap, augmenting the training set (`+ neg_cities`, `+ cities_conj`, `+ cities_disj`) generally improves or maintains performance across test sets; most dramatically, adding `neg_cities` lifts the `neg_cities` test AUROC from 0.11 to 1.00.
4. **Color Gradient Confirmation:** The visual color gradient aligns perfectly with the numerical values. Bright yellow cells correspond to values near 1.0, orange to mid-range values (~0.5-0.7), and dark red to low values (<0.2).
### Interpretation
This analysis investigates the role of two specific directions (`t_G` and `t_P`) in a model's ability to classify different types of data. The data suggest that `t_G` and `t_P` carry **the information used to distinguish fact-based statements** (`facts`, `neg_facts`, `facts_conj`, `facts_disj`). When these directions are projected out, performance on the fact-related test sets collapses to near-chance levels (an AUROC of ~0.5 corresponds to random guessing).
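The two operations involved here can be sketched in a few lines; this is a minimal illustration, assuming activations are the rows of a matrix `T`, the probe direction is a vector `a`, and the directions to remove (standing in for `t_G` and `t_P`) are given explicitly. The names and data are illustrative, not taken from the figure's underlying experiment.

```python
import numpy as np

def project_out(T, directions):
    """Remove the span of the given directions from each row of T (n x d)."""
    Q, _ = np.linalg.qr(np.stack(directions, axis=1))  # orthonormal basis, d x k
    return T - (T @ Q) @ Q.T

def auroc(scores, labels):
    """Rank-based AUROC: P(a random positive outscores a random negative).
    Ties are ignored for simplicity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Example: scores are the projections a^T t; projecting out the probe
# direction itself zeroes the scores, collapsing AUROC toward chance.
rng = np.random.default_rng(0)
a = np.array([1.0, 0.0, 0.0])          # illustrative probe direction
T = rng.normal(size=(8, 3))            # illustrative activations
T_clean = project_out(T, [np.array([1.0, 0.0, 0.0])])
```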
Conversely, the model's ability to classify city-related data (`cities`, `neg_cities`) appears to rely on different, more robust features that are not captured by `t_G` and `t_P`. The consistently high scores for these sets imply the model has learned a strong, separate representation for geographical or entity-based concepts.
The outlier (the very low AUROC for `neg_cities` under `cities`-only training) highlights a specific failure mode: a classifier trained only on affirmative city statements performs far below chance (AUROC ~0.11) on negated ones, a gap that closes immediately once negated examples are added to the training set (`+ neg_cities`).
In essence, the experiment demonstrates a **functional separation in the model's learned representations**: one set of features (`t_G`, `t_P`) is specialized for factual reasoning, while other, more persistent features handle entity classification. Removing the former cripples factual reasoning but leaves entity classification intact.