## Heatmap: BLEU Scores and Exact Match Rates Across Scenarios and Transformations
### Overview
The image presents two stacked heatmaps comparing performance metrics (BLEU scores and exact match percentages) across three scenarios (ID, CMP, OOD) and six transformation types (f1, f2, f1·f1, f1·f2, f2·f1, f2·f2). The top heatmap shows BLEU scores (0-1 scale), while the bottom shows exact match rates (0-100%).
### Components/Axes
**Axes:**
- **Y-axis (Scenarios):**
- Top: ID (in-distribution data)
- Middle: CMP (comparison data)
- Bottom: OOD (out-of-distribution data)
- **X-axis (Transformations):**
- f1, f2, f1·f1, f1·f2, f2·f1, f2·f2
- **Color Scales:**
- BLEU scores (top heatmap, color bar on the right): Red (1.0) → Blue (0.0)
- Exact match (bottom heatmap, color bar on the right): Red (100%) → Blue (0%)
**Legend Placement:**
- BLEU score legend: Right side, top heatmap
- Exact match legend: Bottom heatmap, right side
### Detailed Analysis
**BLEU Scores (Top Heatmap):**
- **ID Scenario:** All transformations score 1.00 (perfect BLEU)
- **CMP Scenario:**
- f1: 0.71
- f2: 0.62
- f1·f1: 0.65
- f1·f2: 0.68
- f2·f1: 0.32
- f2·f2: 0.16
- **OOD Scenario:**
- f1·f1: 0.46
- f1·f2: 0.35
- f2·f1: 0.40
- f2·f2: 0.35
- f1 and f2 (the single transformations): 0.00
**Exact Match Rates (Bottom Heatmap):**
- **ID Scenario:** All transformations score 100%
- **CMP & OOD Scenarios:** All transformations score 0%
### Key Observations
1. **ID Scenario Dominance:** Perfect performance (BLEU=1.00, exact match=100%) across all transformations
2. **CMP Scenario Variability:**
- Highest BLEU for f1 (0.71); the best composition is f1·f2 (0.68)
- Lowest BLEU for f2·f2 (0.16)
- No exact matches despite moderate BLEU scores
3. **OOD Scenario Limitations:**
- Only transformation combinations (f1·f1, f1·f2, f2·f1, f2·f2) show partial BLEU scores (0.35-0.46)
- No exact matches despite non-zero BLEU scores
4. **Transformation Impact:**
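- In CMP, the single transformations (f1: 0.71, f2: 0.62) score well above f2·f1 (0.32) and f2·f2 (0.16), although f1·f1 (0.65) and f1·f2 (0.68) both exceed f2; in OOD the pattern reverses, with single transformations at 0.00 and compositions at 0.35-0.46
- f2·f2 has the lowest BLEU in CMP (0.16) and ties with f1·f2 for the lowest non-zero BLEU in OOD (0.35); exact match cannot separate transformations in CMP or OOD because every cell is 0%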
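The comparisons in this list can be checked mechanically against the cell values. The following is a minimal sketch, assuming the BLEU values listed under Detailed Analysis were read correctly from the figure; the names `TRANSFORMS`, `BLEU`, and `summarize` are illustrative and do not come from the source.

```python
# BLEU values transcribed from the top heatmap
# (keys: scenario, list order: transformation as on the x-axis).
TRANSFORMS = ["f1", "f2", "f1·f1", "f1·f2", "f2·f1", "f2·f2"]
BLEU = {
    "ID":  [1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    "CMP": [0.71, 0.62, 0.65, 0.68, 0.32, 0.16],
    "OOD": [0.00, 0.00, 0.46, 0.35, 0.40, 0.35],
}


def summarize(scenario):
    """Return the best/worst transformation and group means for one scenario.

    Ties are resolved in favor of the leftmost transformation.
    """
    scores = dict(zip(TRANSFORMS, BLEU[scenario]))
    single = [v for k, v in scores.items() if "·" not in k]
    composed = [v for k, v in scores.items() if "·" in k]
    return {
        "best": max(scores, key=scores.get),
        "worst": min(scores, key=scores.get),
        "mean_single": sum(single) / len(single),
        "mean_composed": sum(composed) / len(composed),
    }


for name in BLEU:
    print(name, summarize(name))
```

Grouping by the presence of `·` separates single transformations from compositions, which makes it easy to compare the two groups per scenario rather than across the whole figure.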
### Interpretation
On average, the data shows the ordering ID > CMP > OOD: mean BLEU is 1.00 for ID, about 0.52 for CMP, and 0.26 for OOD. The ordering does not hold cell by cell, however, since OOD scores higher than CMP for f2·f1 (0.40 vs. 0.32) and f2·f2 (0.35 vs. 0.16). Because BLEU measures n-gram overlap with the reference, the moderate CMP scores indicate partially correct outputs, but the 0% exact match rate means that no output is correct in full. The same disconnect appears in OOD, where the compositions reach BLEU 0.35-0.46 with 0% exact match and the single transformations score 0.00 on both metrics. Within CMP, the compositions that begin with f2 (f2·f1 at 0.32, f2·f2 at 0.16) fall well below the other transformations (0.62-0.71). This suggests that performance depends on which transformations are composed rather than on composition alone, although the figure by itself does not establish the cause.
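The gap between BLEU and exact match follows from how the two metrics are computed: BLEU aggregates clipped n-gram precisions, whereas exact match requires the whole output to equal the reference. The sketch below illustrates this with a toy pair of sequences. It is not the evaluation code behind the figure; the token sequences are hypothetical, and full BLEU additionally combines several n-gram orders with a geometric mean and applies a brevity penalty.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the per-order building block of BLEU."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def exact_match(candidate, reference):
    """True only when the two sequences are identical token for token."""
    return candidate == reference


# Hypothetical sequences: the candidate swaps the last two tokens of the reference.
reference = ["a", "b", "c", "d", "e", "f"]
candidate = ["a", "b", "c", "d", "f", "e"]

print("unigram precision:", ngram_precision(candidate, reference, 1))
print("bigram precision: ", ngram_precision(candidate, reference, 2))
print("exact match:      ", exact_match(candidate, reference))
```

A single local error leaves most n-grams intact while making the exact match fail, which is consistent with cells that show non-zero BLEU alongside 0% exact match.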