## Heatmap: BLEU Score and Edit Distance Across Scenarios and Temperatures
### Overview
The image is a dual-axis heatmap comparing performance metrics (BLEU Score and Edit Distance) across two scenarios ("Scenario" and "Scenario OOD CMP") at varying temperatures (1e-05 to 10.0). The heatmap uses a color gradient from blue (low values) to red (high values) to represent metric magnitudes.
### Components/Axes
- **Y-Axis (Left)**:
  - Top Section: "Scenario" with subcategories:
    - OOD POD CMP (0.687, 0.454, 0.277)
  - Bottom Section: "Scenario OOD CMP" with subcategories:
    - OOD POD CMP (0.133, 0.167, 0.299)
- **X-Axis (Bottom)**: Temperature values: 1e-05, 0.01, 0.1, 1.0, 5.0, 10.0
- **Color Legends (Right)**:
  - **BLEU Score**: Red (high) to Blue (low), range 0.002–0.687
  - **Edit Distance**: Red (high) to Blue (low), range 0.133–0.846
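The Edit Distance legend spans values between 0 and 1, which suggests a length-normalized metric rather than a raw edit count. The figure does not say how it was computed, so the following is only a minimal sketch of one common definition (character-level Levenshtein distance divided by the longer string's length). The function names `levenshtein` and `normalized_edit_distance` are hypothetical and not taken from the source.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(hyp: str, ref: str) -> float:
    """Assumed normalization: divide by the longer length, giving a value in [0, 1]."""
    longest = max(len(hyp), len(ref))
    return 0.0 if longest == 0 else levenshtein(hyp, ref) / longest
```

Under this definition, 0.0 means the output matches the reference exactly and 1.0 means every position had to change.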
### Detailed Analysis
#### Top Section ("Scenario")
- **BLEU Score Trends**:
  - **OOD POD CMP**:
    - 1e-05: 0.687 (red)
    - 0.01: 0.687 (red)
    - 0.1: 0.687 (red)
    - 1.0: 0.686 (red)
    - 5.0: 0.019 (blue)
    - 10.0: 0.002 (blue)
  - **Trend**: BLEU Score is essentially flat (0.686–0.687) up to temperature 1.0, then collapses to 0.019 at 5.0 and 0.002 at 10.0.
#### Bottom Section ("Scenario OOD CMP")
- **Edit Distance Trends**:
  - **OOD POD CMP**:
    - 1e-05: 0.133 (blue)
    - 0.01: 0.133 (blue)
    - 0.1: 0.133 (blue)
    - 1.0: 0.133 (blue)
    - 5.0: 0.760 (red)
    - 10.0: 0.830 (red)
  - **Trend**: Edit Distance is constant at 0.133 up to temperature 1.0, jumps to 0.760 at 5.0, and rises further to 0.830 at 10.0.
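To make the threshold explicit, the per-temperature values listed above can be tabulated and scanned for the largest change between adjacent temperatures. This is a small sketch using only the numbers reported in this section; the helper name `largest_jump` is hypothetical.

```python
# Values transcribed from the lists above.
temperatures = [1e-05, 0.01, 0.1, 1.0, 5.0, 10.0]
bleu = [0.687, 0.687, 0.687, 0.686, 0.019, 0.002]
edit_distance = [0.133, 0.133, 0.133, 0.133, 0.760, 0.830]


def largest_jump(values, temps):
    """Return (t_low, t_high, delta) for the adjacent pair with the largest absolute change."""
    deltas = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    i = max(range(len(deltas)), key=lambda k: abs(deltas[k]))
    return temps[i], temps[i + 1], round(deltas[i], 3)


bleu_jump = largest_jump(bleu, temperatures)
edit_jump = largest_jump(edit_distance, temperatures)
print(bleu_jump)  # (1.0, 5.0, -0.667)
print(edit_jump)  # (1.0, 5.0, 0.627)
```

Both metrics change most between temperatures 1.0 and 5.0. Since no temperatures were sampled inside that interval, the exact transition point cannot be narrowed further from these values.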
### Key Observations
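1. **BLEU Score Degradation**: In the "Scenario" section, BLEU Score holds at 0.686–0.687 up to temperature 1.0, then drops to 0.019 at 5.0 and 0.002 at 10.0, suggesting that output quality collapses at high temperatures.
2. **Edit Distance Correlation**: In "Scenario OOD CMP", Edit Distance stays at 0.133 up to temperature 1.0, then rises to 0.760 at 5.0 and 0.830 at 10.0, indicating that outputs diverge further from the reference at high temperatures.
3. **Scenario Comparison**: The listed values give BLEU Score only for the "Scenario" section and Edit Distance only for "Scenario OOD CMP", so they do not support a like-for-like comparison of baseline performance between the two sections. What they do show is that both metrics shift over the same interval, between temperatures 1.0 and 5.0.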
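One plausible reading of these observations, assuming "temperature" here means the sampling temperature applied to the model's logits (the figure does not state this), is that high temperatures flatten the next-token distribution until sampling becomes close to uniform. The sketch below illustrates that effect with temperature-scaled softmax. The logits are made up for illustration and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical next-token logits, not from the figure
for t in [1e-05, 0.01, 0.1, 1.0, 5.0, 10.0]:
    probs = softmax_with_temperature(logits, t)
    print(f"T={t:<6} top-prob={max(probs):.3f} entropy={entropy(probs):.3f}")
```

At very low temperatures the distribution is effectively one-hot (greedy decoding), while at 5.0 and 10.0 the entropy approaches its maximum of `math.log(4)` for four tokens. This toy example only shows the direction of the effect. It does not reproduce the plateau in the figure, which depends on the actual model's logits.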
### Interpretation
- **Model Robustness**: The data point to a critical temperature threshold between 1.0 and 5.0, beyond which both metrics degrade sharply. Because each section reports a different metric, the values alone do not establish which scenario is more robust to out-of-distribution (OOD) conditions.
- **Trade-off Analysis**: The listed values show no trade-off between the two metrics. Lower temperatures (1e-05 to 1.0) give the highest BLEU Score and the lowest Edit Distance, while higher temperatures (5.0–10.0) degrade both, which is consistent with noise amplification in the output.
- **Practical Implications**: The sharp transition between 1.0 and 5.0 underscores the need for temperature calibration in deployment environments. Within the tested range, temperatures at or below 1.0 preserve both n-gram overlap with the reference (BLEU) and closeness to the reference (Edit Distance).
### Spatial Grounding & Validation
- **Legend Alignment**: Red in the BLEU Score legend matches high values (0.687), while blue matches low values (0.002). Similarly, red in the Edit Distance legend aligns with high values (legend maximum 0.846; the highest value listed above is 0.830), and blue aligns with low values (0.133).
- **Data Consistency**: All numerical values in the heatmap correspond to the color gradient, with no mismatches observed.