## Heatmap Comparison: R1-Llama vs. R1-Qwen Performance (Pass@1)
### Overview
The image displays two side-by-side heatmaps comparing the performance of two models, "R1-Llama" and "R1-Qwen," across different configurations. Performance is measured by the "Pass@1" metric, visualized through a color gradient. The analysis explores how this metric changes with variations in "Local Window Size" and "Ratio."
### Components/Axes
* **Titles:** Two main titles are positioned at the top: "R1-Llama" (left heatmap) and "R1-Qwen" (right heatmap).
* **Y-Axis (Left):** Labeled "Local Window Size." It has four discrete, categorical values listed from top to bottom: 3000, 2000, 1000, 500.
* **X-Axis (Bottom):** Labeled "Ratio." It has five discrete, categorical values listed from left to right: 0.1, 0.2, 0.3, 0.4, 0.5.
* **Color Scale/Legend (Right):** A vertical color bar labeled "Pass@1" on its right side. The scale ranges from approximately 50 (light yellow) to 56 (dark blue). Tick marks are present at 50, 52, 54, and 56.
* **Data Grids:** Each heatmap is a 4-row by 5-column grid. Each cell contains a numerical value representing the Pass@1 score for a specific combination of Local Window Size and Ratio.
### Detailed Analysis
**R1-Llama Heatmap (Left):**
* **Row 1 (Local Window Size 3000):** Values from left to right (Ratio 0.1 to 0.5): 49.1, 50.1, 50.6, 50.7, 51.4. The color transitions from light yellow to light green.
* **Row 2 (Local Window Size 2000):** Values: 49.5, 51.7, 52.8, 52.5, 50.9. Colors range from light yellow to teal, with the highest value (52.8) at Ratio 0.3.
* **Row 3 (Local Window Size 1000):** Values: 49.9, 52.7, 51.0, 51.9, 51.7. Colors are a mix of light yellow and teal.
* **Row 4 (Local Window Size 500):** Values: 49.8, 52.1, 50.7, 50.8, 51.7. Colors are similar to Row 3.
**R1-Qwen Heatmap (Right):**
* **Row 1 (Local Window Size 3000):** Values: 53.9, 53.9, 53.2, 54.4, 53.8. Colors are shades of medium blue.
* **Row 2 (Local Window Size 2000):** Values: 52.4, 51.9, 54.6, 56.3, 53.7. This row contains the highest value in the entire chart (56.3 at Ratio 0.4), shown in dark blue.
* **Row 3 (Local Window Size 1000):** Values: 52.2, 54.4, 53.8, 53.3, 53.0. Colors are shades of blue.
* **Row 4 (Local Window Size 500):** Values: 51.5, 51.8, 52.0, 54.3, 54.6. Colors range from light blue to medium blue.
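The two grids above can be transcribed into a small lookup structure for further checks. The following is a minimal Python sketch: the names `WINDOW_SIZES`, `RATIOS`, `GRIDS`, and `best_cell` are illustrative assumptions rather than anything shown in the figure, and the numbers are copied from the cell values listed above.

```python
# Values transcribed from the cell labels described above.
# All names here are illustrative assumptions, not part of the figure.
WINDOW_SIZES = [3000, 2000, 1000, 500]  # rows, top to bottom
RATIOS = [0.1, 0.2, 0.3, 0.4, 0.5]      # columns, left to right

GRIDS = {
    "R1-Llama": [
        [49.1, 50.1, 50.6, 50.7, 51.4],
        [49.5, 51.7, 52.8, 52.5, 50.9],
        [49.9, 52.7, 51.0, 51.9, 51.7],
        [49.8, 52.1, 50.7, 50.8, 51.7],
    ],
    "R1-Qwen": [
        [53.9, 53.9, 53.2, 54.4, 53.8],
        [52.4, 51.9, 54.6, 56.3, 53.7],
        [52.2, 54.4, 53.8, 53.3, 53.0],
        [51.5, 51.8, 52.0, 54.3, 54.6],
    ],
}


def best_cell(grid):
    """Return (score, window_size, ratio) for the highest-scoring cell."""
    return max(
        (score, WINDOW_SIZES[r], RATIOS[c])
        for r, row in enumerate(grid)
        for c, score in enumerate(row)
    )


for model, grid in GRIDS.items():
    score, size, ratio = best_cell(grid)
    print(f"{model}: best Pass@1 {score} at window size {size}, ratio {ratio}")
```

Keeping the rows in the same top-to-bottom order as the Y-axis makes it easy to compare the structure against the figure by eye.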
### Key Observations
1. **Overall Performance Gap:** R1-Qwen scores higher than R1-Llama in 19 of the 20 tested configurations. The single exception is Local Window Size 500 at Ratio 0.2, where R1-Llama reaches 52.1 and R1-Qwen 51.8. Most R1-Qwen cells are blue (16 of 20 scores are above 52), while R1-Llama cells range from light yellow to teal (all scores are below 53).
2. **Peak Performance:** The absolute highest Pass@1 score (56.3) is achieved by R1-Qwen with a Local Window Size of 2000 and a Ratio of 0.4.
3. **Sensitivity to Parameters:**
    * For **R1-Llama**, performance shows no consistent trend with Window Size, and the highest scores are scattered (e.g., 52.8 at Size 2000/Ratio 0.3, 52.7 at Size 1000/Ratio 0.2). One pattern does hold: Ratio 0.1 gives the lowest score in every row (49.1 to 49.9).
    * For **R1-Qwen**, scores tend to be higher at the upper Ratios (0.3-0.5) than at the lowest Ratio (0.1), although the Size 3000 row is an exception, with 53.9 at Ratio 0.1. The Size 2000/Ratio 0.4 cell (56.3) is a clear peak, 1.7 points above the next-highest R1-Qwen value (54.6).
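4. **Stability:** At Ratios 0.4 and 0.5, R1-Qwen's scores stay at or above 53.0 for every Window Size. This reflects a high floor rather than low variance: at Ratio 0.4 the spread across Window Sizes is 3.0 points (53.3 to 56.3), wider than the spread in any R1-Llama column (at most 2.6 points).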
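The observations above can be checked directly against the transcribed cell values. The sketch below is one way to do so using only the Python standard library. The helper names `cells_where_first_leads` and `column_stats` are hypothetical, and the grids are copied from the Detailed Analysis section.

```python
from statistics import mean

# Cell values copied from the Detailed Analysis section.
# Helper and variable names are illustrative assumptions.
WINDOW_SIZES = [3000, 2000, 1000, 500]  # rows, top to bottom
RATIOS = [0.1, 0.2, 0.3, 0.4, 0.5]      # columns, left to right

LLAMA = [
    [49.1, 50.1, 50.6, 50.7, 51.4],
    [49.5, 51.7, 52.8, 52.5, 50.9],
    [49.9, 52.7, 51.0, 51.9, 51.7],
    [49.8, 52.1, 50.7, 50.8, 51.7],
]
QWEN = [
    [53.9, 53.9, 53.2, 54.4, 53.8],
    [52.4, 51.9, 54.6, 56.3, 53.7],
    [52.2, 54.4, 53.8, 53.3, 53.0],
    [51.5, 51.8, 52.0, 54.3, 54.6],
]


def cells_where_first_leads(a, b):
    """List (window_size, ratio) cells where grid a scores above grid b."""
    return [
        (WINDOW_SIZES[r], RATIOS[c])
        for r in range(len(a))
        for c in range(len(a[r]))
        if a[r][c] > b[r][c]
    ]


def column_stats(grid):
    """Per-Ratio (mean, spread) across Window Sizes; spread is max - min."""
    return [
        (mean(col), round(max(col) - min(col), 1))
        for col in zip(*grid)  # transpose rows into Ratio columns
    ]


print("Cells where R1-Llama leads:", cells_where_first_leads(LLAMA, QWEN))
for ratio, (l_mean, l_spread), (q_mean, q_spread) in zip(
    RATIOS, column_stats(LLAMA), column_stats(QWEN)
):
    print(
        f"Ratio {ratio}: R1-Llama mean {l_mean:.2f} (spread {l_spread}), "
        f"R1-Qwen mean {q_mean:.2f} (spread {q_spread})"
    )
```

The spread column gives a simple per-Ratio stability measure. With only four Window Sizes per column, it is a rough indicator rather than a variance estimate.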
### Interpretation
This heatmap comparison suggests that, for the evaluated task, R1-Qwen reaches higher Pass@1 than R1-Llama in nearly every tested configuration. The figure alone does not show whether the gap comes from model architecture, training methodology, or another factor. The choice of settings matters for both models: scores span 3.7 points for R1-Llama (49.1 to 52.8) and 4.8 points for R1-Qwen (51.5 to 56.3), with R1-Qwen's best result at a specific "sweet spot" (Size 2000, Ratio 0.4).
The lack of a simple monotonic trend in either model suggests that Local Window Size and Ratio interact. Increasing one parameter alone does not guarantee better performance; the best value of one depends on the other. For practical deployment, R1-Qwen is the preferable model on this metric, and its configuration should be tuned carefully, with Size 2000/Ratio 0.4 a strong candidate for the best setting.