## Heatmap: Baseline - Long-to-Short - Qwen-2.5 1.5B
### Overview
This image is a heatmap visualizing the accuracy (in percentage) of a model named "Qwen-2.5 1.5B" on a "Long-to-Short" task. The performance is broken down by two categorical variables: "Type" (y-axis) and "Length" (x-axis). The color intensity represents accuracy, with a scale from 0% (lightest) to 100% (darkest green).
### Components/Axes
* **Title:** "Baseline - Long-to-Short - Qwen-2.5 1.5B" (centered at the top).
* **Y-Axis (Vertical):** Labeled "Type". It contains 7 discrete categories, numbered 1 through 7 from top to bottom.
* **X-Axis (Horizontal):** Labeled "Length". It contains discrete numerical markers: 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11. Note that length 6 is absent from the axis.
* **Legend/Color Bar:** Located on the right side of the chart. It is a vertical gradient bar labeled "Accuracy (%)". The scale has tick marks at 0, 20, 40, 60, 80, and 100. The color transitions from a very light, almost white green (0%) to a deep, dark forest green (100%).
* **Data Cells:** The main chart area is a grid where each cell corresponds to a specific (Type, Length) pair. The cell's background color corresponds to the accuracy value, which is also printed as a number within the cell. Not all (Type, Length) combinations are present; the data is sparse.
### Detailed Analysis
The following table reconstructs the data from the heatmap. Empty cells indicate no data point for that (Type, Length) combination.
| Type | Length 0 | Length 1 | Length 2 | Length 3 | Length 4 | Length 5 | Length 7 | Length 8 | Length 9 | Length 10 | Length 11 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **1** | 0.0 | 0.0 | 18.7 | 28.3 | 44.7 | | | | | | |
| **2** | | 69.0 | 88.7 | 95.7 | 90.3 | 86.0 | | | | | |
| **3** | 0.0 | 53.7 | 75.0 | 81.7 | 73.7 | | | | | | |
| **4** | | 47.7 | 59.7 | 68.7 | 67.7 | 65.7 | | | | | |
| **5** | | | | | | | 46.0 | 50.7 | 55.3 | 63.0 | 60.7 |
| **6** | 0.3 | 78.7 | 97.0 | 96.3 | 96.3 | | | | | | |
| **7** | 0.0 | 18.7 | 53.7 | 73.3 | 78.7 | | | | | | |
**Trend Verification by Type:**
* **Type 1:** Accuracy starts at 0.0 for lengths 0-1, then shows a steady upward trend with increasing length (18.7 → 28.3 → 44.7).
* **Type 2:** Shows high accuracy overall. It increases sharply from length 1 (69.0) to a peak at length 3 (95.7), then slightly decreases at lengths 4 and 5.
* **Type 3:** Starts at 0.0 for length 0, jumps to 53.7 at length 1, peaks at length 3 (81.7), and then dips at length 4.
* **Type 4:** Shows moderate accuracy across lengths 1-5, rising from 47.7 at length 1 to a peak of 68.7 at length 3, then easing slightly to 65.7 at length 5.
* **Type 5:** This type is isolated to longer lengths (7-11). It shows a gradual upward trend from length 7 (46.0) to a peak at length 10 (63.0), with a slight drop at length 11.
* **Type 6:** Exhibits very high accuracy. After a near-zero start at length 0 (0.3), it jumps to 78.7 at length 1 and maintains very high values (>96) for lengths 2-4.
* **Type 7:** Starts at 0.0 for length 0, then shows a consistent and strong upward trend with increasing length, reaching 78.7 at length 4.
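The per-type trends above can be checked mechanically against the reconstructed table. The following is a minimal Python sketch, assuming the transcribed values are correct. The names `ACCURACY`, `peak`, and `still_rising` are illustrative and do not come from any code associated with the figure.

```python
# Accuracy (%) per (Type, Length), transcribed from the reconstructed table.
# A missing key means the heatmap has no cell for that combination.
ACCURACY = {
    1: {0: 0.0, 1: 0.0, 2: 18.7, 3: 28.3, 4: 44.7},
    2: {1: 69.0, 2: 88.7, 3: 95.7, 4: 90.3, 5: 86.0},
    3: {0: 0.0, 1: 53.7, 2: 75.0, 3: 81.7, 4: 73.7},
    4: {1: 47.7, 2: 59.7, 3: 68.7, 4: 67.7, 5: 65.7},
    5: {7: 46.0, 8: 50.7, 9: 55.3, 10: 63.0, 11: 60.7},
    6: {0: 0.3, 1: 78.7, 2: 97.0, 3: 96.3, 4: 96.3},
    7: {0: 0.0, 1: 18.7, 2: 53.7, 3: 73.3, 4: 78.7},
}


def peak(row):
    """Return (length, accuracy) of the highest-accuracy cell in one row."""
    length = max(row, key=row.get)
    return length, row[length]


def still_rising(row):
    """True if the peak sits at the largest measured length for this row."""
    return peak(row)[0] == max(row)


# Peak cell for every type, keyed by type number.
PEAKS = {t: peak(row) for t, row in ACCURACY.items()}

if __name__ == "__main__":
    for t, (length, acc) in PEAKS.items():
        note = " (still rising)" if still_rising(ACCURACY[t]) else ""
        print(f"Type {t}: peak {acc} at length {length}{note}")
```

Under these values, only Types 1 and 7 peak at their largest measured length, so the chart cannot show where their accuracy would level off.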
### Key Observations
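1. **Performance at Length 0:** Every type with a length-0 cell scores at or near zero there: Types 1, 3, and 7 have 0.0% accuracy, and Type 6 has a negligible 0.3%. Type 1 also remains at 0.0% at length 1. Types 2, 4, and 5 have no length-0 cell. This suggests the model fails completely on the measured task types when the "Length" parameter is 0.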
2. **High-Performing Types:** Type 6 is the strongest performer, achieving near-perfect accuracy (97.0%) at length 2 and holding at 96.3% for lengths 3 and 4. Type 2 also shows excellent performance, peaking at 95.7% at length 3.
3. **Length Specialization:** Type 5 is unique, with data only for lengths 7 through 11. This may indicate a task category inherently associated with longer sequences.
4. **General Trend:** For every type, accuracy improves over the first few measured lengths. Types 2, 3, and 4 peak at length 3 and then decline; Type 6 peaks at length 2 and plateaus; Type 5 peaks at length 10; Types 1 and 7 are still rising at length 4, their largest measured length.
5. **Sparse Data Grid:** The heatmap is not a complete matrix. The absence of data for certain (Type, Length) pairs (e.g., Type 1 at length 5, Type 2 at length 0) is a significant feature of the dataset.
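The sparsity of the grid can be quantified from the axis labels and the cells present in each row. This is a small Python sketch under the assumption that the table transcription is complete. `LENGTHS`, `MEASURED`, and `missing_lengths` are hypothetical names introduced for illustration.

```python
# X-axis tick labels as they appear on the heatmap (length 6 is absent).
LENGTHS = [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11]

# Lengths that have a data cell for each type, read off the table rows.
MEASURED = {
    1: range(0, 5),   # lengths 0-4
    2: range(1, 6),   # lengths 1-5
    3: range(0, 5),
    4: range(1, 6),
    5: range(7, 12),  # lengths 7-11
    6: range(0, 5),
    7: range(0, 5),
}


def missing_lengths(type_id):
    """Axis lengths that have no cell for the given type."""
    return [n for n in LENGTHS if n not in MEASURED[type_id]]


# Number of populated cells versus the size of the full Type x Length grid.
FILLED = sum(len(r) for r in MEASURED.values())
TOTAL = len(MEASURED) * len(LENGTHS)

if __name__ == "__main__":
    print(f"{FILLED} of {TOTAL} cells populated")
    for t in MEASURED:
        print(f"Type {t}: missing lengths {missing_lengths(t)}")
```

With these assumptions, 35 of the 77 possible (Type, Length) cells are populated, so fewer than half of the grid positions carry data.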
### Interpretation
This heatmap provides a diagnostic view of the Qwen-2.5 1.5B model's capabilities on a "Long-to-Short" task, which likely involves condensing or summarizing information. The "Type" axis probably represents different categories or formats of this task (e.g., summarizing a paragraph vs. extracting a key phrase), while "Length" could refer to the input length, output length, or a complexity parameter.
The data suggests the model's performance depends strongly on both the task type and the length parameter. The near-zero accuracy at Length 0 for every type measured there may point to a fundamental limitation or an edge case in how the model, or the evaluation, handles those scenarios; the chart alone cannot distinguish between these. The strong performance of Types 2 and 6 identifies them as areas of relative strength. The isolated data for Type 5 hints at a specialized subset of the task.
Overall, the chart reveals that the model is not uniformly proficient. Its accuracy is contingent on the specific combination of task type and length, with clear patterns of strength (high accuracy at moderate lengths for certain types) and weakness (failure at minimal lengths). This information would be crucial for developers to understand the model's boundaries and guide further fine-tuning or evaluation.