## Heatmap: Baseline - Short-to-Long - Qwen-2.5 3B
### Overview
This image is a heatmap of the accuracy (in percent) of the model "Qwen-2.5 3B" on a "Short-to-Long" baseline task. Accuracy is plotted against two variables: "Type" (y-axis, categories 1 through 7) and "Length" (x-axis, integer values from 5 to 19). Color intensity encodes accuracy, from 0% (lightest) to 100% (darkest red). The grid is sparse: each "Type" row contains data for only five consecutive "Length" values, and the ranges of different Types partly overlap (for example, Types 3 and 5 both cover Lengths 15-19).
### Components/Axes
* **Title:** "Baseline - Short-to-Long - Qwen-2.5 3B" (centered at the top).
* **Y-Axis (Vertical):** Labeled "Type". Categories are numbered 1 through 7 from top to bottom.
* **X-Axis (Horizontal):** Labeled "Length". Tick marks and labels are provided for integer values from 5 to 19.
* **Color Bar/Legend:** Located on the right side of the chart. It is a vertical gradient bar labeled "Accuracy (%)". The scale runs from 0 at the bottom (lightest color) to 100 at the top (darkest red), with intermediate markers at 20, 40, 60, and 80.
* **Data Cells:** Each cell in the grid contains a numerical value representing the accuracy percentage for a specific (Type, Length) combination. The cell's background color corresponds to this value on the color bar.
### Detailed Analysis
The following table reconstructs the data from the heatmap. Each row corresponds to a "Type," and columns correspond to "Length." Empty cells indicate no data for that combination.
| Type | Length 5 | Length 6 | Length 7 | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13 | Length 14 | Length 15 | Length 16 | Length 17 | Length 18 | Length 19 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **1** | 70.0 | 58.7 | 60.0 | 51.7 | 37.0 | | | | | | | | | | |
| **2** | | 98.0 | 97.3 | 94.3 | 96.3 | 94.0 | | | | | | | | | |
| **3** | | | | | | | | | | | 85.0 | 76.7 | 80.7 | 80.7 | 73.7 |
| **4** | | | 82.7 | 78.7 | 77.0 | 66.0 | 56.3 | | | | | | | | |
| **5** | | | | | | | | | | | 72.3 | 74.3 | 61.3 | 59.3 | 55.7 |
| **6** | | | | | | | | | | 98.0 | 95.0 | 98.3 | 97.7 | 98.3 | |
| **7** | | | | | 81.3 | 75.7 | 73.3 | 67.3 | 64.7 | | | | | | |
**Trend Verification by Type:**
* **Type 1 (Lengths 5-9):** Accuracy falls sharply, from 70.0% to 37.0%, with one small uptick at Length 7 (58.7% to 60.0%).
* **Type 2 (Lengths 6-10):** Accuracy stays high and relatively flat, between 94.0% and 98.0%. It dips to 94.3% at Length 8, recovers to 96.3% at Length 9, and ends at its minimum of 94.0% at Length 10.
* **Type 3 (Lengths 15-19):** The trend is generally downward with a peak at the start. Accuracy begins at 85.0%, dips to 76.7%, recovers to 80.7%, and ends at 73.7%.
* **Type 4 (Lengths 7-11):** The line slopes downward. Accuracy declines steadily from 82.7% to 56.3%.
* **Type 5 (Lengths 15-19):** The trend peaks at the second length in the range. Accuracy starts at 72.3%, rises slightly to 74.3% at Length 16, drops to 61.3% at Length 17, and ends at 55.7%.
* **Type 6 (Lengths 14-18):** The data points form a very high, stable line. Accuracy is consistently excellent, ranging from 95.0% to 98.3%.
* **Type 7 (Lengths 9-13):** The line slopes downward. Accuracy decreases from 81.3% to 64.7%.
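The per-type trends above can be checked mechanically. The following is a minimal sketch, assuming the values transcribed in the table are accurate; the names `ACCURACY`, `summarize_trend`, and `trend_summary` are illustrative and do not come from the original figure.

```python
# Accuracy values (%) transcribed from the reconstructed table.
# Each entry maps Type -> (first Length, accuracies at consecutive Lengths).
ACCURACY = {
    1: (5, [70.0, 58.7, 60.0, 51.7, 37.0]),
    2: (6, [98.0, 97.3, 94.3, 96.3, 94.0]),
    3: (15, [85.0, 76.7, 80.7, 80.7, 73.7]),
    4: (7, [82.7, 78.7, 77.0, 66.0, 56.3]),
    5: (15, [72.3, 74.3, 61.3, 59.3, 55.7]),
    6: (14, [98.0, 95.0, 98.3, 97.7, 98.3]),
    7: (9, [81.3, 75.7, 73.3, 67.3, 64.7]),
}


def summarize_trend(values):
    """Return (net change from first to last value, True if the row never increases)."""
    net_change = round(values[-1] - values[0], 1)
    non_increasing = all(b <= a for a, b in zip(values, values[1:]))
    return net_change, non_increasing


trend_summary = {t: summarize_trend(vals) for t, (_, vals) in ACCURACY.items()}

for t, (net, non_increasing) in trend_summary.items():
    print(f"Type {t}: net change {net:+.1f} points, non-increasing: {non_increasing}")
```

Under this check only Types 4 and 7 are non-increasing at every step. Every other row contains at least one uptick, even where the overall direction is downward.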
### Key Observations
1. **Performance Stratification by Type:** Baseline performance differs starkly between Types. Types 2 and 6 achieve near-perfect accuracy (94.0% or higher) across their respective length ranges. In contrast, Types 1, 4, 5, and 7 lose between 16.6 and 33.0 percentage points from their first to their last length. Type 3 declines more moderately, from 85.0% to 73.7%.
2. **Length Sensitivity:** For most Types (1, 3, 4, 5, 7), accuracy generally decreases as "Length" increases, which suggests the task becomes harder for longer sequences. Types 2 and 6 are the exceptions: both stay at or above 94.0% across their ranges.
3. **Data Sparsity:** Each Type is evaluated only on a specific, contiguous block of Lengths (e.g., Type 1 on 5-9, Type 6 on 14-18). This suggests the "Types" may represent different task categories or difficulty levels that are only relevant or tested within certain length ranges.
4. **Color-Accuracy Correlation:** The visual trend matches the numerical data. The darkest red cells (highest accuracy) are concentrated in the rows for Type 2 and Type 6. The lightest cells (lowest accuracy) appear at the end of the length range for Type 1 (37.0%).
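The stratification and length-sensitivity observations can be quantified with a short script. This is a sketch under stated assumptions: the 94.0% cutoff for "near-perfect" is chosen here to match the observation above and is not defined by the figure, the least-squares slope is one of several reasonable sensitivity measures, and all names (`ACCURACY`, `least_squares_slope`, `HIGH_ACCURACY_THRESHOLD`, and so on) are illustrative.

```python
# Accuracy values (%) transcribed from the reconstructed table.
# Each entry maps Type -> (first Length, accuracies at consecutive Lengths).
ACCURACY = {
    1: (5, [70.0, 58.7, 60.0, 51.7, 37.0]),
    2: (6, [98.0, 97.3, 94.3, 96.3, 94.0]),
    3: (15, [85.0, 76.7, 80.7, 80.7, 73.7]),
    4: (7, [82.7, 78.7, 77.0, 66.0, 56.3]),
    5: (15, [72.3, 74.3, 61.3, 59.3, 55.7]),
    6: (14, [98.0, 95.0, 98.3, 97.7, 98.3]),
    7: (9, [81.3, 75.7, 73.3, 67.3, 64.7]),
}

# Assumed cutoff for "near-perfect" accuracy; not specified by the figure.
HIGH_ACCURACY_THRESHOLD = 94.0


def least_squares_slope(start_length, values):
    """Ordinary least-squares slope: accuracy points gained per unit of Length."""
    lengths = [start_length + i for i in range(len(values))]
    mean_x = sum(lengths) / len(lengths)
    mean_y = sum(values) / len(values)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, values))
    variance = sum((x - mean_x) ** 2 for x in lengths)
    return covariance / variance


slopes = {t: least_squares_slope(start, vals) for t, (start, vals) in ACCURACY.items()}

# Types whose worst cell still meets the threshold.
robust_types = sorted(
    t for t, (_, vals) in ACCURACY.items() if min(vals) >= HIGH_ACCURACY_THRESHOLD
)

# Lowest-accuracy cell in the grid, as (accuracy, Type, Length).
lowest_cell = min(
    (value, t, start + i)
    for t, (start, vals) in ACCURACY.items()
    for i, value in enumerate(vals)
)

for t in sorted(slopes, key=slopes.get):
    print(f"Type {t}: {slopes[t]:+.2f} points per unit length")
print("Robust types:", robust_types)
print("Lowest cell (accuracy, Type, Length):", lowest_cell)
```

Sorting by slope puts Type 1 first (steepest decline) and Type 6 last (roughly flat). With only five points per row and no error bars in the figure, the slopes are descriptive summaries, not statistically tested estimates.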
### Interpretation
This heatmap provides a diagnostic view of the Qwen-2.5 3B model's capabilities on a specific "Short-to-Long" evaluation. The data suggests that the model's performance is highly dependent on both the *type* of task and the *length* of the input sequence.
* **Task-Specific Proficiency:** The model exhibits exceptional, robust performance on the tasks categorized as Type 2 and Type 6, regardless of length within the tested range. This indicates these task types are well within the model's capabilities.
* **Length Generalization Challenge:** For several other task types (1, 4, 7), accuracy falls substantially as sequences get longer. This is a commonly reported difficulty for language models. The heatmap alone does not identify the cause, though attention mechanisms and context window utilization are often suggested. The steep drop in Type 1 (from 70.0% to 37.0%) is particularly notable.
* **Non-Linear Difficulty:** Performance on Type 3 and Type 5 does not decline steadily. Type 3 drops to 76.7% at Length 16 and then recovers to 80.7%, while Type 5 rises slightly to 74.3% at Length 16 before falling. These local fluctuations may reflect length ranges where the model handles those task types better, task difficulty that varies non-monotonically with length, or sampling noise, since the heatmap reports no sample sizes or error bars.
* **Implication for "Short-to-Long" Generalization:** The overall pattern indicates that while the model can handle some tasks ("Types") with excellent generalization from short to long sequences, it struggles significantly with others. This highlights that "length generalization" is not a monolithic capability but is deeply intertwined with the nature of the underlying task. The evaluation successfully isolates which task categories are robust and which are brittle as sequence length scales.