## Heatmap: Few-shot - Core Generalization - GPT-4o
### Overview
This image is a heatmap visualizing the accuracy (in percentage) of the GPT-4o model on "Few-shot - Core Generalization" tasks. The performance is broken down by two variables: the categorical "Type" (vertical axis, rows 1-7) and the ordered, discrete "Length" (horizontal axis, columns 0-19). The color intensity represents accuracy, with a scale from 0% (lightest blue/white) to 100% (darkest blue). Each populated cell contains a numerical value; some cells are empty, indicating no data for that specific Type-Length combination.
### Components/Axes
* **Title:** "Few-shot - Core Generalization - GPT-4o" (Top Center).
* **Vertical Axis (Y-axis):** Labeled "Type". Categories are numbered 1 through 7 from top to bottom.
* **Horizontal Axis (X-axis):** Labeled "Length". Categories are numbered 0 through 19 from left to right.
* **Color Bar/Legend:** Located on the right side. It is a vertical gradient bar labeled "Accuracy (%)". The scale runs from 0 at the bottom to 100 at the top, with tick marks at 0, 20, 40, 60, 80, and 100. Darker blue corresponds to higher accuracy.
* **Data Cells:** Each cell at the intersection of a Type and Length contains a numerical value representing the accuracy percentage. The background color of the cell corresponds to this value per the color bar.
### Detailed Analysis
The following table reconstructs the data from the heatmap. An empty cell indicates no data was recorded for that Type-Length pair.
| Type | Length 0 | Length 1 | Length 2 | Length 3 | Length 4 | Length 5 | Length 6 | Length 7 | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13 | Length 14 | Length 15 | Length 16 | Length 17 | Length 18 | Length 19 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **1** | 64.0 | 33.0 | 27.0 | 21.0 | 22.0 | 23.0 | 16.0 | 28.0 | 27.0 | 30.0 | | | | | | | | | | |
| **2** | | 73.0 | 89.0 | 91.0 | 86.0 | 84.0 | 81.0 | 78.0 | 74.0 | 63.0 | 66.0 | | | | | | | | | |
| **3** | 42.0 | 53.0 | 46.0 | 44.0 | 35.0 | 18.0 | 16.0 | 25.0 | 20.0 | 18.0 | 13.0 | 17.0 | 17.0 | 18.0 | 13.0 | 17.0 | 11.0 | 14.0 | 10.0 | 11.0 |
| **4** | | 68.0 | 67.0 | 64.0 | 47.0 | 45.0 | 29.0 | 30.0 | 37.0 | 40.0 | 41.0 | 35.0 | | | | | | | | |
| **5** | | | | | | | | 11.0 | 25.0 | 21.0 | 18.0 | 17.0 | 25.0 | 20.0 | 25.0 | 15.0 | 24.0 | 26.0 | 20.0 | 27.0 |
| **6** | 89.0 | 75.0 | 66.0 | 54.0 | 51.0 | 48.0 | 44.0 | 49.0 | 42.0 | 52.0 | 46.0 | 51.0 | 40.0 | 44.0 | 32.0 | 37.0 | 38.0 | 32.0 | 39.0 | |
| **7** | 91.0 | 76.0 | 63.0 | 53.0 | 41.0 | 36.0 | 34.0 | 33.0 | 39.0 | 26.0 | 33.0 | 34.0 | 32.0 | 26.0 | | | | | | |
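
For further analysis, the table can be held in a compact form. The following is a minimal sketch, assuming the populated cells in each row are contiguous (as they are in the table above); the names `HEATMAP`, `to_row`, `grid`, `coverage`, and `populated` are illustrative and do not come from the original figure.

```python
# Values transcribed from the table above.
# Each entry maps Type -> (first measured Length, accuracies for consecutive Lengths).
HEATMAP = {
    1: (0, [64, 33, 27, 21, 22, 23, 16, 28, 27, 30]),
    2: (1, [73, 89, 91, 86, 84, 81, 78, 74, 63, 66]),
    3: (0, [42, 53, 46, 44, 35, 18, 16, 25, 20, 18,
            13, 17, 17, 18, 13, 17, 11, 14, 10, 11]),
    4: (1, [68, 67, 64, 47, 45, 29, 30, 37, 40, 41, 35]),
    5: (7, [11, 25, 21, 18, 17, 25, 20, 25, 15, 24, 26, 20, 27]),
    6: (0, [89, 75, 66, 54, 51, 48, 44, 49, 42, 52,
            46, 51, 40, 44, 32, 37, 38, 32, 39]),
    7: (0, [91, 76, 63, 53, 41, 36, 34, 33, 39, 26, 33, 34, 32, 26]),
}
NUM_LENGTHS = 20  # Length columns 0 through 19


def to_row(start, values, width=NUM_LENGTHS):
    """Expand a compact entry into a full row, using None for empty cells."""
    row = [None] * width
    row[start:start + len(values)] = values
    return row


# Full 7 x 20 grid, with None wherever the heatmap cell is empty.
grid = {t: to_row(start, vals) for t, (start, vals) in HEATMAP.items()}

# (first Length, last Length) with data, per Type.
coverage = {t: (start, start + len(vals) - 1)
            for t, (start, vals) in HEATMAP.items()}

# Number of populated cells across the whole heatmap.
populated = sum(len(vals) for _, vals in HEATMAP.values())

for t, (lo, hi) in coverage.items():
    print(f"Type {t}: Lengths {lo}-{hi}")
print(f"{populated} of {len(HEATMAP) * NUM_LENGTHS} cells populated")
```

Storing a start offset plus a list keeps the transcription short, while `grid` restores the rectangular layout whenever empty cells need to be explicit.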
**Trend Verification by Type:**
* **Type 1:** Starts at a moderate 64.0% (Length 0), drops sharply to 33.0% at Length 1, then fluctuates between 16.0% and 30.0%, ending at 30.0% (Length 9). Most of the decline occurs in the first step.
* **Type 2:** Begins at 73.0% (Length 1), peaks at 91.0% (Length 3), then declines to a low of 63.0% (Length 9) before ticking up to 66.0% (Length 10).
* **Type 3:** Starts at 42.0% (Length 0), rises briefly to 53.0% (Length 1), then falls to 18.0% by Length 5 and stays between 10.0% and 25.0% thereafter, reaching a low of 10.0% (Length 18) and ending at 11.0% (Length 19).
* **Type 4:** Starts at 68.0% (Length 1), falls to a low of 29.0% (Length 6), recovers to 41.0% (Length 10), and ends at 35.0% (Length 11).
* **Type 5:** Data begins at Length 7 (11.0%). The values are relatively flat and low, fluctuating between 11.0% and 27.0% with no strong directional trend.
* **Type 6:** Starts very high at 89.0% (Length 0) and declines with some volatility to a low of 32.0% (Lengths 14 and 17), ending at 39.0% (Length 18).
* **Type 7:** Starts at the highest observed value of 91.0% (Length 0), falls steeply to 33.0% by Length 7, then fluctuates between 26.0% and 39.0%, ending at 26.0% (Length 13).
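
The per-type readings above can be checked mechanically. This sketch recomputes the first, highest, lowest, and last value for each Type from the transcribed table; the helper `summarize` and the data layout are assumptions made for illustration. Where a value occurs more than once, the earliest Length is reported.

```python
# Values transcribed from the table: Type -> (first measured Length, accuracies).
HEATMAP = {
    1: (0, [64, 33, 27, 21, 22, 23, 16, 28, 27, 30]),
    2: (1, [73, 89, 91, 86, 84, 81, 78, 74, 63, 66]),
    3: (0, [42, 53, 46, 44, 35, 18, 16, 25, 20, 18,
            13, 17, 17, 18, 13, 17, 11, 14, 10, 11]),
    4: (1, [68, 67, 64, 47, 45, 29, 30, 37, 40, 41, 35]),
    5: (7, [11, 25, 21, 18, 17, 25, 20, 25, 15, 24, 26, 20, 27]),
    6: (0, [89, 75, 66, 54, 51, 48, 44, 49, 42, 52,
            46, 51, 40, 44, 32, 37, 38, 32, 39]),
    7: (0, [91, 76, 63, 53, 41, 36, 34, 33, 39, 26, 33, 34, 32, 26]),
}


def summarize(start, values):
    """Return (Length, accuracy) pairs for the first, peak, low, and last cells."""
    pairs = list(zip(range(start, start + len(values)), values))
    return {
        "first": pairs[0],
        "peak": max(pairs, key=lambda p: p[1]),  # earliest Length wins ties
        "low": min(pairs, key=lambda p: p[1]),   # earliest Length wins ties
        "last": pairs[-1],
    }


summary = {t: summarize(start, vals) for t, (start, vals) in HEATMAP.items()}

for t, s in summary.items():
    print(f"Type {t}: " + ", ".join(
        f"{name}={acc}% @ Length {length}" for name, (length, acc) in s.items()))
```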
### Key Observations
1. **Performance Decay with Length:** For most Types (1, 2, 3, 4, 6, 7), there is a clear negative correlation between Length and Accuracy. As the Length increases, the model's accuracy generally decreases.
2. **High Initial Performance:** Types 2, 6, and 7 reach their highest accuracy (89.0% to 91.0%) at the shortest measured lengths: Type 2 at Lengths 2-3, and Types 6 and 7 at Length 0.
3. **Low-Performance Cluster:** Type 5 (all measured lengths) and Type 3 from Length 5 onward consistently show low accuracy, with every value at or below 27.0%.
4. **Data Sparsity:** The heatmap is not fully populated. Type 1 has no data beyond Length 9. Type 2 has no data at Length 0 or beyond Length 10. Type 4 has no data at Length 0 or beyond Length 11. Type 5 has no data before Length 7. Type 6 has no data at Length 19. Type 7 has no data beyond Length 13.
5. **Peak Accuracy:** The single highest accuracy value is 91.0%, achieved by both Type 2 (at Length 3) and Type 7 (at Length 0).
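
The negative relationship in observation 1 can be quantified with a Pearson correlation between Length and accuracy, computed separately for each Type. This is only a sketch: the choice of Pearson's coefficient and the helper `pearson` are assumptions for illustration, not part of the original figure, and with 10 to 20 points per Type the coefficients are rough indicators rather than firm estimates.

```python
from math import sqrt

# Values transcribed from the table: Type -> (first measured Length, accuracies).
HEATMAP = {
    1: (0, [64, 33, 27, 21, 22, 23, 16, 28, 27, 30]),
    2: (1, [73, 89, 91, 86, 84, 81, 78, 74, 63, 66]),
    3: (0, [42, 53, 46, 44, 35, 18, 16, 25, 20, 18,
            13, 17, 17, 18, 13, 17, 11, 14, 10, 11]),
    4: (1, [68, 67, 64, 47, 45, 29, 30, 37, 40, 41, 35]),
    5: (7, [11, 25, 21, 18, 17, 25, 20, 25, 15, 24, 26, 20, 27]),
    6: (0, [89, 75, 66, 54, 51, 48, 44, 49, 42, 52,
            46, 51, 40, 44, 32, 37, 38, 32, 39]),
    7: (0, [91, 76, 63, 53, 41, 36, 34, 33, 39, 26, 33, 34, 32, 26]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Correlation between Length and accuracy over each Type's measured cells only.
r = {
    t: pearson(list(range(start, start + len(vals))), vals)
    for t, (start, vals) in HEATMAP.items()
}

for t, value in r.items():
    print(f"Type {t}: r = {value:+.2f}")
```

Under this sketch, the six Types named in observation 1 have negative coefficients, while Type 5 comes out positive, which is consistent with its exclusion from that list.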
### Interpretation
This heatmap demonstrates that GPT-4o's ability to generalize in few-shot scenarios is highly dependent on both the specific "Type" of task and the "Length" parameter (which could represent sequence length, number of examples, or another complexity metric).
* **Core Finding:** The dominant trend is that performance degrades as Length increases. This suggests the model's core generalization capability is sensitive to scale or complexity; it performs best on shorter, presumably simpler, instances of a task type.
* **Task-Type Variability:** The significant difference in starting accuracy and decay rates between Types (e.g., Type 7 starting at 91% vs. Type 3 starting at 42%) indicates that some core generalization tasks are inherently easier for the model than others.
* **Practical Implication:** For applications relying on few-shot generalization, this data suggests that keeping the "Length" parameter low is crucial for maintaining high accuracy. The model may require different prompting strategies or fine-tuning for task types that perform poorly even at short lengths (like Type 3, which never exceeds 53.0%) or throughout their measured range (like Type 5, which has no data below Length 7).
* **Anomaly:** Type 5's data starts only at Length 7 and shows a flat, low-accuracy trend. This could indicate a different experimental setup for this type, or that Type 5 instances only exist at Lengths of 7 or more. Because empty cells indicate missing data rather than failure, the heatmap says nothing about how the model would perform on shorter Type 5 inputs.