## Heatmap: Zero-shot Core Generalization Performance of GPT-4o
### Overview
This image is a heatmap visualizing the zero-shot accuracy (in percent) of a model identified as "GPT-4o" on a "Core Generalization" task. Performance is broken down along two dimensions: "Type" (vertical axis, categories 1-7) and "Length" (horizontal axis, values 0-19). Color intensity encodes accuracy on a scale from 0% (lightest) to 100% (darkest blue).
### Components/Axes
* **Title:** "Zero-shot - Core Generalization - GPT-4o"
* **Vertical Axis (Y-axis):** Labeled "Type". Contains 7 discrete categories, numbered 1 through 7 from top to bottom.
* **Horizontal Axis (X-axis):** Labeled "Length". Contains 20 discrete values, numbered 0 through 19 from left to right.
* **Color Bar/Legend:** Located on the far right. It is a vertical gradient bar labeled "Accuracy (%)". The scale runs from 0 at the bottom to 100 at the top, with tick marks at 0, 20, 40, 60, 80, and 100. The color transitions from very light blue/white (0%) to dark blue (100%).
* **Data Grid:** The main body of the chart is a grid of cells. Each cell corresponds to a specific (Type, Length) pair. The cell's background color and the numerical value printed within it indicate the accuracy percentage. Some cells are blank, indicating no data or a value of 0 that is not displayed numerically.
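A chart with this layout could be reproduced along the following lines with matplotlib (an illustrative sketch only: the accuracy grid here is dummy data, and the `Blues` colormap and styling details are assumptions, not confirmed properties of the original figure):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Dummy 7x20 accuracy grid; np.nan marks blank (unmeasured) cells.
rng = np.random.default_rng(0)
acc = rng.uniform(0, 100, size=(7, 20))
acc[0, 10:] = np.nan  # e.g. Type 1 only measured up to Length 9

fig, ax = plt.subplots(figsize=(12, 4))
im = ax.imshow(acc, cmap="Blues", vmin=0, vmax=100)  # light-to-dark blue, 0-100%
ax.set_title("Zero-shot - Core Generalization - GPT-4o")
ax.set_xlabel("Length")
ax.set_ylabel("Type")
ax.set_xticks(range(20))
ax.set_yticks(range(7), labels=[str(t) for t in range(1, 8)])  # Types 1-7
# Print each measured value inside its cell, as in the figure.
for (i, j), v in np.ndenumerate(acc):
    if not np.isnan(v):
        ax.text(j, i, f"{v:.1f}", ha="center", va="center", fontsize=6)
fig.colorbar(im, ax=ax, label="Accuracy (%)")
fig.savefig("heatmap.png")
```

`imshow` leaves NaN cells unpainted by default, which matches the blank cells described above.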
### Detailed Analysis
The following table reconstructs the accuracy data for each Type across the available Lengths. Values are transcribed directly from the image. Blank cells are noted as "N/A".
| Type | Length 0 | Length 1 | Length 2 | Length 3 | Length 4 | Length 5 | Length 6 | Length 7 | Length 8 | Length 9 | Length 10 | Length 11 | Length 12 | Length 13 | Length 14 | Length 15 | Length 16 | Length 17 | Length 18 | Length 19 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **1** | 37.0 | 31.0 | 22.0 | 12.0 | 6.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| **2** | N/A | 49.0 | 66.0 | 67.0 | 45.0 | 62.0 | 41.0 | 44.0 | 48.0 | 37.0 | 45.0 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| **3** | 13.0 | 42.0 | 22.0 | 9.0 | 10.0 | 5.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| **4** | N/A | 62.0 | 65.0 | 45.0 | 26.0 | 24.0 | 19.0 | 14.0 | 17.0 | 13.0 | 9.0 | 9.0 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| **5** | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 4.0 |
| **6** | 22.0 | 59.0 | 35.0 | 24.0 | 15.0 | 20.0 | 17.0 | 5.0 | 8.0 | 15.0 | 8.0 | 14.0 | 6.0 | 6.0 | 11.0 | 8.0 | 5.0 | 7.0 | 2.0 | N/A |
| **7** | 39.0 | 36.0 | 26.0 | 26.0 | 17.0 | 18.0 | 4.0 | 11.0 | 10.0 | 7.0 | 2.0 | 2.0 | 5.0 | 2.0 | N/A | N/A | N/A | N/A | N/A | N/A |
### Key Observations
1. **Performance Decay with Length:** For most Types (especially 1, 4, 6, 7), accuracy shows a clear downward trend as "Length" increases. The highest values are typically found at the shortest lengths (0-3).
2. **Type-Specific Performance:**
* **Type 2** demonstrates the strongest and most consistent performance, maintaining accuracies between 37% and 67% across its measured lengths (1-10).
* **Type 5** shows near-total failure, with accuracies of 0% for almost all lengths (7-19), except for minor blips of 2.0% and 4.0% at lengths 17 and 19.
* **Type 3** is measured across the full length range (0-19) but with generally low accuracy, peaking at 42.0% at Length 1 and frequently dropping to 0-1%.
3. **Data Coverage:** The heatmap is not a complete rectangle: each Type has data only for a subset of Lengths. Types 3 (Lengths 0-19) and 6 (0-18) have the broadest coverage, while Types 2 (1-10), 4 (1-11), and 5 (7-19) cover narrower ranges.
4. **Peak Values:** The highest accuracy recorded is **67.0%** for **Type 2 at Length 3**. The second highest is **66.0%** for **Type 2 at Length 2**.
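The transcription and the observations above can be sanity-checked programmatically. A minimal NumPy sketch, with blank ("N/A") cells encoded as `np.nan`:

```python
import numpy as np

N = np.nan  # blank cell in the heatmap
# Accuracy (%) per (Type, Length), transcribed from the table above.
acc = np.array([
    [37, 31, 22, 12,  6,  3,  1,  2,  1,  1,  N,  N,  N,  N,  N,  N,  N,  N,  N,  N],  # Type 1
    [ N, 49, 66, 67, 45, 62, 41, 44, 48, 37, 45,  N,  N,  N,  N,  N,  N,  N,  N,  N],  # Type 2
    [13, 42, 22,  9, 10,  5,  3,  3,  3,  3,  0,  1,  1,  1,  1,  1,  1,  0,  1,  1],  # Type 3
    [ N, 62, 65, 45, 26, 24, 19, 14, 17, 13,  9,  9,  N,  N,  N,  N,  N,  N,  N,  N],  # Type 4
    [ N,  N,  N,  N,  N,  N,  N,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  2,  0,  4],  # Type 5
    [22, 59, 35, 24, 15, 20, 17,  5,  8, 15,  8, 14,  6,  6, 11,  8,  5,  7,  2,  N],  # Type 6
    [39, 36, 26, 26, 17, 18,  4, 11, 10,  7,  2,  2,  5,  2,  N,  N,  N,  N,  N,  N],  # Type 7
], dtype=float)

# Global peak, ignoring blank cells.
t, l = np.unravel_index(np.nanargmax(acc), acc.shape)
print(f"peak {acc[t, l]:.1f}% at Type {t + 1}, Length {l}")
# → peak 67.0% at Type 2, Length 3

# Per-type mean accuracy over measured lengths only.
for i, row_mean in enumerate(np.nanmean(acc, axis=1), start=1):
    print(f"Type {i}: mean {row_mean:.1f}%")
```

The per-type means confirm the qualitative picture: Type 2 averages around 50%, while Type 5 averages under 1%.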
### Interpretation
This heatmap provides a diagnostic view of GPT-4o's zero-shot generalization capabilities on a specific core task. The data suggests the following:
* **Task Difficulty Scales with Length:** The predominant trend of decreasing accuracy with increasing "Length" indicates that the core generalization task becomes significantly harder for the model as the sequence or problem length grows. This is a common challenge in language model evaluation.
* **Heterogeneous Task Types:** The "Type" axis likely represents different sub-tasks or problem formats within the core generalization benchmark. The stark performance differences between types (e.g., Type 2 vs. Type 5) reveal that the model's zero-shot capability is highly dependent on the specific structure or nature of the problem. Type 2 appears to be a format the model handles well, while Type 5 is almost completely intractable for it in a zero-shot setting.
* **Zero-Shot Limitations:** The overall low-to-moderate accuracy values (mostly below 50%) for longer lengths and certain types highlight the limitations of zero-shot prompting for complex generalization. The model struggles to infer the correct pattern or solution without examples, especially as problem complexity (length) increases.
* **Benchmark Design:** The structure of the heatmap, with its varying data ranges per type, suggests the benchmark itself may have different length distributions for different problem types, or that some types are only defined for certain lengths.
In summary, the image reveals that GPT-4o's zero-shot core generalization is highly uneven: it is moderately effective for short problems of certain types (notably Type 2) but degrades rapidly with length and fails almost completely on other problem types (notably Type 5). This points to specific areas where the model's reasoning or pattern-matching abilities are robust and where they are critically lacking.