## Heatmap: Performance Delta Across Models and Mathematical Domains
### Overview
The image is a heatmap titled "Performance Delta Across Models and Mathematical Domains." It visualizes the percentage change in performance (delta from a base model) for eight different models or puzzle types across seven distinct mathematical domains. The data is presented in a grid where each cell's color and numerical value represent the performance delta. A color scale on the right provides a key for interpreting the values.
### Components/Axes
* **Title:** "Performance Delta Across Models and Mathematical Domains" (centered at the top).
* **Y-Axis (Rows - Mathematical Domains):** Seven categories are listed vertically on the left side:
    1. Algebra and Number Theory
    2. Analysis and Differential Equations
    3. Applied and Computational Mathematics
    4. Arithmetic
    5. Foundations and Logic
    6. Geometry and Topology
    7. Probability, Statistics, and Discrete Mathematics
* **X-Axis (Columns - Models/Puzzles):** Eight categories are listed horizontally at the bottom, rotated at a 45-degree angle:
    1. Sudoku
    2. Nonogram
    3. Cryptarithm
    4. Magic Square
    5. Zebra puzzle
    6. Graph
    7. Knight & Knaves
    8. All-Game
* **Legend/Color Scale:** A vertical bar on the right side labeled "Delta from Base (%)". The scale ranges from approximately -5% (dark red) to +5% (dark green), with 0% represented by a pale yellow. Key markers are at -4, -2, 0, 2, and 4.
* **Data Cells:** Each cell in the 7x8 grid contains a percentage value (e.g., "-0.31%", "+3.77%") and is colored according to the legend.
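The color scale described above can be sketched as a linear interpolation between three anchor colors. This is only an illustration: the RGB anchors, the function name `delta_to_rgb`, and the clipping of values beyond the ±5% range are assumptions, not properties read from the image.

```python
# Illustrative anchor colors (assumed, not sampled from the image).
DARK_RED = (165, 0, 38)        # assumed color at -5%
PALE_YELLOW = (255, 255, 191)  # assumed color at 0%
DARK_GREEN = (0, 104, 55)      # assumed color at +5%


def delta_to_rgb(delta, limit=5.0):
    """Map a delta (%) to an RGB triple on a red-yellow-green diverging scale.

    Values outside [-limit, +limit] are clipped to the end colors
    (an assumption about how out-of-range cells are drawn).
    """
    clamped = max(-limit, min(limit, delta))
    t = abs(clamped) / limit  # 0.0 at the midpoint, 1.0 at either end
    end = DARK_GREEN if clamped > 0 else DARK_RED
    # Interpolate each channel from the midpoint color toward the end color.
    return tuple(round(a + (b - a) * t) for a, b in zip(PALE_YELLOW, end))
```

Under this clipping assumption, a cell at -5.88% would be drawn in the same color as one at -5%.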
### Detailed Analysis
The following table reconstructs the heatmap's data, with rows as Mathematical Domains and columns as Models/Puzzles. Values are the "Delta from Base (%)".
| Mathematical Domain / Model | Sudoku | Nonogram | Cryptarithm | Magic Square | Zebra puzzle | Graph | Knight & Knaves | All-Game |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Algebra and Number Theory** | -0.31% | -0.62% | +1.17% | +0.16% | -0.78% | +1.48% | +0.16% | +0.70% |
| **Analysis and Differential Equations** | -0.42% | +0.00% | +1.88% | +3.77% | +0.00% | +0.21% | +1.46% | +4.61% |
| **Applied and Computational Mathematics** | -2.63% | -2.10% | -3.68% | -5.26% | -3.16% | -2.10% | -1.58% | -1.58% |
| **Arithmetic** | -0.65% | -0.80% | -0.57% | +0.20% | -0.80% | -0.30% | -0.27% | -0.42% |
| **Foundations and Logic** | -5.88% | -5.88% | +0.00% | -5.88% | +0.00% | -5.88% | -5.88% | +0.00% |
| **Geometry and Topology** | +1.29% | +2.73% | +1.29% | +0.72% | +0.86% | +1.01% | +0.29% | -0.86% |
| **Probability, Statistics, and Discrete Mathematics** | +0.68% | +1.22% | +0.95% | -0.14% | +0.41% | -0.14% | -0.27% | -1.08% |
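To make the table easier to check, the values can be transcribed into a small data structure and summarized. The sketch below is one possible way to do this; the names `MODELS`, `DELTAS`, `summarize`, and `extreme_cells` are hypothetical, and the per-domain mean is an unweighted average over the eight models.

```python
# Deltas from base (%), transcribed from the table above.
MODELS = ["Sudoku", "Nonogram", "Cryptarithm", "Magic Square",
          "Zebra puzzle", "Graph", "Knight & Knaves", "All-Game"]

DELTAS = {
    "Algebra and Number Theory":
        [-0.31, -0.62, 1.17, 0.16, -0.78, 1.48, 0.16, 0.70],
    "Analysis and Differential Equations":
        [-0.42, 0.00, 1.88, 3.77, 0.00, 0.21, 1.46, 4.61],
    "Applied and Computational Mathematics":
        [-2.63, -2.10, -3.68, -5.26, -3.16, -2.10, -1.58, -1.58],
    "Arithmetic":
        [-0.65, -0.80, -0.57, 0.20, -0.80, -0.30, -0.27, -0.42],
    "Foundations and Logic":
        [-5.88, -5.88, 0.00, -5.88, 0.00, -5.88, -5.88, 0.00],
    "Geometry and Topology":
        [1.29, 2.73, 1.29, 0.72, 0.86, 1.01, 0.29, -0.86],
    "Probability, Statistics, and Discrete Mathematics":
        [0.68, 1.22, 0.95, -0.14, 0.41, -0.14, -0.27, -1.08],
}


def summarize(deltas):
    """Return {domain: (mean, n_negative, n_positive)} for each row."""
    return {
        domain: (sum(row) / len(row),
                 sum(v < 0 for v in row),
                 sum(v > 0 for v in row))
        for domain, row in deltas.items()
    }


def extreme_cells(deltas, models):
    """Return the (value, domain, model) tuples with the largest and smallest value."""
    cells = [(value, domain, model)
             for domain, row in deltas.items()
             for model, value in zip(models, row)]
    return max(cells), min(cells)


for domain, (mean, n_neg, n_pos) in summarize(DELTAS).items():
    print(f"{domain}: mean {mean:+.2f}%, {n_neg} negative, {n_pos} positive")
```

Calling `extreme_cells(DELTAS, MODELS)` returns the largest and smallest cells. When a value is repeated, as -5.88% is, only one of the tied cells is returned.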
### Key Observations
1. **Strongest Positive Performance:** The largest positive delta is **+4.61%** for the "All-Game" model in the "Analysis and Differential Equations" domain (dark green cell). The "Magic Square" model also shows strong positive performance in this domain (+3.77%).
2. **Strongest Negative Performance:** The most significant negative deltas are **-5.88%**, occurring multiple times in the "Foundations and Logic" domain for the Sudoku, Nonogram, Magic Square, Graph, and Knight & Knaves models (dark red cells). The "Magic Square" model also shows a large negative delta of **-5.26%** in "Applied and Computational Mathematics."
3. **Domain-Wide Trends:**
    * **Applied and Computational Mathematics:** Shows negative deltas for every model, ranging from -1.58% to -5.26%. It is the only domain in which all eight models regress relative to the base.
    * **Foundations and Logic:** Exhibits a polarized pattern. Each delta is either +0.00% (Cryptarithm, Zebra puzzle, All-Game) or -5.88% (the other five models), with no intermediate values.
    * **Analysis and Differential Equations:** Shows predominantly positive or neutral deltas; the only regression is Sudoku at -0.42%.
    * **Geometry and Topology:** Shows positive deltas for every model except All-Game (-0.86%), with the "Nonogram" model achieving the highest gain in this domain (+2.73%).
4. **Model-Specific Trends:**
    * **Magic Square:** Shows the widest spread of any model. Its best result is +3.77% (Analysis and Differential Equations), while its worst are -5.88% (Foundations and Logic) and -5.26% (Applied and Computational Mathematics).
    * **All-Game:** Records the single highest delta in the grid (+4.61%) but regresses in four domains, most sharply in "Applied and Computational Mathematics" (-1.58%) and "Probability, Statistics, and Discrete Mathematics" (-1.08%).
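The model-specific trends can be checked by reading a single column of the table from top to bottom. The following is a minimal sketch for the two models discussed; `DOMAINS`, `COLUMNS`, and `column_range` are hypothetical names, and the domain labels are shortened for readability.

```python
# Short labels for the seven domains, in table order.
DOMAINS = ["Algebra", "Analysis", "Applied", "Arithmetic",
           "Foundations", "Geometry", "Probability"]

# Two columns of the table (deltas from base, %), read top to bottom.
COLUMNS = {
    "Magic Square": [0.16, 3.77, -5.26, 0.20, -5.88, 0.72, -0.14],
    "All-Game":     [0.70, 4.61, -1.58, -0.42, 0.00, -0.86, -1.08],
}


def column_range(values, domains):
    """Return (best, worst, spread) for one model.

    best and worst are (delta, domain) pairs; spread is best minus worst,
    rounded to two decimals to avoid floating-point noise.
    """
    best = max(zip(values, domains))
    worst = min(zip(values, domains))
    return best, worst, round(best[0] - worst[0], 2)


for model, values in COLUMNS.items():
    best, worst, spread = column_range(values, DOMAINS)
    print(f"{model}: best {best[0]:+.2f}% ({best[1]}), "
          f"worst {worst[0]:+.2f}% ({worst[1]}), spread {spread:.2f}")
```

The spread is expressed in the same units as the deltas. Whether those units are percentage points or relative percent is not stated in the figure.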
### Interpretation
This heatmap provides a comparative analysis of how different AI models (or puzzle-solving approaches) perform relative to a baseline across specialized mathematical fields.
* **What the data suggests:** The performance delta is highly domain- and model-dependent, and no single model improves on the base in every domain. "Analysis and Differential Equations" is where the models gain the most relative to the base, while "Applied and Computational Mathematics" is the one domain in which every model regresses. Because the values are deltas from the base rather than absolute scores, they show where the models gained or lost ground, not which domains are hardest in absolute terms. The stark, binary results in "Foundations and Logic" suggest a possible categorical failure or a specific benchmarking quirk for those models in that domain.
* **How elements relate:** The color gradient allows for immediate visual identification of gains (green) and regressions (red). The grid structure enables direct comparison: one can scan a column to see how a single model performs across all domains, or scan a row to see which models perform best in a specific mathematical domain.
* **Notable anomalies:** The repeated -5.88% value in "Foundations and Logic" stands out. Since 5.88% is approximately 1/17, such a precise, repeated value is consistent with a one-question difference on a small evaluation subset (for example, 17 items), although the figure does not state the subset size. It could also indicate a systematic error, a ceiling/floor effect in the benchmark, or a shared limitation of those models in handling logical foundations. The complete absence of positive deltas in "Applied and Computational Mathematics" is the other notable pattern: every model loses ground to the base in this domain.
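One way to probe the repeated -5.88% value is to ask which small item counts could produce it if the delta were a difference in the number of correct answers divided by the number of items. The sketch below rests on that assumption, which the figure does not confirm; `candidate_fractions` and its parameters are hypothetical.

```python
def candidate_fractions(percent, max_items=50, decimals=2):
    """List (k, n) pairs where a change of k answers out of n items
    rounds to the given percentage (sign ignored), for n <= max_items."""
    target = round(abs(percent), decimals)
    return [(k, n)
            for n in range(1, max_items + 1)
            for k in range(1, n + 1)
            if round(100 * k / n, decimals) == target]


print(candidate_fractions(-5.88))  # → [(1, 17), (2, 34)]
```

Under this assumption, the smallest subset consistent with -5.88% has 17 items with a single answer flipped. This is a plausibility check only, since averaging over multiple runs or a different scoring rule would change the arithmetic.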