## Heatmap: AI Model Accuracy Across Math Subjects and Problem Types
### Overview
The image is a comparative heatmap visualizing the accuracy of five AI models (GPT-4o-mini, Gemini 2.0 Flash, Mistral Small 3.2 24B, Gemma 3 27B, and Llama 4 Maverick) across seven math subjects (Algebra, Count. & Prob., Geometry, Inter. Algebra, Number Theory, Prealgebra, Precalculus) and four problem types (PoT, CR, MACM, IIPC). Accuracy is represented via a color gradient (purple = low, yellow = high), with numerical values provided for each data point.
---
### Components/Axes
- **Y-Axis (Rows)**: Math subjects (Algebra, Count. & Prob., Geometry, Inter. Algebra, Number Theory, Prealgebra, Precalculus).
- **X-Axis (Columns)**: Problem types (PoT, CR, MACM, IIPC).
- **Legend**: Color gradient from purple (0%) to yellow (100%) representing accuracy percentages.
- **Sub-Charts**: Five distinct heatmaps, one per AI model, arranged in two rows (top: GPT-4o-mini, Gemini 2.0 Flash; bottom: Mistral Small 3.2 24B, Gemma 3 27B, Llama 4 Maverick).
---
### Detailed Analysis
#### GPT-4o-mini
- **Algebra**: 94.88% (PoT), 91.16% (CR), 89.30% (MACM), 95.35% (IIPC).
- **Count. & Prob.**: 82.46% (PoT), 77.25% (CR), 75.83% (MACM), 81.04% (IIPC).
- **Geometry**: 68.57% (PoT), 62.86% (CR), 63.81% (MACM), 67.62% (IIPC).
- **Inter. Algebra**: 73.95% (PoT), 63.26% (CR), 60.00% (MACM), 72.09% (IIPC).
- **Number Theory**: 85.15% (PoT), 88.61% (CR), 74.75% (MACM), 85.64% (IIPC).
- **Prealgebra**: 90.70% (PoT), 90.23% (CR), 87.38% (MACM), 93.02% (IIPC).
- **Precalculus**: 72.56% (PoT), 62.79% (CR), 57.67% (MACM), 72.09% (IIPC).
#### Gemini 2.0 Flash
- **Algebra**: 98.14% (PoT), 97.21% (CR), 96.73% (MACM), 99.53% (IIPC).
- **Count. & Prob.**: 93.36% (PoT), 88.15% (CR), 89.10% (MACM), 92.89% (IIPC).
- **Geometry**: 84.76% (PoT), 79.52% (CR), 77.14% (MACM), 84.29% (IIPC).
- **Inter. Algebra**: 91.63% (PoT), 89.30% (CR), 88.37% (MACM), 91.16% (IIPC).
- **Number Theory**: 92.08% (PoT), 96.04% (CR), 95.05% (MACM), 98.51% (IIPC).
- **Prealgebra**: 96.28% (PoT), 94.42% (CR), 94.42% (MACM), 97.67% (IIPC).
- **Precalculus**: 91.63% (PoT), 86.05% (CR), 90.23% (MACM), 94.88% (IIPC).
#### Mistral Small 3.2 24B
- **Algebra**: 97.67% (PoT), 95.35% (CR), 96.28% (MACM), 96.28% (IIPC).
- **Count. & Prob.**: 91.00% (PoT), 80.57% (CR), 81.99% (MACM), 91.00% (IIPC).
- **Geometry**: 80.00% (PoT), 71.90% (CR), 70.95% (MACM), 82.38% (IIPC).
- **Inter. Algebra**: 86.51% (PoT), 78.14% (CR), 78.14% (MACM), 88.84% (IIPC).
- **Number Theory**: 96.53% (PoT), 92.08% (CR), 88.61% (MACM), 94.55% (IIPC).
- **Prealgebra**: 93.02% (PoT), 90.70% (CR), 91.16% (MACM), 94.88% (IIPC).
- **Precalculus**: 82.79% (PoT), 76.74% (CR), 67.91% (MACM), 87.91% (IIPC).
#### Gemma 3 27B
- **Algebra**: 98.14% (PoT), 97.21% (CR), 97.67% (MACM), 98.60% (IIPC).
- **Count. & Prob.**: 87.20% (PoT), 82.94% (CR), 82.46% (MACM), 86.26% (IIPC).
- **Geometry**: 81.90% (PoT), 78.10% (CR), 76.19% (MACM), 82.38% (IIPC).
- **Inter. Algebra**: 83.72% (PoT), 82.79% (CR), 82.79% (MACM), 88.37% (IIPC).
- **Number Theory**: 91.09% (PoT), 90.59% (CR), 93.07% (MACM), 97.03% (IIPC).
- **Prealgebra**: 94.42% (PoT), 94.88% (CR), 92.56% (MACM), 96.28% (IIPC).
- **Precalculus**: 86.51% (PoT), 82.79% (CR), 82.33% (MACM), 85.12% (IIPC).
#### Llama 4 Maverick
- **Algebra**: 95.81% (PoT), 97.21% (CR), 98.14% (MACM), 98.60% (IIPC).
- **Count. & Prob.**: 91.00% (PoT), 91.00% (CR), 92.42% (MACM), 91.47% (IIPC).
- **Geometry**: 79.52% (PoT), 80.00% (CR), 75.24% (MACM), 80.48% (IIPC).
- **Inter. Algebra**: 83.72% (PoT), 80.00% (CR), 84.19% (MACM), 87.44% (IIPC).
- **Number Theory**: 91.09% (PoT), 94.06% (CR), 91.09% (MACM), 94.06% (IIPC).
- **Prealgebra**: 94.42% (PoT), 95.35% (CR), 94.42% (MACM), 96.74% (IIPC).
- **Precalculus**: 86.98% (PoT), 85.12% (CR), 85.12% (MACM), 89.77% (IIPC).
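One way to sanity-check column-level patterns in the grids above is to average each problem-type column over the seven subjects. A minimal sketch for the GPT-4o-mini panel, using only the values transcribed above and the Python standard library:

```python
from statistics import mean

# GPT-4o-mini accuracies (%) per subject, columns ordered [PoT, CR, MACM, IIPC],
# transcribed from the heatmap panel above.
gpt4o_mini = [
    [94.88, 91.16, 89.30, 95.35],  # Algebra
    [82.46, 77.25, 75.83, 81.04],  # Count. & Prob.
    [68.57, 62.86, 63.81, 67.62],  # Geometry
    [73.95, 63.26, 60.00, 72.09],  # Inter. Algebra
    [85.15, 88.61, 74.75, 85.64],  # Number Theory
    [90.70, 90.23, 87.38, 93.02],  # Prealgebra
    [72.56, 62.79, 57.67, 72.09],  # Precalculus
]

# Average each problem-type column across all seven subjects.
columns = ["PoT", "CR", "MACM", "IIPC"]
col_means = {c: mean(row[i] for row in gpt4o_mini) for i, c in enumerate(columns)}
for c in columns:
    print(f"{c}: {col_means[c]:.2f}")
```

For this model the column means come out roughly PoT ≈ IIPC > CR > MACM, which is the pattern the observations below rely on.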
---
### Key Observations
1. **Llama 4 Maverick** delivers strong, notably even performance across subjects and problem types, peaking in **Algebra (IIPC: 98.60%)** and still reaching **89.77% (IIPC)** in Precalculus.
2. **Geometry** is the weakest subject for all five models: GPT-4o-mini drops as low as 62.86% (CR), and even the best models stay below 85% in every Geometry cell.
3. **PoT** and **IIPC** generally yield higher accuracy than **CR** and **MACM** (e.g., GPT-4o-mini in Inter. Algebra: 73.95% PoT vs. 60.00% MACM).
4. **Gemini 2.0 Flash** excels in **Algebra (99.53% IIPC)** and **Number Theory (98.51% IIPC)**, but is weakest in Geometry (as low as 77.14% MACM).
5. **Mistral Small 3.2 24B**'s weakest cell is **Precalculus (MACM: 67.91%)**, though it performs well in **Number Theory (96.53% PoT)**.
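The trade-off between the two strongest models (observations 1 and 4) can be checked numerically from the transcribed Gemini 2.0 Flash and Llama 4 Maverick grids: overall mean accuracy versus the worst-case spread across problem types within any single subject. A stdlib-only sketch:

```python
from statistics import mean

# Accuracies (%) per subject as [PoT, CR, MACM, IIPC], transcribed from the heatmap.
gemini = {
    "Algebra":        [98.14, 97.21, 96.73, 99.53],
    "Count. & Prob.": [93.36, 88.15, 89.10, 92.89],
    "Geometry":       [84.76, 79.52, 77.14, 84.29],
    "Inter. Algebra": [91.63, 89.30, 88.37, 91.16],
    "Number Theory":  [92.08, 96.04, 95.05, 98.51],
    "Prealgebra":     [96.28, 94.42, 94.42, 97.67],
    "Precalculus":    [91.63, 86.05, 90.23, 94.88],
}
llama = {
    "Algebra":        [95.81, 97.21, 98.14, 98.60],
    "Count. & Prob.": [91.00, 91.00, 92.42, 91.47],
    "Geometry":       [79.52, 80.00, 75.24, 80.48],
    "Inter. Algebra": [83.72, 80.00, 84.19, 87.44],
    "Number Theory":  [91.09, 94.06, 91.09, 94.06],
    "Prealgebra":     [94.42, 95.35, 94.42, 96.74],
    "Precalculus":    [86.98, 85.12, 85.12, 89.77],
}

def overall_mean(model):
    # Mean over all 7 subjects x 4 problem types.
    return mean(v for row in model.values() for v in row)

def max_subject_spread(model):
    # Largest max-minus-min gap across problem types within any one subject.
    return max(max(row) - min(row) for row in model.values())

for name, model in [("Gemini 2.0 Flash", gemini), ("Llama 4 Maverick", llama)]:
    print(f"{name}: mean {overall_mean(model):.2f}%, "
          f"max within-subject spread {max_subject_spread(model):.2f} pts")
```

Gemini 2.0 Flash comes out with the higher overall mean, while Llama 4 Maverick has the smaller worst-case spread across problem types, consistent with reading Gemini as the top scorer and Llama as the most even performer.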
---
### Interpretation
The data suggest that **Gemini 2.0 Flash** achieves the highest accuracy overall, while **Llama 4 Maverick** is the most even performer, with the smallest spread across problem types within each subject. **GPT-4o-mini** trails the other models in most cells. **Geometry** remains a consistent weak point for all models, indicating potential gaps in spatial reasoning or visualization capabilities. The **PoT** and **IIPC** problem types generally outperform **CR** and **MACM**, which may reflect training-data biases or differences in how well each problem format elicits a model's reasoning. The heatmap highlights Geometry and Precalculus as the clearest opportunities for improvement, particularly for GPT-4o-mini and Mistral Small 3.2 24B.