## Heatmap Grid: Model Performance Across Subjects and Tasks
### Overview
This image displays a grid of heatmaps, one per language model, showing accuracy on a set of mathematical subjects and tasks. The models evaluated are GPT-4o-mini, Gemini 2.0 Flash, Mistral Small 3.2 24B, Gemma 3 27B, and Llama 4 Maverick. For each model, accuracy is reported across six subjects (Algebra, Count. & Prob., Geometry, Inter. Algebra, Number Theory, and Precalculus) and broken down by four tasks: PoT (Program of Thought), CR (Chain-of-Reasoning), MACM (Multi-step Arithmetic Chain-of-Thought), and IIPC (Instruction-following Prompt Completion). A color bar at the bottom indicates the accuracy scale from 0% to 100%.
### Components/Axes
**Overall Structure:**
The image is organized into five heatmap sections, each titled with the name of a language model. The top four models form a 2x2 grid, and the fifth (Llama 4 Maverick) is centered below them.
**Individual Heatmap Components:**
* **Model Titles:** Located at the top of each heatmap section.
* GPT-4o-mini
* Gemini 2.0 Flash
* Mistral Small 3.2 24B
* Gemma 3 27B
* Llama 4 Maverick
* **Y-axis (Subjects):** Listed vertically on the left side of each heatmap.
* Algebra
* Count. & Prob.
* Geometry
* Inter. Algebra
* Number Theory
* Precalculus
* **X-axis (Tasks):** Listed horizontally at the bottom of each heatmap.
* PoT
* CR
* MACM
* IIPC
* **Color Bar (Legend):** Located at the bottom of the entire image.
* **Label:** "Accuracy (%)"
* **Scale:** Ranges from 0 to 100, with tick marks at 0, 20, 40, 60, 80, and 100.
* **Color Gradient:** A gradient from dark purple (low accuracy) through blue, teal, and green to bright yellow (high accuracy).
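The layout described above can be sketched with matplotlib (an assumption; the actual plotting code behind the figure is unknown). The `viridis` colormap matches the described dark-purple-to-yellow gradient. The accuracy values here are random placeholders rather than the real data, and the five panels are approximated as a 3x2 grid with the unused slot hidden:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

models = ["GPT-4o-mini", "Gemini 2.0 Flash", "Mistral Small 3.2 24B",
          "Gemma 3 27B", "Llama 4 Maverick"]
subjects = ["Algebra", "Count. & Prob.", "Geometry",
            "Inter. Algebra", "Number Theory", "Precalculus"]
tasks = ["PoT", "CR", "MACM", "IIPC"]

rng = np.random.default_rng(0)
fig, axes = plt.subplots(3, 2, figsize=(8, 10))
axes = axes.ravel()
axes[-1].set_visible(False)  # five models in a 3x2 grid leaves one slot empty
for ax, model in zip(axes, models):
    acc = rng.uniform(60, 100, size=(len(subjects), len(tasks)))  # placeholder data
    im = ax.imshow(acc, cmap="viridis", vmin=0, vmax=100)
    ax.set_title(model)
    ax.set_xticks(range(len(tasks)), tasks)
    ax.set_yticks(range(len(subjects)), subjects)
# One shared color bar for all panels, as in the figure.
fig.colorbar(im, ax=axes[:len(models)].tolist(), orientation="horizontal",
             label="Accuracy (%)", ticks=[0, 20, 40, 60, 80, 100])
fig.savefig("heatmap_grid.png")
```

Pinning `vmin=0, vmax=100` on every panel is what makes a single shared color bar meaningful: the same color denotes the same accuracy in every heatmap.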
### Detailed Analysis
The following data represents the accuracy percentages for each model, subject, and task. The color of each cell corresponds to the accuracy indicated by the color bar.
**1. GPT-4o-mini**
| Subject | PoT | CR | MACM | IIPC |
|----------------|---------|---------|---------|---------|
| Algebra | 94.88 | 91.16 | 89.30 | 95.35 |
| Count. & Prob. | 82.46 | 77.25 | 75.83 | 81.04 |
| Geometry | 68.57 | 62.86 | 63.81 | 67.62 |
| Inter. Algebra | 73.95 | 63.26 | 60.00 | 72.09 |
| Number Theory | 85.15 | 88.61 | 74.75 | 85.64 |
| Precalculus | 90.70 | 90.23 | 87.38 | 93.02 |
**2. Gemini 2.0 Flash**
| Subject | PoT | CR | MACM | IIPC |
|----------------|---------|---------|---------|---------|
| Algebra | 98.14 | 97.21 | 96.73 | 99.53 |
| Count. & Prob. | 93.36 | 88.15 | 89.10 | 92.89 |
| Geometry | 84.76 | 79.52 | 77.14 | 84.29 |
| Inter. Algebra | 91.63 | 89.30 | 88.37 | 91.16 |
| Number Theory | 92.08 | 96.04 | 95.05 | 98.51 |
| Precalculus | 91.63 | 86.05 | 90.23 | 94.88 |
**3. Mistral Small 3.2 24B**
| Subject | PoT | CR | MACM | IIPC |
|----------------|---------|---------|---------|---------|
| Algebra | 97.67 | 95.35 | 96.28 | 96.28 |
| Count. & Prob. | 91.00 | 80.57 | 81.99 | 91.00 |
| Geometry | 80.00 | 71.90 | 70.95 | 82.38 |
| Inter. Algebra | 86.51 | 78.14 | 78.14 | 88.84 |
| Number Theory | 96.53 | 92.08 | 88.61 | 94.55 |
| Precalculus | 93.02 | 90.70 | 91.16 | 94.88 |
**4. Gemma 3 27B**
| Subject | PoT | CR | MACM | IIPC |
|----------------|---------|---------|---------|---------|
| Algebra | 98.14 | 97.21 | 97.67 | 98.60 |
| Count. & Prob. | 87.20 | 82.94 | 82.46 | 86.26 |
| Geometry | 81.90 | 78.10 | 76.19 | 82.38 |
| Inter. Algebra | 83.72 | 82.79 | 82.79 | 88.37 |
| Number Theory | 91.09 | 90.59 | 93.07 | 97.03 |
| Precalculus | 94.42 | 94.88 | 92.56 | 96.28 |
**5. Llama 4 Maverick**
| Subject | PoT | CR | MACM | IIPC |
|----------------|---------|---------|---------|---------|
| Algebra | 95.81 | 97.21 | 98.14 | 98.60 |
| Count. & Prob. | 91.00 | 91.00 | 92.42 | 91.47 |
| Geometry | 79.52 | 80.00 | 75.24 | 80.48 |
| Inter. Algebra | 83.72 | 80.00 | 84.19 | 87.44 |
| Number Theory | 91.09 | 94.06 | 91.09 | 94.06 |
| Precalculus | 94.42 | 95.35 | 94.42 | 96.74 |
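The per-subject and per-task trends can be checked directly from the five tables above. A short numpy script, with the cell values copied verbatim from the tables (rows follow the subject order, columns the task order):

```python
import numpy as np

subjects = ["Algebra", "Count. & Prob.", "Geometry",
            "Inter. Algebra", "Number Theory", "Precalculus"]
tasks = ["PoT", "CR", "MACM", "IIPC"]

# Accuracy (%) per model: one 6x4 block of [PoT, CR, MACM, IIPC] per subject.
data = {
    "GPT-4o-mini": [
        [94.88, 91.16, 89.30, 95.35],
        [82.46, 77.25, 75.83, 81.04],
        [68.57, 62.86, 63.81, 67.62],
        [73.95, 63.26, 60.00, 72.09],
        [85.15, 88.61, 74.75, 85.64],
        [90.70, 90.23, 87.38, 93.02],
    ],
    "Gemini 2.0 Flash": [
        [98.14, 97.21, 96.73, 99.53],
        [93.36, 88.15, 89.10, 92.89],
        [84.76, 79.52, 77.14, 84.29],
        [91.63, 89.30, 88.37, 91.16],
        [92.08, 96.04, 95.05, 98.51],
        [91.63, 86.05, 90.23, 94.88],
    ],
    "Mistral Small 3.2 24B": [
        [97.67, 95.35, 96.28, 96.28],
        [91.00, 80.57, 81.99, 91.00],
        [80.00, 71.90, 70.95, 82.38],
        [86.51, 78.14, 78.14, 88.84],
        [96.53, 92.08, 88.61, 94.55],
        [93.02, 90.70, 91.16, 94.88],
    ],
    "Gemma 3 27B": [
        [98.14, 97.21, 97.67, 98.60],
        [87.20, 82.94, 82.46, 86.26],
        [81.90, 78.10, 76.19, 82.38],
        [83.72, 82.79, 82.79, 88.37],
        [91.09, 90.59, 93.07, 97.03],
        [94.42, 94.88, 92.56, 96.28],
    ],
    "Llama 4 Maverick": [
        [95.81, 97.21, 98.14, 98.60],
        [91.00, 91.00, 92.42, 91.47],
        [79.52, 80.00, 75.24, 80.48],
        [83.72, 80.00, 84.19, 87.44],
        [91.09, 94.06, 91.09, 94.06],
        [94.42, 95.35, 94.42, 96.74],
    ],
}

# Stack into a (model, subject, task) array, then average out the other axes.
grid = np.array(list(data.values()))      # shape (5, 6, 4)
subject_means = grid.mean(axis=(0, 2))    # mean accuracy per subject
task_means = grid.mean(axis=(0, 1))       # mean accuracy per task

for name, mean in sorted(zip(subjects, subject_means), key=lambda p: p[1]):
    print(f"{name:<15} {mean:6.2f}")
for name, mean in zip(tasks, task_means):
    print(f"{name:<5} {mean:6.2f}")
```

Averaged this way, Geometry has the lowest subject mean (about 76.4%) and Algebra the highest (about 96.5%), while IIPC has the highest task mean (about 90.1%), consistent with the observations below.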
### Key Observations
* **Top Performers:** Gemini 2.0 Flash and Llama 4 Maverick generally post the highest accuracies, particularly in Algebra and Number Theory, often scoring above 95% and approaching 100% in some cases (e.g., 99.53% for Gemini 2.0 Flash on Algebra with IIPC).
* **Subject Difficulty:** Geometry is the most challenging subject for every model, with accuracies consistently lower than the other subjects (roughly 63-85%, as low as 62.86% for GPT-4o-mini). Count. & Prob. and Inter. Algebra also show moderate difficulty.
* **Task Performance:** The "IIPC" (Instruction-following Prompt Completion) task often yields higher accuracies for many models, suggesting it might be a more straightforward task or that models are better optimized for it. "MACM" (Multi-step Arithmetic Chain-of-Thought) sometimes shows lower scores, particularly in Geometry.
* **Model Strengths/Weaknesses:**
* GPT-4o-mini shows strong performance in Algebra and Precalculus but is weaker in Geometry.
* Gemini 2.0 Flash is a strong all-around performer, especially in Algebra and Number Theory.
* Mistral Small 3.2 24B performs well in Algebra and Number Theory but struggles more with Geometry.
* Gemma 3 27B is competitive, with high scores in Algebra and Number Theory, but also shows moderate performance in Geometry.
* Llama 4 Maverick excels in Algebra and shows strong performance in Precalculus and Number Theory, but its Geometry scores are comparable to other models.
* **Color Consistency:** The color mapping appears consistent across all heatmaps, with darker purples representing lower accuracies and bright yellows representing higher accuracies, aligning with the provided color bar.
### Interpretation
This grid of heatmaps provides a comparative analysis of the mathematical reasoning capabilities of several large language models. The data suggests a clear hierarchy in performance, with Gemini 2.0 Flash and Llama 4 Maverick generally outperforming GPT-4o-mini, Mistral Small 3.2 24B, and Gemma 3 27B on these specific mathematical tasks.
The consistent lower performance in Geometry across most models indicates that this subject might require more sophisticated spatial reasoning or a deeper understanding of geometric principles that current models are still developing. Conversely, subjects like Algebra and Number Theory, which are more amenable to symbolic manipulation and logical deduction, show higher accuracy.
The variation in performance across tasks (PoT, CR, MACM, IIPC) highlights the impact of prompt engineering and task formulation. The generally higher scores on IIPC suggest that models might be more adept at following direct instructions for completion rather than engaging in complex multi-step reasoning processes like MACM, although this is not universally true.
Overall, the data demonstrates the progress in LLM capabilities for mathematical problem-solving, while also pinpointing areas for future improvement, particularly in more complex or abstract reasoning domains like Geometry. The visual representation through heatmaps allows for quick identification of model strengths and weaknesses across different mathematical domains and task types.