## Horizontal Dumbbell Chart: AI Model Accuracy on Congruent vs. Incongruent Tasks
### Overview
This image is a horizontal dumbbell chart comparing the performance of 14 different large language models (LLMs) on two types of tasks: "Congruent (Easier)" and "Incongruent (Harder)". The chart visualizes the accuracy percentage for each task type per model and highlights the performance gap (delta) between them.
### Components/Axes
* **Chart Type:** Horizontal Dumbbell Chart (also known as a Connected Dot Plot).
* **Y-Axis (Vertical):** Lists 14 AI model names. From top to bottom:
1. Llama 3.2 3B Instruct
2. Llama 3.3 70B Instruct
3. Qwen3-Next 80B A3B Thinking
4. Llama 3.2 1B Instruct
5. Llama 3.1 8B Instruct
6. Gemini 2.5 Flash Lite
7. DeepSeek V3.1
8. Kimi-K2-Instruct
9. GLM-4.6
10. Gemini 2.5 Pro
11. Gemini 2.5 Flash
12. GPT-OSS-20B
13. Qwen3-Next 80B A3B Instruct
14. Gemma 3 27B IT
* **X-Axis (Horizontal):** Labeled "Accuracy (%)". Scale runs from 0 to 100 with major tick marks at 0, 20, 40, 60, 80, and 100. Text labels "Low Accuracy", "Medium", and "High Accuracy" are placed below the axis at approximately the 10%, 50%, and 90% positions, respectively.
* **Legend:** Positioned at the top center.
* A red circle is labeled "Incongruent (Harder)".
* A blue circle is labeled "Congruent (Easier)".
* **Data Representation:** For each model, a red dot (Incongruent) and a blue dot (Congruent) are plotted on the accuracy scale. A horizontal bar connects the two dots. The numerical difference (delta, Δ) between the two accuracy values is displayed in the middle of this bar.
* **Delta Color Coding:** The delta labels are colored by the size and direction of the gap:
    * **Red:** Large positive deltas, where Congruent (blue dot) accuracy is well above Incongruent (red dot) accuracy. This applies to the top 5 models.
    * **Yellow/Orange:** Moderate positive deltas (Gemini 2.5 Flash Lite, DeepSeek V3.1, Kimi-K2-Instruct).
    * **Teal:** Near-negligible positive deltas (GLM-4.6, Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-OSS-20B).
    * **Green:** Negative deltas, where Incongruent (red dot) accuracy exceeds Congruent (blue dot) accuracy. This applies to the bottom 2 models (Qwen3-Next 80B A3B Instruct and Gemma 3 27B IT).
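The color binning described above can be sketched as a small classifier. The exact cutoffs are not stated on the chart, so the thresholds below are assumptions inferred from the labeled deltas (e.g. Δ12.5% is yellow while Δ12.9% is red), used only to illustrate the scheme:

```python
def delta_color(delta: float) -> str:
    """Map a congruent-minus-incongruent delta (percentage points)
    to the chart's label color. Thresholds are inferred, not given."""
    if delta < 0:
        return "green"    # incongruent outperforms congruent
    if delta < 2.0:
        return "teal"     # near-negligible gap
    if delta < 12.7:      # assumed cutoff between Δ12.5 (yellow) and Δ12.9 (red)
        return "yellow"
    return "red"          # large gap

# Labeled deltas from the chart, mapped through the inferred bins
for name, d in [("Llama 3.1 8B Instruct", 12.9),
                ("Gemini 2.5 Flash Lite", 12.5),
                ("GLM-4.6", 1.8),
                ("Gemma 3 27B IT", -13.7)]:
    print(name, delta_color(d))
```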
### Detailed Analysis
**Data Series & Values (Approximate, read from chart):**
| Model | Incongruent (Red Dot) Accuracy | Congruent (Blue Dot) Accuracy | Delta (Δ) | Delta Color |
| :--- | :--- | :--- | :--- | :--- |
| Llama 3.2 3B Instruct | ~35% | ~82% | Δ46.9% | Red |
| Llama 3.3 70B Instruct | ~53% | ~85% | Δ31.6% | Red |
| Qwen3-Next 80B A3B Thinking | ~58% | ~86% | Δ28.0% | Red |
| Llama 3.2 1B Instruct | ~42% | ~63% | Δ20.8% | Red |
| Llama 3.1 8B Instruct | ~57% | ~70% | Δ12.9% | Red |
| Gemini 2.5 Flash Lite | ~83% | ~96% | Δ12.5% | Yellow |
| DeepSeek V3.1 | ~92% | ~99% | Δ7.0% | Yellow |
| Kimi-K2-Instruct | ~91% | ~98% | Δ7.5% | Yellow |
| GLM-4.6 | ~97% | ~98% | Δ1.8% | Teal |
| Gemini 2.5 Pro | ~97% | ~98% | Δ1.4% | Teal |
| Gemini 2.5 Flash | ~97% | ~98% | Δ0.8% | Teal |
| GPT-OSS-20B | ~97% | ~98% | Δ0.8% | Teal |
| Qwen3-Next 80B A3B Instruct | ~85% | ~77% | Δ-7.9% | Green |
| Gemma 3 27B IT | ~74% | ~60% | Δ-13.7% | Green |
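The labeled deltas can be cross-checked against the approximate dot positions read from the chart. The readings below are taken from the table above, so agreement is only expected to within a few percentage points:

```python
# (incongruent, congruent, labeled_delta) — approximate readings from the chart
readings = {
    "Llama 3.2 3B Instruct":       (35, 82, 46.9),
    "Llama 3.3 70B Instruct":      (53, 85, 31.6),
    "Llama 3.1 8B Instruct":       (57, 70, 12.9),
    "Qwen3-Next 80B A3B Instruct": (85, 77, -7.9),
    "Gemma 3 27B IT":              (74, 60, -13.7),
}

for model, (incong, cong, labeled) in readings.items():
    estimated = cong - incong               # delta = congruent - incongruent
    assert abs(estimated - labeled) < 3, model  # dot positions are approximate
    print(f"{model}: estimated Δ{estimated}, labeled Δ{labeled}")
```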
**Trend Verification:**
* **General Trend (Top 12 Models):** The blue dot (Congruent) sits consistently to the right of the red dot (Incongruent), indicating higher accuracy on the easier tasks; the horizontal connecting bar runs from the red dot on the left to the blue dot on the right. The size of the gap (delta) generally shrinks as overall model accuracy increases.
* **Exception Trend (Bottom 2 Models):** For Qwen3-Next 80B A3B Instruct and Gemma 3 27B IT, the red dot (Incongruent) sits to the right of the blue dot (Congruent), indicating higher accuracy on the harder task; the connecting bar runs from the blue dot on the left to the red dot on the right.
### Key Observations
1. **Performance Gap Correlation:** Models with lower overall accuracy (e.g., Llama 3.2 3B Instruct) exhibit the largest performance gaps between congruent and incongruent tasks (Δ46.9%). The highest-performing models (e.g., Gemini 2.5 Pro, GPT-OSS-20B) show near-negligible gaps (Δ<2%).
2. **High-Performance Cluster:** A cluster of models (GLM-4.6, Gemini 2.5 Pro/Flash, GPT-OSS-20B) all achieve very high accuracy (>95%) on both task types with minimal difference between them.
3. **Notable Outliers:**
* **Qwen3-Next 80B A3B Instruct** and **Gemma 3 27B IT** are the only models where performance on the "Harder" incongruent task is better than on the "Easier" congruent task.
* **Llama 3.2 3B Instruct** shows the most dramatic performance drop-off from congruent to incongruent tasks.
4. **Model Family Patterns:** Within the listed Llama models, the gaps order as 3B (Δ46.9%) > 70B (Δ31.6%) > 1B (Δ20.8%) > 8B (Δ12.9%), so the gap does not shrink monotonically with model size: the 70B model shows a larger gap than both the 1B and 8B models.
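Using the labeled deltas from the table above, the Llama-family gap ordering can be recomputed directly:

```python
# Labeled deltas (percentage points) for the Llama models in the chart
llama_gaps = {
    "Llama 3.2 1B Instruct":  20.8,
    "Llama 3.2 3B Instruct":  46.9,
    "Llama 3.1 8B Instruct":  12.9,
    "Llama 3.3 70B Instruct": 31.6,
}

# Sort model names by gap size, largest first
ordering = sorted(llama_gaps, key=llama_gaps.get, reverse=True)
print(ordering)
# Largest gap is the 3B model, followed by 70B, 1B, then 8B — the gap
# does not shrink monotonically with parameter count.
```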
### Interpretation
This chart provides a nuanced view of model capability beyond simple accuracy benchmarks. It suggests that:
1. **Task Congruence is a Key Differentiator:** The "easiness" of a task (congruence) significantly impacts performance, especially for smaller or less capable models. This implies these models may rely more on pattern matching or surface-level cues that are disrupted in incongruent settings.
2. **Advanced Models Generalize Better:** The top-tier models demonstrate robust performance regardless of task congruence. Their minimal delta indicates a deeper, more flexible understanding that isn't as easily fooled by incongruent framing. This is a hallmark of more advanced reasoning capabilities.
3. **The Anomaly of Inverse Performance:** The two models that perform better on harder tasks present a fascinating anomaly. This could indicate specialized training data or architectural biases that inadvertently make them more adept at handling the specific type of incongruity tested, or it could suggest the "congruent" task formulation is somehow problematic for them. This warrants further investigation into the specific nature of the tasks.
4. **Benchmarking Implications:** Evaluating models solely on aggregate accuracy may mask important weaknesses. A model like Llama 3.2 3B Instruct appears to have reasonable accuracy on easier tasks (~82%) but fails significantly on harder ones (~35%), a critical flaw for real-world applications where inputs are often messy or contradictory. This chart argues for multi-faceted evaluation suites that stress-test model reasoning under different conditions.