## Stacked Bar Chart & Radar Chart: Model Performance Analysis
### Overview
The image contains two distinct charts analyzing the performance of a model (or models) on various question-answering datasets. The left chart is a stacked bar chart showing the count of questions categorized by the correctness of the model's answers on Multiple Choice Questions (MCQs) and Open-Ended/Short-form Questions (OSQs). The right chart is a radar chart comparing the overall accuracy percentages for MCQs and OSQs across the same datasets.
### Components/Axes
**Left Chart (Stacked Bar Chart):**
* **X-axis (Dataset):** Lists 9 datasets: MMLU, HellaSwag, Race, ARC, MedMCQA, WinoGrande, CommonsenseQA, PIQA, OpenbookQA.
* **Y-axis (Count):** Linear scale from 0 to 8000, with major ticks at 2000 intervals.
* **Legend (Top-Left):** Defines four stacked categories:
* Orange: Incorrect MCQs, Correct OSQs
* Green: Correct MCQs, Incorrect OSQs
* Red: Correct MCQs, Correct OSQs
* Grey: Incorrect MCQs, Incorrect OSQs
**Right Chart (Radar Chart):**
* **Radial Axes:** Represent the 9 datasets, arranged clockwise: HellaSwag, CommonsenseQA, ARC, WinoGrande, Race, PIQA, OpenbookQA, MMLU, MedMCQA.
* **Concentric Circles:** Represent accuracy values, with labeled rings at 0.2, 0.4, 0.6, and 0.8 (20%, 40%, 60%, 80%).
* **Legend (Top-Right):**
* Red Line & Shaded Area: MCQs Accuracies
* Green Line & Shaded Area: OSQs Accuracies
### Detailed Analysis
**Stacked Bar Chart Data (Approximate Counts):**
* **MMLU:** Total ~7800. Breakdown (bottom to top): Orange ~1200, Green ~1800, Red ~3200, Grey ~1600.
* **HellaSwag:** Total ~3900. Breakdown: Orange ~200, Green ~1900, Red ~800, Grey ~1000.
* **Race:** Total ~3500. Breakdown: Orange ~100, Green ~1100, Red ~2000, Grey ~300.
* **ARC:** Total ~3200. Breakdown: Orange ~100, Green ~800, Red ~2200, Grey ~100.
* **MedMCQA:** Total ~2300. Breakdown: Orange ~200, Green ~700, Red ~700, Grey ~700.
* **WinoGrande:** Total ~1200. Breakdown: Orange ~100, Green ~300, Red ~600, Grey ~200.
* **CommonsenseQA:** Total ~700. Breakdown: Orange ~50, Green ~150, Red ~400, Grey ~100.
* **PIQA:** Total ~650. Breakdown: Orange ~50, Green ~250, Red ~300, Grey ~50.
* **OpenbookQA:** Total ~450. Breakdown: Orange ~50, Green ~150, Red ~200, Grey ~50.
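Because the four categories partition each dataset, the radar chart's accuracies are implied by the bar-chart breakdown: MCQ accuracy is (green + red) / total and OSQ accuracy is (orange + red) / total. A minimal sketch (Python, with the approximate counts above hard-coded; values are chart readings, not exact figures):

```python
# Approximate counts read from the stacked bar chart, per dataset:
# (orange, green, red, grey) =
# (MCQ wrong/OSQ right, MCQ right/OSQ wrong, both right, both wrong)
counts = {
    "MMLU":          (1200, 1800, 3200, 1600),
    "HellaSwag":     (200, 1900, 800, 1000),
    "Race":          (100, 1100, 2000, 300),
    "ARC":           (100, 800, 2200, 100),
    "MedMCQA":       (200, 700, 700, 700),
    "WinoGrande":    (100, 300, 600, 200),
    "CommonsenseQA": (50, 150, 400, 100),
    "PIQA":          (50, 250, 300, 50),
    "OpenbookQA":    (50, 150, 200, 50),
}

def derived_accuracies(orange, green, red, grey):
    """MCQ correct = green + red; OSQ correct = orange + red."""
    total = orange + green + red + grey
    return (green + red) / total, (orange + red) / total

for name, c in counts.items():
    mcq, osq = derived_accuracies(*c)
    print(f"{name:14s} total={sum(c):5d} MCQ~{mcq:.2f} OSQ~{osq:.2f}")
```

For MMLU this yields MCQ ≈ 0.64 and OSQ ≈ 0.56, broadly consistent with the radar chart's ~0.65 and ~0.50; small mismatches are expected since both sets of numbers are eyeballed from the image.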
**Radar Chart Data (Approximate Accuracies):**
* **MCQs (Red Line):** HellaSwag ~0.75, CommonsenseQA ~0.70, ARC ~0.75, WinoGrande ~0.65, Race ~0.80, PIQA ~0.70, OpenbookQA ~0.65, MMLU ~0.65, MedMCQA ~0.60.
* **OSQs (Green Line):** HellaSwag ~0.55, CommonsenseQA ~0.50, ARC ~0.60, WinoGrande ~0.55, Race ~0.65, PIQA ~0.55, OpenbookQA ~0.50, MMLU ~0.50, MedMCQA ~0.45.
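Taking the radar readings at face value, the per-dataset MCQ–OSQ gap is a simple difference; a quick check (Python, values hard-coded from the chart):

```python
# Approximate accuracies read from the radar chart
mcq = {"HellaSwag": 0.75, "CommonsenseQA": 0.70, "ARC": 0.75, "WinoGrande": 0.65,
       "Race": 0.80, "PIQA": 0.70, "OpenbookQA": 0.65, "MMLU": 0.65, "MedMCQA": 0.60}
osq = {"HellaSwag": 0.55, "CommonsenseQA": 0.50, "ARC": 0.60, "WinoGrande": 0.55,
       "Race": 0.65, "PIQA": 0.55, "OpenbookQA": 0.50, "MMLU": 0.50, "MedMCQA": 0.45}

gaps = {d: round(mcq[d] - osq[d], 2) for d in mcq}

# List datasets from widest gap to narrowest
for d, g in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{d:14s} gap={g:.2f}")
```

On these readings the gap is ~0.20 for HellaSwag and CommonsenseQA, ~0.15 for most other datasets, and ~0.10 for WinoGrande.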
### Key Observations
1. **Dataset Size Disparity:** MMLU is the largest dataset by a significant margin (~7800 questions), while OpenbookQA is the smallest (~450).
2. **Performance Consistency:** The "Correct MCQs, Correct OSQs" (Red) segment is the largest (or tied-largest) segment in nearly every dataset, indicating the model often gets both question types right. HellaSwag is the notable exception: there the "Correct MCQs, Incorrect OSQs" (Green) segment dominates.
3. **MCQ vs. OSQ Accuracy Gap:** The radar chart shows a consistent pattern where MCQ accuracy (red area) is higher than OSQ accuracy (green area) for every single dataset. Per the approximate readings above, the gap is widest for HellaSwag and CommonsenseQA (~0.20) and narrowest for WinoGrande (~0.10).
4. **Highest/Lowest Accuracy:** The model achieves its highest MCQ accuracy on Race (~80%) and its lowest on MedMCQA (~60%). For OSQs, the highest is on Race (~65%) and the lowest on MedMCQA (~45%).
5. **Correlation in Performance:** Datasets where the model performs well on MCQs (e.g., Race, ARC) also tend to show relatively better performance on OSQs, though the gap remains.
### Interpretation
This visualization provides a multi-faceted view of model performance across diverse benchmarks. The stacked bar chart reveals the *composition* of errors and successes, showing not just overall volume but the relationship between performance on two distinct task formats (MCQ vs. OSQ) within the same dataset. The radar chart provides a direct, normalized *comparison* of accuracy rates.
The data suggests a few key insights:
1. **Task Format Matters:** The consistent accuracy gap indicates the model finds generating correct short-form answers (OSQs) more challenging than selecting from given options (MCQs). This is a common pattern in language models, where recognition (MCQ) often outperforms recall/generation (OSQ).
2. **Dataset Difficulty:** Datasets like MedMCQA and MMLU appear to be more challenging for the model on both task formats, as indicated by lower accuracy scores and a larger proportion of "Incorrect/Incorrect" (grey) segments in the bar chart.
3. **Error Analysis Potential:** The bar chart's four-category breakdown is particularly useful for diagnostic analysis. For instance, a large "Correct MCQs, Incorrect OSQs" (green) segment would suggest the model has knowledge but struggles to articulate it, while a large "Incorrect MCQs, Correct OSQs" (orange) segment would be unusual and might indicate issues with the MCQ format or options.
4. **Complementary Views:** The two charts are complementary. The bar chart shows absolute counts (influenced by dataset size), while the radar chart shows normalized rates. A dataset can have a large red bar (many questions got both right) but a moderate accuracy point if the total dataset size is very large.
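The diagnostic logic in points 3 and 4 can be sketched as a small helper that buckets per-question outcomes into the chart's four legend categories and reports both absolute counts and normalized rates. This is an illustrative sketch, not the original analysis pipeline; the per-question booleans below are hypothetical:

```python
from collections import Counter

def categorize(mcq_correct: bool, osq_correct: bool) -> str:
    """Map one question's two outcomes onto the chart's four legend categories."""
    return {
        (False, True):  "Incorrect MCQs, Correct OSQs",    # orange
        (True, False):  "Correct MCQs, Incorrect OSQs",    # green
        (True, True):   "Correct MCQs, Correct OSQs",      # red
        (False, False): "Incorrect MCQs, Incorrect OSQs",  # grey
    }[(mcq_correct, osq_correct)]

# Hypothetical per-question (mcq_correct, osq_correct) results for one dataset
results = [(True, True), (True, False), (False, False), (True, True), (False, True)]

# Absolute counts (what the stacked bars show) ...
cat_counts = Counter(categorize(m, o) for m, o in results)
# ... versus normalized rates (what the radar chart shows)
cat_rates = {cat: n / len(results) for cat, n in cat_counts.items()}
```

Separating counts from rates is exactly why the two charts are complementary: the bars expose dataset size and error composition, while the radar strips size out.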
In summary, the model demonstrates competent but uneven performance across benchmarks, with a clear and persistent advantage on multiple-choice tasks over open-ended generation tasks. The analysis highlights specific datasets where performance lags, guiding potential areas for model improvement or data curation.