## Bar Charts: Comparison of LLMs Across Datasets
### Overview
The image contains three grouped bar charts comparing the performance of various large language models (LLMs) on three datasets: HotpotQA, GSM8k, and GPQA. Each chart covers a different grouping of models: open-source LLMs, closed-source LLMs, and instruction-based versus reasoning-based LLMs. The y-axis shows scores from 0 to 100; the x-axis labels the datasets.
---
### Components/Axes
- **X-Axis (Datasets)**:
- HotpotQA (leftmost group)
- GSM8k (middle group)
- GPQA (rightmost group)
- **Y-Axis (Scores)**:
- Scale from 0 to 100, with increments of 20.
- **Legends**:
1. **Open-source LLMs** (left chart):
- LLaMA 3.1-8B (green)
- LLaMA 3.1-70B (yellow)
- Qwen2.5-72B (purple)
- OWEN2.5-72B (red)
2. **Closed-source LLMs** (middle chart):
- Qwen2.5-72B (red)
- Claude3.5 (blue)
- GPT-3.5 (orange)
- GPT-4o (green)
3. **Instruction-based vs. Reasoning LLMs** (right chart):
- Qwen2.5-72B (red)
- GPT-4o (green)
- QWQ-32B (pink)
- DeepSeek-V3 (purple)
---
### Detailed Analysis
#### Open-source LLMs (Left Chart)
- **HotpotQA**:
- LLaMA 3.1-70B: ~85
- Qwen2.5-72B: ~83
- LLaMA 3.1-8B: ~78
- OWEN2.5-72B: ~82
- **GSM8k**:
- LLaMA 3.1-70B: ~90
- Qwen2.5-72B: ~92
- LLaMA 3.1-8B: ~81
- OWEN2.5-72B: ~84
- **GPQA**:
- LLaMA 3.1-70B: ~15
- Qwen2.5-72B: ~10
- LLaMA 3.1-8B: ~8
- OWEN2.5-72B: ~12
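
The open-source readings above can be captured in a simple nested mapping for quick sanity checks. A minimal pure-Python sketch (the numbers are approximate values read off the chart, not official benchmark results):

```python
# Approximate scores read off the open-source chart (left panel).
# These are visual estimates from the figure, not official benchmark numbers.
open_source_scores = {
    "HotpotQA": {"LLaMA 3.1-70B": 85, "Qwen2.5-72B": 83, "LLaMA 3.1-8B": 78, "OWEN2.5-72B": 82},
    "GSM8k":    {"LLaMA 3.1-70B": 90, "Qwen2.5-72B": 92, "LLaMA 3.1-8B": 81, "OWEN2.5-72B": 84},
    "GPQA":     {"LLaMA 3.1-70B": 15, "Qwen2.5-72B": 10, "LLaMA 3.1-8B": 8,  "OWEN2.5-72B": 12},
}

# Highest-scoring model per dataset.
top_per_dataset = {
    dataset: max(models, key=models.get)
    for dataset, models in open_source_scores.items()
}

for dataset, model in top_per_dataset.items():
    print(f"{dataset}: {model} (~{open_source_scores[dataset][model]})")
```

This confirms the per-dataset leaders among the open-source models: LLaMA 3.1-70B on HotpotQA and GPQA, Qwen2.5-72B on GSM8k.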
#### Closed-source LLMs (Middle Chart)
- **HotpotQA**:
- Qwen2.5-72B: ~85
- GPT-4o: ~88
- Claude3.5: ~86
- GPT-3.5: ~82
- **GSM8k**:
- Qwen2.5-72B: ~95
- GPT-4o: ~90
- Claude3.5: ~88
- GPT-3.5: ~80
- **GPQA**:
- Qwen2.5-72B: ~10
- GPT-4o: ~25
- Claude3.5: ~28
- GPT-3.5: ~22
#### Instruction-based vs. Reasoning LLMs (Right Chart)
- **HotpotQA**:
- Qwen2.5-72B: ~85
- GPT-4o: ~88
- QWQ-32B: ~62
- DeepSeek-V3: ~90
- **GSM8k**:
- Qwen2.5-72B: ~95
- GPT-4o: ~90
- QWQ-32B: ~85
- DeepSeek-V3: ~92
- **GPQA**:
- Qwen2.5-72B: ~10
- GPT-4o: ~25
- QWQ-32B: ~12
- DeepSeek-V3: ~30
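
The readings above can be aggregated to compare models numerically. A short sketch using the right-panel values (again, approximate readings transcribed from the lists above, not official scores):

```python
# Approximate scores from the right panel (instruction-based vs. reasoning LLMs).
# Visual estimates from the figure, not official benchmark numbers.
scores = {
    "Qwen2.5-72B": {"HotpotQA": 85, "GSM8k": 95, "GPQA": 10},
    "GPT-4o":      {"HotpotQA": 88, "GSM8k": 90, "GPQA": 25},
    "QWQ-32B":     {"HotpotQA": 62, "GSM8k": 85, "GPQA": 12},
    "DeepSeek-V3": {"HotpotQA": 90, "GSM8k": 92, "GPQA": 30},
}

# Mean score per model across the three datasets.
model_means = {m: sum(d.values()) / len(d) for m, d in scores.items()}

# Mean score per dataset across the four models (highlights the GPQA bottleneck).
datasets = ["HotpotQA", "GSM8k", "GPQA"]
dataset_means = {
    ds: sum(scores[m][ds] for m in scores) / len(scores) for ds in datasets
}
```

On these readings, DeepSeek-V3 has the highest mean across datasets (~70.7), and the GPQA column average (~19) sits far below HotpotQA (~81) and GSM8k (~90.5), which is the "GPQA as a bottleneck" pattern discussed below.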
---
### Key Observations
1. **Open-source LLMs**:
   - LLaMA 3.1-70B and Qwen2.5-72B lead on HotpotQA and GSM8k, but all open-source models perform poorly on GPQA (≤~15).
   - OWEN2.5-72B trails LLaMA 3.1-70B and Qwen2.5-72B on HotpotQA and GSM8k, though it consistently outscores LLaMA 3.1-8B.
2. **Closed-source LLMs**:
   - GPT-4o posts the top HotpotQA score (~88), with Claude3.5 close behind (~86), while Qwen2.5-72B leads GSM8k (~95).
   - GPT-3.5 trails the other models on HotpotQA and GSM8k; on GPQA, Claude3.5 (~28) and GPT-4o (~25) score highest, while Qwen2.5-72B drops to ~10.
3. **Instruction-based vs. Reasoning LLMs**:
   - Instruction-based models (Qwen2.5-72B, GPT-4o) clearly beat QWQ-32B on HotpotQA and GSM8k, but DeepSeek-V3 tops HotpotQA (~90) and stays competitive on GSM8k (~92).
   - DeepSeek-V3 also posts the best GPQA score (~30), well ahead of QWQ-32B (~12).
---
### Interpretation
- **Model Size Matters**: Larger models (e.g., LLaMA 3.1-70B, Qwen2.5-72B) generally achieve higher scores, especially in complex reasoning tasks (GSM8k).
- **Instruction-tuning vs. Reasoning**: Instruction-based models (Qwen2.5-72B, GPT-4o) hold a clear edge over QWQ-32B on HotpotQA and GSM8k, but DeepSeek-V3 matches or beats them on every dataset, so any instruction-tuning advantage is not uniform across the reasoning-focused models shown.
- **GPQA as a Bottleneck**: All models struggle with GPQA, indicating it may test niche or highly specialized knowledge not well-represented in training data.
- **Qwen2.5-72B as a Standout**: This model consistently ranks among the top performers across all datasets and categories, highlighting its versatility.
The data underscores the trade-offs between open-source and closed-source models, with closed-source models like GPT-4o often leading in performance, while Qwen2.5-72B ranks near the top in every chart. Open-source models like LLaMA 3.1-70B remain competitive, particularly on reasoning tasks such as GSM8k. The uniformly low GPQA scores suggest substantial room for improvement in handling specialized, domain-specific questions.