## Bar Chart: Comparison of LLMs Across Datasets
### Overview
The image presents a comparative analysis of large language models (LLMs) on three datasets (HotpotQA, GSM8k, GPQA), split into three panels: Open-source LLMs, Closed-source LLMs, and Instruction-based vs. Reasoning LLMs. Scores range from 0 to 100, with vertical bars representing performance.
---
### Components/Axes
- **X-Axis (Datasets)**:
- HotpotQA (leftmost)
- GSM8k (middle)
- GPQA (rightmost)
- **Y-Axis (Scores)**:
- Scale from 0 to 100 in increments of 20.
- **Legends**:
- **Open-source LLMs**:
- LLaMA3.1-8B (teal)
- LLaMA3.1-70B (yellow)
  - Qwen2.5-72B (red)
- **Closed-source LLMs**:
- Qwen2.5-72B (red)
- Claude3.5 (blue)
- GPT-3.5 (orange)
- GPT-4o (green)
- **Instruction-based vs. Reasoning LLMs**:
- Qwen2.5-72B (red)
- GPT-4o (green)
- QWQ-32B (pink)
- DeepSeek-V3 (purple)
---
### Detailed Analysis
#### Open-source LLMs
- **HotpotQA**:
- LLaMA3.1-8B: ~70
- LLaMA3.1-70B: ~85
  - Qwen2.5-72B: ~90
- **GSM8k**:
- LLaMA3.1-8B: ~80
- LLaMA3.1-70B: ~95
  - Qwen2.5-72B: ~95
- **GPQA**:
- LLaMA3.1-8B: ~10
- LLaMA3.1-70B: ~25
  - Qwen2.5-72B: ~15
#### Closed-source LLMs
- **HotpotQA**:
- Qwen2.5-72B: ~90
- Claude3.5: ~95
- GPT-3.5: ~90
- GPT-4o: ~95
- **GSM8k**:
- Qwen2.5-72B: ~95
- Claude3.5: ~95
- GPT-3.5: ~90
- GPT-4o: ~95
- **GPQA**:
- Qwen2.5-72B: ~15
- Claude3.5: ~15
- GPT-3.5: ~15
- GPT-4o: ~10
#### Instruction-based vs. Reasoning LLMs
- **HotpotQA**:
- Qwen2.5-72B: ~90
- GPT-4o: ~95
- QWQ-32B: ~65
- DeepSeek-V3: ~95
- **GSM8k**:
- Qwen2.5-72B: ~95
- GPT-4o: ~95
- QWQ-32B: ~90
- DeepSeek-V3: ~95
- **GPQA**:
- Qwen2.5-72B: ~15
- GPT-4o: ~15
- QWQ-32B: ~10
- DeepSeek-V3: ~25
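The approximate scores above can be replayed into a grouped bar chart. The sketch below is a minimal matplotlib reconstruction; all values and colors are eyeballed from the description above, not exact figure data, and the file name is arbitrary.

```python
# Minimal sketch reconstructing the three-panel grouped bar chart.
# Scores are the approximate values read off the figure (not exact).
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

datasets = ["HotpotQA", "GSM8k", "GPQA"]

# Per panel: {model: (bar color, [HotpotQA, GSM8k, GPQA] scores)}
panels = {
    "Open-source LLMs": {
        "LLaMA3.1-8B": ("teal", [70, 80, 10]),
        "LLaMA3.1-70B": ("yellow", [85, 95, 25]),
        "Qwen2.5-72B": ("red", [90, 95, 15]),
    },
    "Closed-source LLMs": {
        "Qwen2.5-72B": ("red", [90, 95, 15]),
        "Claude3.5": ("blue", [95, 95, 15]),
        "GPT-3.5": ("orange", [90, 90, 15]),
        "GPT-4o": ("green", [95, 95, 10]),
    },
    "Instruction-based vs. Reasoning LLMs": {
        "Qwen2.5-72B": ("red", [90, 95, 15]),
        "GPT-4o": ("green", [95, 95, 15]),
        "QWQ-32B": ("pink", [65, 90, 10]),
        "DeepSeek-V3": ("purple", [95, 95, 25]),
    },
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, (title, models) in zip(axes, panels.items()):
    n = len(models)
    width = 0.8 / n           # split each dataset slot among the models
    x = np.arange(len(datasets))
    for i, (name, (color, scores)) in enumerate(models.items()):
        ax.bar(x + i * width, scores, width, label=name, color=color)
    ax.set_title(title)
    ax.set_xticks(x + width * (n - 1) / 2)  # center tick under the group
    ax.set_xticklabels(datasets)
    ax.set_ylim(0, 100)
    ax.set_yticks(range(0, 101, 20))        # 0–100 in increments of 20
    ax.legend(fontsize=7)
axes[0].set_ylabel("Score")
fig.savefig("llm_comparison.png")
```

Plotting each model's bars at `x + i * width` within a 0.8-wide slot is the standard matplotlib idiom for grouped bars; the y-axis ticks mirror the 0–100 scale in increments of 20 described above.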
---
### Key Observations
1. **Open-source LLMs**:
   - Strong performance on multi-hop question answering (HotpotQA) and grade-school math (GSM8k).
- Poor performance on GPQA (reasoning tasks), with scores below 30 for all models.
2. **Closed-source LLMs**:
   - High scores on HotpotQA and GSM8k (85–95 range), but a collapse on GPQA (~10–15).
   - No closed-source model stands out on GPQA; GPT-4o actually posts the lowest score there (~10).
3. **Instruction-based vs. Reasoning LLMs**:
- Instruction-based models (Qwen2.5-72B, GPT-4o) excel in HotpotQA and GSM8k.
   - Of the reasoning models, QWQ-32B underperforms across the board, while DeepSeek-V3 posts the highest GPQA score (~25).
4. **Outliers**:
   - DeepSeek-V3 and LLaMA3.1-70B post the highest GPQA scores (~25 each), still far below their scores on the other two datasets.
---
### Interpretation
- **Model Type Impact**: Closed-source models (e.g., GPT-4o, Claude3.5) lead on HotpotQA and GSM8k, but no category handles GPQA well: every model scores below 30 on it.
- **Task-Specific Strengths**:
- Instruction-based models (Qwen2.5-72B, GPT-4o) dominate knowledge-based tasks (HotpotQA, GSM8k).
  - Among the reasoning models, QWQ-32B struggles with GPQA (~10), while DeepSeek-V3 fares somewhat better (~25); even the best scores indicate a large gap in complex problem-solving.
- **Open-source Limitations**: LLaMA3.1-70B and Qwen2.5-72B perform poorly on GPQA in absolute terms, suggesting open-source models may lack specialized reasoning capabilities.
- **DeepSeek-V3 Anomaly**: Its ~25 score on GPQA, matched only by LLaMA3.1-70B, may reflect differences in training data or architecture relative to the other models.
This analysis underscores the trade-offs between model accessibility (open-source) and performance (closed-source) in LLM applications.