## Bar Chart: Comparison of LLMs Across Datasets
### Overview
The image presents a comparative bar chart analyzing the performance of various large language models (LLMs) across three datasets: **HotpotQA**, **GSM8k**, and **GPQA**. The chart is divided into three sections:
1. **Open-source LLMs** (left)
2. **Closed-source LLMs** (center)
3. **Instruction-based vs. Reasoning LLMs** (right)
Each section compares model scores (0–100) across the three datasets, with color-coded legends identifying the models. Qwen2.5-72B (red) appears in all three sections, serving as a common reference point.
---
### Components/Axes
- **X-axis**: Datasets (**HotpotQA**, **GSM8k**, **GPQA**)
- **Y-axis**: Scores (0–100)
- **Legends**:
- **Open-source LLMs**:
- LLaMA3.1-8B (green)
- LLaMA3.1-70B (yellow)
- Qwen2.5-7B (purple)
- Qwen2.5-72B (red)
- **Closed-source LLMs**:
- Qwen2.5-72B (red)
- Claude3.5 (blue)
- GPT-3.5 (orange)
- GPT-4o (green)
- **Instruction-based vs. Reasoning LLMs**:
- Qwen2.5-72B (red)
- GPT-4o (green)
- QWQ-32B (pink)
- DeepSeek-V3 (purple)
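A chart with this layout can be reproduced as a grouped bar plot; below is a minimal matplotlib sketch of one section (the open-source panel), where every score is a rough visual estimate read from the figure, not an official benchmark number:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Open-source panel; scores are rough visual estimates from the chart.
datasets = ["HotpotQA", "GSM8k", "GPQA"]
models = {
    "LLaMA3.1-8B":  [65, 80, 15],
    "LLaMA3.1-70B": [88, 88, 38],
    "Qwen2.5-7B":   [75, 90, 35],
    "Qwen2.5-72B":  [88, 95, 32],
}

x = np.arange(len(datasets))  # one group of bars per dataset
width = 0.2                   # width of each bar within a group
fig, ax = plt.subplots()
for i, (name, vals) in enumerate(models.items()):
    ax.bar(x + i * width, vals, width, label=name)
ax.set_xticks(x + 1.5 * width)  # center tick under each group of 4 bars
ax.set_xticklabels(datasets)
ax.set_ylim(0, 100)
ax.set_ylabel("Score")
ax.set_title("Open-source LLMs")
ax.legend()
fig.savefig("open_source_panel.png")
```

The other two sections would be additional subplots with their own model dictionaries, sharing the same y-axis range.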
---
### Detailed Analysis
#### Open-source LLMs
- **HotpotQA**:
- LLaMA3.1-8B: ~65
- LLaMA3.1-70B: ~88
- Qwen2.5-7B: ~75
- Qwen2.5-72B: ~88
- **GSM8k**:
- LLaMA3.1-8B: ~80
- LLaMA3.1-70B: ~88
- Qwen2.5-7B: ~90
- Qwen2.5-72B: ~95
- **GPQA**:
- LLaMA3.1-8B: ~15
- LLaMA3.1-70B: ~38
- Qwen2.5-7B: ~35
- Qwen2.5-72B: ~32
#### Closed-source LLMs
- **HotpotQA**:
- Qwen2.5-72B: ~92
- Claude3.5: ~90
- GPT-3.5: ~85
- GPT-4o: ~95
- **GSM8k**:
- Qwen2.5-72B: ~95
- Claude3.5: ~93
- GPT-3.5: ~88
- GPT-4o: ~97
- **GPQA**:
- Qwen2.5-72B: ~30
- Claude3.5: ~42
- GPT-3.5: ~45
- GPT-4o: ~44
#### Instruction-based vs. Reasoning LLMs
- **HotpotQA**:
- Qwen2.5-72B: ~90
- GPT-4o: ~92
- QWQ-32B: ~75
- DeepSeek-V3: ~95
- **GSM8k**:
- Qwen2.5-72B: ~94
- GPT-4o: ~96
- QWQ-32B: ~88
- DeepSeek-V3: ~97
- **GPQA**:
- Qwen2.5-72B: ~35
- GPT-4o: ~45
- QWQ-32B: ~25
- DeepSeek-V3: ~50
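The approximate readings above can be tabulated and compared programmatically; a minimal sketch using the third section's values (all numbers are eyeballed from the chart, not official benchmark results):

```python
# Approximate scores from the "Instruction-based vs. Reasoning" section.
# All values are rough visual estimates read from the chart.
scores = {
    "Qwen2.5-72B": {"HotpotQA": 90, "GSM8k": 94, "GPQA": 35},
    "GPT-4o":      {"HotpotQA": 92, "GSM8k": 96, "GPQA": 45},
    "QWQ-32B":     {"HotpotQA": 75, "GSM8k": 88, "GPQA": 25},
    "DeepSeek-V3": {"HotpotQA": 95, "GSM8k": 97, "GPQA": 50},
}

# Report the top-scoring model on each dataset.
for dataset in ("HotpotQA", "GSM8k", "GPQA"):
    best = max(scores, key=lambda m: scores[m][dataset])
    print(f"{dataset}: {best} (~{scores[best][dataset]})")
```

As read here, DeepSeek-V3 tops all three datasets within this section.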
---
### Key Observations
1. **Open-source LLMs**:
- Larger models (e.g., LLaMA3.1-70B, Qwen2.5-72B) outperform smaller variants (e.g., LLaMA3.1-8B, Qwen2.5-7B) across datasets.
   - **GPQA** scores are significantly lower for all open-source models, indicating weak performance on graduate-level science questions.
2. **Closed-source LLMs**:
   - **GPT-4o** leads on both **GSM8k** (grade-school math reasoning) and **HotpotQA** (multi-hop question answering), with **Qwen2.5-72B** close behind.
   - **GPQA** scores remain low for all closed-source models, indicating that graduate-level science questions are challenging even for frontier models.
3. **Instruction-based vs. Reasoning LLMs**:
- **DeepSeek-V3** excels in **GPQA** (~50), outperforming others in this dataset.
- **QWQ-32B** underperforms in **GPQA** (~25) but shows moderate results in **HotpotQA** (~75).
---
### Interpretation
- **Closed-source models** generally outperform open-source models, most visibly on **HotpotQA** (multi-hop question answering) and **GPQA**; on **GSM8k** the gap at the top end is narrow (~97 vs. ~95).
- In the third section, the instruction-based models (**Qwen2.5-72B**, **GPT-4o**, **DeepSeek-V3**) outscore the reasoning model **QWQ-32B** on all three datasets, as the chart reads.
- **DeepSeek-V3** posts the highest **GPQA** score of any model shown (~50), suggesting relative strength on graduate-level science questions.
- **Open-source models** lag furthest on **GPQA**, pointing to a gap in handling difficult, knowledge-intensive scientific reasoning compared to closed-source alternatives.
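To check the open-vs-closed gap described above, the section readings can be averaged per dataset. A rough sketch, with Qwen2.5-72B left out of the closed-source group since it is open-source, and every number a visual estimate from the chart:

```python
# Approximate per-dataset scores [HotpotQA, GSM8k, GPQA], read from the chart.
open_models = {
    "LLaMA3.1-8B":  [65, 80, 15],
    "LLaMA3.1-70B": [88, 88, 38],
    "Qwen2.5-7B":   [75, 90, 35],
    "Qwen2.5-72B":  [88, 95, 32],
}
# Closed-source section, excluding Qwen2.5-72B (an open-source model).
closed_models = {
    "Claude3.5": [90, 93, 42],
    "GPT-3.5":   [85, 88, 45],
    "GPT-4o":    [95, 97, 44],
}

def mean(values):
    return sum(values) / len(values)

# Compare group averages on each dataset.
for i, dataset in enumerate(["HotpotQA", "GSM8k", "GPQA"]):
    avg_open = mean([v[i] for v in open_models.values()])
    avg_closed = mean([v[i] for v in closed_models.values()])
    print(f"{dataset}: open ~{avg_open:.0f}, closed ~{avg_closed:.0f}")
```

On these estimates, the closed-source average leads on every dataset, with the widest relative gap on GPQA (~30 vs. ~44).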
This analysis underscores the performance gap between open-source and closed-source LLMs, which is widest on **GPQA**. Qwen2.5-72B and GPT-4o lead on the near-saturated **GSM8k** and **HotpotQA** benchmarks, while DeepSeek-V3 stands out on **GPQA**.