## Bar Chart Grid: Model Performance Across Datasets and Temperature Settings
### Overview
The image displays a 7x4 grid of bar charts comparing four language models (Mistral-Small-24B, Llama3-1-8B, Llama3-2-3B, Mistral-Nemo) on seven datasets (GSM8K, TruthfulQA, CoQA, SQuADv2, TriviaQA, HaluevalQA, NQOpen) at two temperature settings (0.1 and 1.0). Each chart shows, for each temperature, three color-coded bars giving the counts of "Hallucination" (red), "Non-Hallucination" (green), and "Rejected" (gray) responses.
### Components/Axes
- **X-axis**: Labeled "temperature" with values 0.1 (left) and 1.0 (right).
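- **Y-axis**: Labeled "Count", with tick labels up to 12,000. The scaling appears logarithmic; because a logarithmic axis cannot start at 0, the lower bound is approximate.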
- **Legend**: Located at the bottom center, with:
  - Red = Hallucination
  - Green = Non-Hallucination
  - Gray = Rejected
- **Grid Structure**:
  - Rows represent datasets (top to bottom: GSM8K, TruthfulQA, CoQA, SQuADv2, TriviaQA, HaluevalQA, NQOpen).
  - Columns represent models (left to right: Mistral-Small-24B, Llama3-1-8B, Llama3-2-3B, Mistral-Nemo).
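
The row and column ordering above can be captured in a small lookup, which is handy when referring to a panel by its grid position. This is a minimal sketch: the names `DATASETS`, `MODELS`, `panel_at`, and `position_of` are hypothetical and introduced only for illustration, and positions are assumed to be zero-based.

```python
# Row and column labels in the order described above (top to bottom, left to right).
DATASETS = ["GSM8K", "TruthfulQA", "CoQA", "SQuADv2", "TriviaQA", "HaluevalQA", "NQOpen"]
MODELS = ["Mistral-Small-24B", "Llama3-1-8B", "Llama3-2-3B", "Mistral-Nemo"]


def panel_at(row, col):
    """Return the (dataset, model) pair shown at a zero-based grid position."""
    return DATASETS[row], MODELS[col]


def position_of(dataset, model):
    """Return the zero-based (row, col) of the panel for a dataset/model pair."""
    return DATASETS.index(dataset), MODELS.index(model)


print(panel_at(0, 0))
print(position_of("TruthfulQA", "Mistral-Nemo"))
```
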
### Detailed Analysis
#### Dataset: GSM8K
- **Mistral-Small-24B**:
  - Temperature 0.1: Hallucination (~100), Non-Hallucination (~1,100), Rejected (~50).
  - Temperature 1.0: Hallucination (~150), Non-Hallucination (~1,050), Rejected (~70).
- **Llama3-1-8B**:
  - Temperature 0.1: Hallucination (~200), Non-Hallucination (~1,000), Rejected (~30).
  - Temperature 1.0: Hallucination (~250), Non-Hallucination (~950), Rejected (~40).
- **Llama3-2-3B**:
  - Temperature 0.1: Hallucination (~180), Non-Hallucination (~1,020), Rejected (~60).
  - Temperature 1.0: Hallucination (~220), Non-Hallucination (~980), Rejected (~50).
- **Mistral-Nemo**:
  - Temperature 0.1: Hallucination (~120), Non-Hallucination (~1,080), Rejected (~40).
  - Temperature 1.0: Hallucination (~160), Non-Hallucination (~1,020), Rejected (~50).
#### Dataset: TruthfulQA
- **Mistral-Small-24B**:
  - Temperature 0.1: Hallucination (~250), Non-Hallucination (~300), Rejected (~400).
  - Temperature 1.0: Hallucination (~300), Non-Hallucination (~280), Rejected (~350).
- **Llama3-1-8B**:
  - Temperature 0.1: Hallucination (~500), Non-Hallucination (~200), Rejected (~150).
  - Temperature 1.0: Hallucination (~550), Non-Hallucination (~180), Rejected (~120).
- **Llama3-2-3B**:
  - Temperature 0.1: Hallucination (~480), Non-Hallucination (~220), Rejected (~130).
  - Temperature 1.0: Hallucination (~520), Non-Hallucination (~210), Rejected (~110).
- **Mistral-Nemo**:
  - Temperature 0.1: Hallucination (~350), Non-Hallucination (~320), Rejected (~330).
  - Temperature 1.0: Hallucination (~380), Non-Hallucination (~300), Rejected (~310).
#### Dataset: CoQA
- **Mistral-Small-24B**:
  - Temperature 0.1: Hallucination (~2,000), Non-Hallucination (~5,500), Rejected (~50).
  - Temperature 1.0: Hallucination (~2,200), Non-Hallucination (~5,300), Rejected (~60).
- **Llama3-1-8B**:
  - Temperature 0.1: Hallucination (~2,100), Non-Hallucination (~5,400), Rejected (~40).
  - Temperature 1.0: Hallucination (~2,300), Non-Hallucination (~5,200), Rejected (~50).
- **Llama3-2-3B**:
  - Temperature 0.1: Hallucination (~2,050), Non-Hallucination (~5,450), Rejected (~55).
  - Temperature 1.0: Hallucination (~2,250), Non-Hallucination (~5,150), Rejected (~65).
- **Mistral-Nemo**:
  - Temperature 0.1: Hallucination (~1,800), Non-Hallucination (~5,700), Rejected (~45).
  - Temperature 1.0: Hallucination (~1,900), Non-Hallucination (~5,600), Rejected (~55).
#### Dataset: SQuADv2
- **Mistral-Small-24B**:
  - Temperature 0.1: Hallucination (~1,000), Non-Hallucination (~3,000), Rejected (~200).
  - Temperature 1.0: Hallucination (~1,200), Non-Hallucination (~2,800), Rejected (~180).
- **Llama3-1-8B**:
  - Temperature 0.1: Hallucination (~1,100), Non-Hallucination (~2,900), Rejected (~190).
  - Temperature 1.0: Hallucination (~1,300), Non-Hallucination (~2,700), Rejected (~170).
- **Llama3-2-3B**:
  - Temperature 0.1: Hallucination (~1,050), Non-Hallucination (~2,950), Rejected (~210).
  - Temperature 1.0: Hallucination (~1,250), Non-Hallucination (~2,650), Rejected (~190).
- **Mistral-Nemo**:
  - Temperature 0.1: Hallucination (~900), Non-Hallucination (~3,100), Rejected (~180).
  - Temperature 1.0: Hallucination (~1,000), Non-Hallucination (~3,000), Rejected (~170).
#### Dataset: TriviaQA
- **Mistral-Small-24B**:
  - Temperature 0.1: Hallucination (~3,000), Non-Hallucination (~3,500), Rejected (~200).
  - Temperature 1.0: Hallucination (~3,200), Non-Hallucination (~3,300), Rejected (~190).
- **Llama3-1-8B**:
  - Temperature 0.1: Hallucination (~3,500), Non-Hallucination (~3,000), Rejected (~150).
  - Temperature 1.0: Hallucination (~3,700), Non-Hallucination (~2,900), Rejected (~140).
- **Llama3-2-3B**:
  - Temperature 0.1: Hallucination (~3,400), Non-Hallucination (~3,100), Rejected (~160).
  - Temperature 1.0: Hallucination (~3,600), Non-Hallucination (~2,800), Rejected (~150).
- **Mistral-Nemo**:
  - Temperature 0.1: Hallucination (~2,800), Non-Hallucination (~3,700), Rejected (~170).
  - Temperature 1.0: Hallucination (~2,900), Non-Hallucination (~3,600), Rejected (~160).
#### Dataset: HaluevalQA
- **Mistral-Small-24B**:
  - Temperature 0.1: Hallucination (~1,500), Non-Hallucination (~2,500), Rejected (~300).
  - Temperature 1.0: Hallucination (~1,700), Non-Hallucination (~2,300), Rejected (~280).
- **Llama3-1-8B**:
  - Temperature 0.1: Hallucination (~1,600), Non-Hallucination (~2,400), Rejected (~290).
  - Temperature 1.0: Hallucination (~1,800), Non-Hallucination (~2,200), Rejected (~270).
- **Llama3-2-3B**:
  - Temperature 0.1: Hallucination (~1,550), Non-Hallucination (~2,450), Rejected (~310).
  - Temperature 1.0: Hallucination (~1,750), Non-Hallucination (~2,150), Rejected (~290).
- **Mistral-Nemo**:
  - Temperature 0.1: Hallucination (~1,300), Non-Hallucination (~2,700), Rejected (~280).
  - Temperature 1.0: Hallucination (~1,400), Non-Hallucination (~2,600), Rejected (~270).
#### Dataset: NQOpen
- **Mistral-Small-24B**:
  - Temperature 0.1: Hallucination (~1,000), Non-Hallucination (~1,500), Rejected (~200).
  - Temperature 1.0: Hallucination (~1,100), Non-Hallucination (~1,400), Rejected (~190).
- **Llama3-1-8B**:
  - Temperature 0.1: Hallucination (~1,200), Non-Hallucination (~1,300), Rejected (~180).
  - Temperature 1.0: Hallucination (~1,300), Non-Hallucination (~1,200), Rejected (~170).
- **Llama3-2-3B**:
  - Temperature 0.1: Hallucination (~1,150), Non-Hallucination (~1,350), Rejected (~190).
  - Temperature 1.0: Hallucination (~1,250), Non-Hallucination (~1,250), Rejected (~180).
- **Mistral-Nemo**:
  - Temperature 0.1: Hallucination (~900), Non-Hallucination (~1,600), Rejected (~210).
  - Temperature 1.0: Hallucination (~1,000), Non-Hallucination (~1,500), Rejected (~200).
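
Because the values above are raw counts and the datasets differ in size, comparing models is easier once counts are converted to shares. The sketch below does this for the TruthfulQA row at temperature 0.1, using the approximate counts listed above. The helper names (`shares`, `hallucination_rate_answered`) are hypothetical, and the "answered-only" rate, which excludes rejected responses from the denominator, is an assumption about one reasonable way to normalize rather than something the chart itself reports.

```python
# Approximate counts read from the TruthfulQA row at temperature 0.1.
# Each tuple is (hallucination, non_hallucination, rejected).
TRUTHFULQA_T01 = {
    "Mistral-Small-24B": (250, 300, 400),
    "Llama3-1-8B": (500, 200, 150),
    "Llama3-2-3B": (480, 220, 130),
    "Mistral-Nemo": (350, 320, 330),
}


def shares(counts):
    """Return each category's fraction of the panel total."""
    total = sum(counts)
    return tuple(c / total for c in counts)


def hallucination_rate_answered(counts):
    """Return the hallucination fraction among answered (non-rejected) responses."""
    hallucination, non_hallucination, _rejected = counts
    return hallucination / (hallucination + non_hallucination)


for model, counts in TRUTHFULQA_T01.items():
    h, n, r = shares(counts)
    answered = hallucination_rate_answered(counts)
    print(f"{model}: hallucination={h:.2f}, non-hallucination={n:.2f}, "
          f"rejected={r:.2f}, answered-only hallucination={answered:.2f}")
```

The two normalizations can differ substantially when many responses are rejected: for Mistral-Small-24B, hallucinations are about 26% of all responses (250 of 950) but about 45% of answered ones (250 of 550).
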
### Key Observations
1. **Temperature Sensitivity**:
   - Higher temperature (1.0) increases hallucination counts for every model on every dataset (e.g., TruthfulQA: Llama3-1-8B rises from ~500 at 0.1 to ~550 at 1.0).
   - Mistral-Nemo shows the smallest increases, never exceeding ~100 (e.g., CoQA: ~1,800 to ~1,900), whereas the other three models rise by ~200 on CoQA, SQuADv2, TriviaQA, and HaluevalQA.
2. **Model Robustness**:
   - Mistral-Nemo has the lowest hallucination counts on five of the seven datasets (CoQA, SQuADv2, TriviaQA, HaluevalQA, NQOpen), e.g., SQuADv2: ~900 at 0.1 vs. Llama3-1-8B’s ~1,100. Mistral-Small-24B is lowest on GSM8K (~100 at 0.1) and TruthfulQA (~250 at 0.1).
   - At temperature 0.1, Llama3-2-3B has slightly higher non-hallucination counts than Llama3-1-8B on every dataset (e.g., TriviaQA: ~3,100 vs. ~3,000). At 1.0 the ordering reverses on CoQA, SQuADv2, TriviaQA, and HaluevalQA.
3. **Rejected Responses**:
   - Rejected counts are highest for Mistral-Small-24B on TruthfulQA (~400 at 0.1) and lowest for Llama3-1-8B on GSM8K (~30 at 0.1).
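
The temperature comparison can be checked with simple arithmetic on the approximate counts from the Detailed Analysis. The sketch below computes the absolute and relative change in hallucination count when moving from temperature 0.1 to 1.0 for a handful of dataset/model pairs. The names `HALLUCINATIONS`, `temperature_delta`, and `relative_change` are hypothetical, and the inputs are the rounded estimates read from the chart, so the results are indicative only.

```python
# Approximate hallucination counts, keyed by (dataset, model) and then by temperature.
HALLUCINATIONS = {
    ("TruthfulQA", "Llama3-1-8B"): {"0.1": 500, "1.0": 550},
    ("CoQA", "Llama3-1-8B"): {"0.1": 2100, "1.0": 2300},
    ("CoQA", "Mistral-Nemo"): {"0.1": 1800, "1.0": 1900},
    ("SQuADv2", "Llama3-1-8B"): {"0.1": 1100, "1.0": 1300},
    ("SQuADv2", "Mistral-Nemo"): {"0.1": 900, "1.0": 1000},
}


def temperature_delta(counts):
    """Return the change in count when moving from temperature 0.1 to 1.0."""
    return counts["1.0"] - counts["0.1"]


def relative_change(counts):
    """Return the change as a fraction of the count at temperature 0.1."""
    return temperature_delta(counts) / counts["0.1"]


for (dataset, model), counts in HALLUCINATIONS.items():
    print(f"{dataset} / {model}: "
          f"{temperature_delta(counts):+d} ({relative_change(counts):+.1%})")
```

A positive delta means more hallucinations at the higher temperature, which is the direction seen in every pair listed here.
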
### Interpretation
The data suggests that temperature affects model behavior: raising the temperature from 0.1 to 1.0 increases hallucination counts for all four models, although the shifts are modest compared with the differences between datasets. Mistral-Nemo is the most robust on most datasets, with the lowest hallucination counts on five of seven and the smallest temperature-induced increases. The "Rejected" category indicates instances where models abstained from answering, possibly reflecting confidence thresholds. These trends point to a trade-off between creativity (higher temperature) and factual accuracy (lower temperature), with the choice of model also playing a substantial role in performance.