## Violin Plot: Token Probability Distribution Across Model Variants
### Overview
The chart compares token probability distributions for two language models (LLaMA-3-8B and Mistral-7B-v0.3) across three categories: Factual Associations, Associated Hallucinations, and Unassociated Hallucinations. Each violin shows the estimated density of token probabilities within one category, with the median marked by a horizontal line.
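
To make the violin shape concrete, the sketch below shows one common way such an outline is built: a Gaussian kernel density estimate over the token probabilities, mirrored around a vertical centre line. The sample values, the bandwidth, and the names `gaussian_kde`, `position`, and `half_width` are assumptions chosen for illustration; the chart's actual data and plotting settings are not known.

```python
import math

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]

# Placeholder token probabilities, invented for illustration only.
samples = [0.25, 0.30, 0.35, 0.40, 0.40, 0.45, 0.50, 0.60]
grid = [i / 100 for i in range(101)]  # y-axis range 0.0 to 1.0
density = gaussian_kde(samples, grid, bandwidth=0.05)

# A violin mirrors the density around a vertical centre line at x = position.
# The horizontal extent at each y value is proportional to the density there.
position = 1.0     # hypothetical x coordinate of this violin
half_width = 0.4   # hypothetical maximum half-width of the violin
peak = max(density)
left_edge = [position - half_width * d / peak for d in density]
right_edge = [position + half_width * d / peak for d in density]
```

In this construction the horizontal width of a violin at a given height encodes how many tokens have roughly that probability, while the vertical extent shows the spread of the distribution.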
### Components/Axes
- **X-axis**: Model variants (LLaMA-3-8B, Mistral-7B-v0.3)
- **Y-axis**: Token Probability (0.0–1.0)
- **Legend**:
  - Green: Factual Associations
  - Blue: Associated Hallucinations
  - Red: Unassociated Hallucinations
- **Spatial Grounding**:
  - Legend positioned at bottom center
  - Model labels centered below each group of three violin plots
  - Y-axis label set vertically along the left edge
### Detailed Analysis
1. **LLaMA-3-8B**:
   - **Factual Associations (Green)**:
     - Median ≈ 0.4 (IQR: 0.3–0.5)
     - Distribution skewed toward lower probabilities
   - **Associated Hallucinations (Blue)**:
     - Median ≈ 0.3 (IQR: 0.2–0.4)
     - Narrower spread than factual associations
   - **Unassociated Hallucinations (Red)**:
     - Median ≈ 0.1 (IQR: 0.05–0.15)
     - Tightest distribution
2. **Mistral-7B-v0.3**:
   - **Factual Associations (Green)**:
     - Median ≈ 0.5 (IQR: 0.4–0.6)
     - Broader spread than LLaMA-3-8B
   - **Associated Hallucinations (Blue)**:
     - Median ≈ 0.4 (IQR: 0.3–0.5)
     - Similar spread to factual associations
   - **Unassociated Hallucinations (Red)**:
     - Median ≈ 0.1 (IQR: 0.05–0.15)
     - Same approximate median and IQR as LLaMA-3-8B
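
The median and IQR figures above are read off the chart. As a minimal sketch of how the same summary statistics could be computed from raw per-token probabilities, the example below uses Python's standard `statistics` module. The helper name `summarize` and the sample values are hypothetical, and the quartile method behind the chart is not known; `method="inclusive"` is one reasonable choice, not necessarily the one used.

```python
import statistics

def summarize(probs):
    """Return (median, q1, q3) for a list of token probabilities."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(probs, n=4, method="inclusive")
    return statistics.median(probs), q1, q3

# Placeholder values, invented for illustration; not data from the chart.
example_probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
median, q1, q3 = summarize(example_probs)
print(f"median={median:.2f}, IQR={q1:.2f}-{q3:.2f}")
```

Applying `summarize` to each model and category separately would yield one median and one IQR per violin, matching the layout of the list above.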
### Key Observations
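- Mistral-7B-v0.3 shows **higher median token probabilities** than LLaMA-3-8B for both factual associations (≈0.5 vs. ≈0.4, about +25%) and associated hallucinations (≈0.4 vs. ≈0.3, about +33%).
- **Unassociated hallucinations** show no visible difference between models (median ≈ 0.1 for both).
- Mistral's distributions appear to span a **wider range** of probabilities for factual associations and associated hallucinations, suggesting less consistent token probabilities in those categories. The approximate IQR widths listed above are nonetheless equal (0.2) for both models, so any extra spread would lie in the tails. The unassociated-hallucination distributions are nearly identical.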
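
The percentage differences quoted above follow directly from the approximate medians in the Detailed Analysis section. The short check below reproduces that arithmetic; the dictionary layout and the helper `relative_change` are illustrative names, and the inputs are chart readings rather than exact values.

```python
# Approximate medians as read from the chart (see Detailed Analysis).
medians = {
    "LLaMA-3-8B": {"factual": 0.4, "associated": 0.3, "unassociated": 0.1},
    "Mistral-7B-v0.3": {"factual": 0.5, "associated": 0.4, "unassociated": 0.1},
}

def relative_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0

# Change in median from LLaMA-3-8B (baseline) to Mistral-7B-v0.3, per category.
changes = {
    category: relative_change(
        medians["LLaMA-3-8B"][category],
        medians["Mistral-7B-v0.3"][category],
    )
    for category in medians["LLaMA-3-8B"]
}

for category, pct in changes.items():
    print(f"{category}: {pct:+.0f}%")
```

The absolute gap is about 0.1 in both of the first two categories; the relative change is larger for associated hallucinations only because its baseline median is lower.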
### Interpretation
The data indicates that Mistral-7B-v0.3 assigns higher token probabilities than LLaMA-3-8B to factual associations, but it does the same for associated hallucinations, so the chart does not show a reduction in associated hallucinations. In both models the median for factual associations sits about 0.1 above the median for associated hallucinations, so token probability separates these two categories about equally well in each model. Unassociated hallucinations receive similarly low probabilities (median ≈ 0.1) in both models, which suggests that spurious tokens unrelated to the input context are handled alike. The broader spread of Mistral's factual-association distribution may reflect less uniform confidence across inputs; the chart alone cannot attribute it to contextual understanding or to architectural differences, and the variability warrants further investigation into output reliability.