## Line Graphs: Model Performance Across Formal and Functional Competence
### Overview
The image contains four grouped line graphs comparing model performance on formal and functional competence tasks. Each graph corresponds to a different Pythia configuration (1B, 2.8B, 6.9B, and an aggregate of five models), with two subplots per graph: formal competence (top) and functional competence (bottom). The graphs plot normalized accuracy against the number of tokens, with shaded regions indicating 95% confidence intervals.
### Components/Axes
- **X-axis**: Number of Tokens (up to 256, logarithmic scale)
- **Y-axis**: Normalized Accuracy (-0.1 to 0.8)
- **Legends**:
  - **Formal Competence**:
    - BLiMP (light blue circles)
    - SyntaxGym (light blue crosses)
  - **Functional Competence**:
    - ARC-Easy (dark blue circles)
    - PIQA (dark blue crosses)
    - Social-IQA (dark blue squares)
    - ARC Challenge (dark blue diamonds)
    - HellaSwag (dark blue stars)
    - WinoGrande (dark blue plus signs)
- **Shading**: Light blue regions represent 95% confidence intervals.
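The y-axis extends below zero, which is consistent with accuracy being normalized against a chance-level baseline. The figure does not state the exact scheme; a common convention, shown here purely as an assumption, rescales raw accuracy so that chance performance maps to 0 and perfect performance maps to 1:

```python
def normalized_accuracy(raw_acc: float, chance: float) -> float:
    """Rescale raw accuracy so that chance -> 0.0 and perfect -> 1.0.

    NOTE: this normalization scheme is an assumption; the figure does not
    specify how "Normalized Accuracy" is computed.
    """
    return (raw_acc - chance) / (1.0 - chance)

# A 4-way multiple-choice task (e.g., ARC) has a chance baseline of 0.25.
print(normalized_accuracy(0.25, 0.25))  # chance-level performance -> 0.0
print(normalized_accuracy(0.20, 0.25))  # below-chance performance -> negative
```

Under this convention, the slightly negative values described below for Social-IQA and ARC Challenge simply indicate below-chance performance early in training.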
### Detailed Analysis
#### Formal Competence (Top Subplots)
- **Pythia-1B (a)**:
- BLiMP and SyntaxGym start at ~0.1 accuracy, plateauing at ~0.7-0.8 after ~100 tokens.
- Confidence intervals narrow significantly after 100 tokens.
- **Pythia-2.8B (b)**:
  - Trend similar to the 1B model, with slightly higher initial accuracy (~0.2 vs. ~0.1).
- Plateaus at ~0.7-0.8 with tighter confidence intervals.
- **Pythia-6.9B (c)**:
- Rapid rise to ~0.7 accuracy by ~50 tokens, plateauing at ~0.8.
- Confidence intervals remain narrow throughout.
- **Pythia (5 Models) (d)**:
- Combines results from multiple models, showing ~0.7 accuracy by ~100 tokens.
- Confidence intervals widen slightly compared to single-model graphs.
#### Functional Competence (Bottom Subplots)
- **Pythia-1B (a)**:
- ARC-Easy and PIQA start near 0, peaking at ~0.3-0.4 after ~200 tokens.
  - Social-IQA and ARC Challenge hover near or below zero (-0.1 to 0.1) initially.
- HellaSwag and WinoGrande plateau at ~0.2-0.3.
- **Pythia-2.8B (b)**:
- ARC-Easy and PIQA reach ~0.4-0.5, with Social-IQA and ARC Challenge improving to ~0.1-0.2.
- HellaSwag and WinoGrande plateau at ~0.3-0.4.
- **Pythia-6.9B (c)**:
- ARC-Easy and PIQA peak at ~0.5-0.6, with Social-IQA and ARC Challenge reaching ~0.3-0.4.
- HellaSwag and WinoGrande plateau at ~0.4-0.5.
- **Pythia (5 Models) (d)**:
  - Aggregates results across models, with most tasks reaching ~0.4-0.5 accuracy.
- Confidence intervals are wider, indicating higher variability.
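The layout described above (formal subplot on top, functional below, log-scale x-axis, shaded 95% confidence bands) could be sketched as follows. The data here is synthetic and the curve shapes are hypothetical; only the plot structure mirrors the figure:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic stand-ins for the described trends: formal competence rises
# quickly and plateaus high; functional competence rises slower and lower.
tokens = np.logspace(0, np.log10(256), 20)      # log-spaced token counts
formal = 0.8 * (1 - np.exp(-tokens / 50))
functional = 0.4 * (1 - np.exp(-tokens / 120))
ci = 0.05                                       # constant CI half-width (illustrative)

fig, (ax_top, ax_bot) = plt.subplots(2, 1, sharex=True)
for ax, curve, label in [(ax_top, formal, "BLiMP (formal)"),
                         (ax_bot, functional, "ARC-Easy (functional)")]:
    ax.plot(tokens, curve, marker="o", label=label)
    ax.fill_between(tokens, curve - ci, curve + ci, alpha=0.3)  # shaded 95% CI band
    ax.set_xscale("log")
    ax.set_ylabel("Normalized Accuracy")
    ax.legend()
ax_bot.set_xlabel("Number of Tokens")
fig.savefig("pythia_sketch.png")
```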
### Key Observations
1. **Formal vs. Functional Tasks**:
   - Models score consistently higher on formal tasks (BLiMP, SyntaxGym) than on functional tasks across all model sizes.
   - Functional tasks show greater variability, with some tasks (e.g., Social-IQA, ARC Challenge) remaining near chance for smaller models.
2. **Model Size Impact**:
- Larger models (6.9B) achieve higher accuracy in both task types compared to smaller models (1B, 2.8B).
   - The "5 Models" graph (d) suggests that aggregating models improves functional task performance, but at increased computational cost (a 5.6% training-time figure is noted).
3. **Confidence Intervals**:
- Functional tasks exhibit wider confidence intervals, indicating less reliable performance estimates.
- Formal tasks show tighter intervals, suggesting more stable results.
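The shaded bands in the "5 Models" panel presumably summarize variability across the five models. One standard construction, assumed here since the figure does not state its method, is a normal-approximation 95% interval, mean ± 1.96 × standard error:

```python
import math

def mean_and_ci95(scores):
    """Return (mean, half-width of a 95% CI) via a normal approximation.

    Assumption: the shaded bands are normal-approximation intervals over
    per-model scores; the figure itself does not state the method used.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)                # 1.96 * standard error
    return mean, half_width

# Hypothetical per-model accuracies for five models at one checkpoint.
m, hw = mean_and_ci95([0.42, 0.45, 0.40, 0.47, 0.43])
print(f"{m:.3f} +/- {hw:.3f}")
```

With only five models, the standard error is large relative to the spread, which is consistent with the wider bands observed in the aggregate panel.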
### Interpretation
The data demonstrates that larger language models (e.g., Pythia-6.9B) excel in formal linguistic tasks (e.g., syntax, grammar) but struggle with functional reasoning (e.g., commonsense, logic). Functional tasks require more tokens to reach stable performance, and smaller models often fail to achieve meaningful accuracy. The "5 Models" graph highlights a trade-off between performance gains and training efficiency, as combining models improves functional task results but increases computational overhead. The shaded confidence intervals emphasize the uncertainty in functional task evaluations, suggesting these tasks may require more robust evaluation frameworks.