## Multi-Panel Line Chart: Pythia Model Performance Across Training Tokens
### Overview
This image contains eight line charts arranged in a 2x4 grid, displaying the performance of different-sized Pythia language models on various benchmarks as a function of training data (number of tokens). The charts are grouped into two rows representing two broad categories of model capability: "Formal Competence" (top row) and "Functional Competence" (bottom row). The columns correspond to different model sizes: (a) Pythia-1B, (b) Pythia-2.8B, (c) Pythia-6.9B, and (d) an aggregate of 5 Pythia models. A comprehensive legend is provided at the bottom.
### Components/Axes
* **Chart Titles (Column Headers):**
* (a) Pythia-1B
* (b) Pythia-2.8B
* (c) Pythia-6.9B
* (d) Pythia (5 Models)
* **Row Labels (Y-axis Titles for each row):**
* Top Row: "Formal Competence"
* Bottom Row: "Functional Competence"
* **Axes (for all 8 charts):**
  * **X-axis:** "Number of Tokens". Tick spacing is logarithmic at low token counts (successive ticks roughly double: 2M, 4M, 8M, ...) and becomes approximately linear above ~100B. Major tick marks: 0, 2M, 4M, 8M, 16M, 32M, 64M, 128M, 256M, 512M, 1B, 2B, 4B, 8B, 16B, 32B, 64B, 100B, 128B, 144B, 160B, 176B, 192B, 208B, 224B, 256B, 288B. (A true logarithmic scale cannot include 0, so the origin is presumably special-cased.)
* **Y-axis:** "Normalized Accuracy". Scale is linear, ranging from approximately -0.1 to 0.8 or 0.9 depending on the chart.
* **Legend (Bottom of image):**
* **Formal Competence:**
* Light blue circle marker: **BLiMP**
* Light blue 'x' marker: **SyntaxGym**
* **Functional Competence:**
* Medium blue circle marker: **ARC-Easy**
* Medium blue 'x' marker: **PIQA**
* Medium blue square marker: **Social-IQA**
* Dark blue diamond marker: **ARC Challenge**
* Dark blue star/asterisk marker: **HellaSwag**
* Dark blue plus '+' marker: **WinoGrande**
* **Annotation (Panel d, bottom row):** A bracket spanning from ~0 to ~16B tokens with the text "5.6% of training time".
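The figure does not define "Normalized Accuracy". A common convention, assumed here, rescales raw accuracy by the chance baseline so that chance performance maps to 0 and perfect performance to 1; this would also explain the slightly negative values visible near the bottom of the y-axis:

```python
def normalized_accuracy(raw: float, chance: float) -> float:
    """Rescale raw accuracy so chance -> 0.0 and perfect -> 1.0.

    NOTE: this is a common convention, assumed here; the figure
    itself does not specify which normalization it uses.
    """
    return (raw - chance) / (1.0 - chance)

# A 4-way multiple-choice task (chance = 0.25):
print(normalized_accuracy(0.75, 0.25))  # well above chance (2/3)
print(normalized_accuracy(0.25, 0.25))  # exactly chance level (0.0)
print(normalized_accuracy(0.20, 0.25))  # below chance: negative
```

Under this convention, the "at or below zero" values described for the harder functional tasks simply mean those checkpoints perform at or below random guessing.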
### Detailed Analysis
**Top Row - Formal Competence (BLiMP & SyntaxGym):**
* **Trend:** For all model sizes (1B, 2.8B, 6.9B, and the 5-model aggregate), both BLiMP and SyntaxGym show a very similar pattern. Performance remains low and relatively flat (around 0.1-0.25 normalized accuracy) for training token counts up to approximately 512M-1B tokens.
* **Key Transition:** Between 1B and 4B tokens, there is a sharp, near-vertical increase in accuracy for both benchmarks.
* **Plateau:** After ~4B tokens, performance plateaus. SyntaxGym plateaus at a higher level (~0.8-0.85) than BLiMP (~0.65-0.7). This plateau is consistent across all model sizes.
* **Data Points (Approximate Plateau Values):**
* **BLiMP:** ~0.65 (1B), ~0.66 (2.8B), ~0.65 (6.9B), ~0.65 (5 Models).
* **SyntaxGym:** ~0.82 (1B), ~0.83 (2.8B), ~0.84 (6.9B), ~0.83 (5 Models).
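The flat, then sharply rising, then plateauing shape described above resembles a logistic curve in log-token space. The sketch below uses invented parameters chosen only to mimic that shape (floor ~0.2, plateau ~0.83, transition centered near 2B tokens); none of these values are taken from the figure:

```python
import math

def formal_competence_curve(tokens: float,
                            floor: float = 0.2,     # pre-transition level (invented)
                            plateau: float = 0.83,  # post-transition level (invented)
                            center: float = 2e9,    # transition midpoint in tokens (invented)
                            steepness: float = 6.0) -> float:
    """Logistic curve in log10(token) space: flat, sharp rise, plateau."""
    x = math.log10(tokens) - math.log10(center)
    return floor + (plateau - floor) / (1.0 + math.exp(-steepness * x))

for t in [1e8, 1e9, 2e9, 4e9, 1e11]:
    print(f"{t:>8.0e} tokens -> {formal_competence_curve(t):.2f}")
```

In log space the curve stays near the floor until ~1B tokens, crosses its midpoint at the (hypothetical) 2B-token center, and saturates near the plateau by ~10B tokens, matching the "near-vertical increase" the panels show.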
**Bottom Row - Functional Competence (ARC-Easy, PIQA, Social-IQA, ARC Challenge, HellaSwag, WinoGrande):**
* **General Trend:** All six benchmarks show a more gradual and varied learning curve compared to the formal competence tasks. Performance generally improves with more training tokens, but the rate and final level differ significantly by task.
* **Task Hierarchy (by final performance):**
1. **ARC-Easy & PIQA (Top Performers):** These two tasks (medium blue circle and 'x') show the strongest and most consistent improvement. They start near 0, begin a steady rise around 256M-512M tokens, and continue climbing, reaching ~0.4-0.5 normalized accuracy by 288B tokens. Their curves are closely aligned.
2. **Social-IQA (Mid-tier):** The medium blue square line shows moderate improvement, starting near 0 and rising to approximately 0.1-0.2 by 288B tokens.
3. **ARC Challenge, HellaSwag, WinoGrande (Lower Performers):** These three tasks (dark blue diamond, star, plus) show the slowest growth. They often start at or below zero normalized accuracy. They begin to rise noticeably only after 1B-2B tokens and reach final values between ~0.0 and ~0.25, with significant variance between tasks and model sizes. HellaSwag (star) often shows the lowest performance.
* **Model Size Comparison:** Larger models (2.8B, 6.9B) generally achieve slightly higher final accuracy on these functional tasks than the 1B model, but the overall shape of the learning curves is consistent.
* **5-Model Aggregate (Panel d):** This chart includes shaded error bands, indicating variance across the five models. The "5.6% of training time" annotation highlights that the initial, low-performance phase constitutes a small fraction of the total training budget before significant gains are observed.
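The "5.6% of training time" annotation is consistent with simple arithmetic, assuming the bracket spans roughly the first 16B of a ~288B-token run (the rightmost axis tick; both figures are read off the chart, not stated in it):

```python
transition_tokens = 16e9  # approximate end of the bracketed low-performance phase
total_tokens = 288e9      # rightmost axis tick; assumed to be the full training run

fraction = transition_tokens / total_tokens
print(f"{fraction:.1%}")  # ~5.6%
```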
### Key Observations
1. **Phase Transition:** A dramatic phase transition occurs for the formal benchmarks between roughly 512M and 4B training tokens, and the functional benchmarks begin their more gradual ascent within the same window. This suggests a critical point in training where fundamental capabilities are acquired.
2. **Competence Dichotomy:** There is a clear separation between "Formal Competence" (linguistic syntax/grammar tasks like BLiMP, SyntaxGym) and "Functional Competence" (reasoning/knowledge tasks like ARC, PIQA). Formal competence is mastered quickly and to a high level after the phase transition, while functional competence improves more gradually and plateaus at lower levels.
3. **Task Difficulty Spectrum:** Within functional competence, a clear hierarchy of difficulty is evident, with ARC-Easy/PIQA being "easier" than Social-IQA, which is easier than ARC Challenge/HellaSwag/WinoGrande.
4. **Scalability:** The patterns are remarkably consistent across model sizes (1B to 6.9B parameters), indicating that these learning dynamics are a property of the training process and data, not just model scale. Larger models show modest performance gains but follow the same trajectory.
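For readers who want to reproduce the described layout, a minimal matplotlib sketch of the 2x4 grid follows (panel titles, row labels, and a symlog x-axis to accommodate the 0 tick). All styling details are guesses, not taken from the original figure:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

PANEL_TITLES = ["(a) Pythia-1B", "(b) Pythia-2.8B",
                "(c) Pythia-6.9B", "(d) Pythia (5 Models)"]
ROW_LABELS = ["Formal Competence", "Functional Competence"]

fig, axes = plt.subplots(2, 4, figsize=(16, 6), sharex=True, sharey="row")
for row in range(2):
    for col in range(4):
        ax = axes[row, col]
        # symlog handles the 0 tick that a pure log scale cannot represent
        ax.set_xscale("symlog")
        ax.set_ylim(-0.1, 0.9)
        if row == 0:
            ax.set_title(PANEL_TITLES[col])
    # the row label doubles as the shared y-axis title, as in the figure
    axes[row, 0].set_ylabel(f"{ROW_LABELS[row]}\nNormalized Accuracy")
for ax in axes[1]:
    ax.set_xlabel("Number of Tokens")
fig.tight_layout()
```

A symlog scale is one plausible way to reconcile the 0 tick with the doubling tick pattern; the original figure may instead simply clip or relabel the origin.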
### Interpretation
The data demonstrates a fundamental characteristic of large language model training: **capability acquisition is not linear**. Models spend only the first ~5.6% of training tokens (up to ~16B) in a low-competence state, building basic statistical regularities. Then a rapid phase transition occurs: core linguistic (formal) abilities emerge almost at once, and improvement on a wide range of reasoning-oriented (functional) benchmarks begins in the same narrow window, even though it proceeds more gradually.
The stark difference between the high, flat plateaus of formal competence and the lower, still-rising curves of functional competence suggests that mastering syntactic structure is a prerequisite that is achieved relatively "easily" once sufficient data is seen. In contrast, the knowledge and complex reasoning required for functional tasks are harder to acquire and may continue to improve with even more data or different training approaches. The consistency across model sizes implies these are robust phenomena in the scaling of transformer-based LMs trained on natural language corpora. The charts effectively argue that "more data" leads to a predictable, non-linear unlocking of capabilities, with different skill sets emerging on different timelines.