## Chart: Pythia Model Performance
### Overview
The image presents four line charts comparing the performance of Pythia language models on formal and functional competence tasks: three individual model sizes (1B, 2.8B, and 6.9B) and a fourth panel aggregating 5 models. The charts plot normalized accuracy against the number of tokens processed during training.
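The y-axis metric is "normalized accuracy." The chart does not state how the normalization is computed; a common convention, shown here purely as an assumption, rescales raw accuracy so that chance-level performance maps to 0 (which would also explain the slightly negative early values in the bottom row):

```python
def normalized_accuracy(raw_acc, chance_acc):
    """Rescale raw accuracy so chance performance maps to 0.0
    and perfect performance maps to 1.0 (assumed convention,
    not confirmed by the chart)."""
    return (raw_acc - chance_acc) / (1.0 - chance_acc)

# Example: a 4-choice task (chance = 0.25) scored at 60% raw accuracy.
print(round(normalized_accuracy(0.60, 0.25), 3))  # -> 0.467
```

Under this convention, a model scoring slightly below chance early in training would plot just under 0.0, matching the bottom-row axis range of -0.1 to 0.5.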
### Components/Axes
* **Titles:**
* (a) Pythia-1B
* (b) Pythia-2.8B
* (c) Pythia-6.9B
* (d) Pythia (5 Models)
* **Y-Axis (Left):**
* Label: "Formal Competence" (Top Row), "Functional Competence" (Bottom Row)
* Sub-Label: "Normalized Accuracy"
* Scale: 0.0 to 0.8 (Top Row), -0.1 to 0.5 (Bottom Row), with increments of 0.1.
* **X-Axis (Bottom):**
* Label: "Number of Tokens"
* Scale: 0 to 256B, with tick marks at 2, 4, 8, 12, 16, 24, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, and 256 (all in billions of tokens, 'B'); the denser ticks at low token counts suggest the early-training region is emphasized.
* **Legend (Bottom):**
* **Formal Competence:**
* BLiMP (Light Blue, Circle Marker)
* SyntaxGym (Light Blue, X Marker)
* **Functional Competence:**
* ARC-Easy (Dark Blue, Circle Marker)
* PIQA (Light Blue, X Marker)
* Social-IQA (Dark Blue, Square Marker)
* ARC Challenge (Dark Blue, Diamond Marker)
* HellaSwag (Dark Blue, Triangle Marker)
* WinoGrande (Dark Blue, Plus Marker)
### Detailed Analysis
**Chart (a) Pythia-1B:**
* **Formal Competence:**
* BLiMP (Light Blue, Circle): Starts at approximately 0.05, remains relatively flat until 32B tokens, then increases to approximately 0.65 and plateaus.
* SyntaxGym (Light Blue, X): Starts at approximately 0.25, dips slightly around 16B tokens, then increases sharply to approximately 0.78 and plateaus.
* **Functional Competence:**
* ARC-Easy (Dark Blue, Circle): Starts at approximately 0.0, increases to approximately 0.42 by 256B tokens.
* PIQA (Light Blue, X): Starts at approximately 0.0, increases to approximately 0.40 by 256B tokens.
* Social-IQA (Dark Blue, Square): Starts at approximately 0.0, increases to approximately 0.10 by 256B tokens.
* ARC Challenge (Dark Blue, Diamond): Starts at approximately -0.05, increases to approximately 0.15 by 256B tokens.
* HellaSwag (Dark Blue, Triangle): Starts at approximately 0.0, increases to approximately 0.10 by 256B tokens.
* WinoGrande (Dark Blue, Plus): Starts at approximately 0.0, increases to approximately 0.10 by 256B tokens.
**Chart (b) Pythia-2.8B:**
* **Formal Competence:**
* BLiMP (Light Blue, Circle): Starts at approximately 0.0, remains relatively flat until 32B tokens, then increases to approximately 0.70 and plateaus.
* SyntaxGym (Light Blue, X): Starts at approximately 0.25, dips slightly around 16B tokens, then increases sharply to approximately 0.80 and plateaus.
* **Functional Competence:**
* ARC-Easy (Dark Blue, Circle): Starts at approximately 0.0, increases to approximately 0.50 by 256B tokens.
* PIQA (Light Blue, X): Starts at approximately 0.0, increases to approximately 0.45 by 256B tokens.
* Social-IQA (Dark Blue, Square): Starts at approximately 0.0, increases to approximately 0.20 by 256B tokens.
* ARC Challenge (Dark Blue, Diamond): Starts at approximately -0.05, increases to approximately 0.30 by 256B tokens.
* HellaSwag (Dark Blue, Triangle): Starts at approximately 0.0, increases to approximately 0.20 by 256B tokens.
* WinoGrande (Dark Blue, Plus): Starts at approximately 0.0, increases to approximately 0.20 by 256B tokens.
**Chart (c) Pythia-6.9B:**
* **Formal Competence:**
* BLiMP (Light Blue, Circle): Starts at approximately 0.15, remains relatively flat until 32B tokens, then increases to approximately 0.65 and plateaus.
* SyntaxGym (Light Blue, X): Starts at approximately 0.25, dips slightly around 16B tokens, then increases sharply to approximately 0.82 and plateaus.
* **Functional Competence:**
* ARC-Easy (Dark Blue, Circle): Starts at approximately 0.0, increases to approximately 0.52 by 256B tokens.
* PIQA (Light Blue, X): Starts at approximately 0.0, increases to approximately 0.48 by 256B tokens.
* Social-IQA (Dark Blue, Square): Starts at approximately 0.0, increases to approximately 0.22 by 256B tokens.
* ARC Challenge (Dark Blue, Diamond): Starts at approximately -0.05, increases to approximately 0.32 by 256B tokens.
* HellaSwag (Dark Blue, Triangle): Starts at approximately 0.0, increases to approximately 0.22 by 256B tokens.
* WinoGrande (Dark Blue, Plus): Starts at approximately 0.0, increases to approximately 0.22 by 256B tokens.
**Chart (d) Pythia (5 Models):**
* **Formal Competence:**
* BLiMP (Light Blue, Circle): Starts at approximately 0.05, remains relatively flat until 32B tokens, then increases to approximately 0.65 and plateaus.
* SyntaxGym (Light Blue, X): Starts at approximately 0.25, dips slightly around 16B tokens, then increases sharply to approximately 0.82 and plateaus.
* **Functional Competence:** For each task, the line shows the average across the 5 models and the shaded region shows the range of performance across them.
* ARC-Easy (Dark Blue, Circle): Average starts at approximately 0.0, increases to approximately 0.45 by 256B tokens.
* PIQA (Light Blue, X): Average starts at approximately 0.0, increases to approximately 0.42 by 256B tokens.
* Social-IQA (Dark Blue, Square): Average starts at approximately 0.0, increases to approximately 0.15 by 256B tokens.
* ARC Challenge (Dark Blue, Diamond): Average starts at approximately -0.05, increases to approximately 0.20 by 256B tokens.
* HellaSwag (Dark Blue, Triangle): Average starts at approximately 0.0, increases to approximately 0.15 by 256B tokens.
* WinoGrande (Dark Blue, Plus): Average starts at approximately 0.0, increases to approximately 0.15 by 256B tokens.
* **Annotation:**
* "5.6% of training time" is indicated by a bracket above the x-axis, spanning from 0 to approximately 16B tokens.
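The total training budget is not shown on the chart, but the "5.6%" annotation and the ~16B-token bracket together imply one. A quick back-of-envelope check (assuming the annotation refers to token count rather than wall-clock time):

```python
bracket_end_tokens_b = 16      # bracket spans 0 to ~16B tokens (from the chart)
fraction_of_training = 0.056   # "5.6% of training time" annotation

# Implied total training tokens, in billions. This assumes "training time"
# is proportional to tokens processed; the chart does not confirm this.
implied_total_b = bracket_end_tokens_b / fraction_of_training
print(f"{implied_total_b:.0f}B tokens")  # -> 286B tokens
```

The implied total (~286B tokens) is somewhat larger than the 256B shown on the x-axis, consistent with the axis truncating before the very end of training.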
### Key Observations
* **Formal Competence:** BLiMP and SyntaxGym show a sharp rise in normalized accuracy between roughly 16B and 32B tokens in every panel, after which both plateau. SyntaxGym consistently ends higher than BLiMP (approximately 0.78-0.82 versus 0.65-0.70).
* **Functional Competence:** Performance on functional competence tasks is far more spread out: ARC-Easy and PIQA climb to roughly 0.40-0.52 by 256B tokens, while Social-IQA, ARC Challenge, HellaSwag, and WinoGrande remain at or below roughly 0.32.
* **Model Size:** Increasing the model size from 1B to 6.9B generally improves performance on functional competence tasks (e.g., ARC Challenge rises from roughly 0.15 to 0.32 by 256B tokens), whereas formal competence saturates at similar levels across all sizes.
* **Training Time:** The annotation on chart (d) indicates that the initial 5.6% of training time (up to 16B tokens) corresponds to a period of relatively low performance, especially for formal competence tasks.
### Interpretation
The charts demonstrate how model size and training progress shape the performance of Pythia language models. The sharp rise in formal competence between roughly 16B and 32B tokens suggests a distinct phase early in training during which syntactic knowledge is acquired, whereas functional competence tasks improve gradually throughout training. The spread across functional tasks highlights the models' strengths and weaknesses in specific areas of reasoning, and the shaded regions in chart (d) show how much performance varies across the five models in the Pythia family. Overall, larger models and longer training improve performance, but the size of the gain depends strongly on the task.
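The mean-line-plus-shaded-band presentation in chart (d) can be reproduced from per-model accuracy curves. A minimal pure-Python sketch, using made-up numbers for five hypothetical runs at three checkpoints (the values are illustrative, not read from the chart):

```python
# Illustrative normalized accuracies for 5 hypothetical models at
# 3 training checkpoints (rows = models, columns = checkpoints).
runs = [
    [0.00, 0.20, 0.44],
    [0.01, 0.22, 0.46],
    [0.02, 0.18, 0.43],
    [0.00, 0.21, 0.47],
    [0.02, 0.19, 0.45],
]

# Per-checkpoint mean (plotted as the line) and min/max (the band edges).
mean = [sum(col) / len(col) for col in zip(*runs)]
lo = [min(col) for col in zip(*runs)]
hi = [max(col) for col in zip(*runs)]

print([round(m, 3) for m in mean])  # -> [0.01, 0.2, 0.45]
```

A plotting library would then draw `mean` as the line and fill the region between `lo` and `hi` as the shaded band; whether the band in the actual figure shows the min-max range or a standard deviation is not stated, so min-max is an assumption here.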