## Line Chart with Confidence Intervals: Brain Alignment vs. Training Tokens for Three Model Sizes
### Overview
The image displays three horizontally aligned line charts, each representing a different model size (14M, 70M, 160M parameters). Each chart plots "Brain Alignment (Pearson's r)" on the y-axis against the "Number of Tokens" (on a logarithmic scale) on the x-axis. Two data series are shown in each plot: "Language Network" (green line with circle markers) and "V1" (purple line with 'x' markers), each accompanied by a shaded region representing uncertainty or confidence intervals. The charts collectively illustrate how the alignment of model representations with two distinct brain regions evolves as a function of training data quantity and model scale.
### Components/Axes
* **Titles:** Three subplot titles are positioned at the top center of each panel: **14M**, **70M**, and **160M**.
* **Y-Axis (All Panels):**
* **Label:** "Brain Alignment (Pearson's r)"
* **Scale:** Linear, ranging from -0.025 to 0.150.
* **Major Ticks:** -0.025, 0.000, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150.
* **X-Axis (All Panels):**
* **Label:** "Number of Tokens"
* **Scale:** Tick spacing doubles (base 2) from 2M up to 16B, then increases linearly in 20B increments; a "0" tick precedes the doubling portion, so the axis is likely a symlog or piecewise scale rather than a pure logarithm.
* **Tick Labels (Identical for all panels):**
| Segment | Tick Labels |
| :--- | :--- |
| Doubling (base 2) | 0, 2M, 4M, 8M, 16M, 32M, 64M, 128M, 256M, 512M, 1B, 2B, 4B, 8B, 16B |
| Linear (20B steps) | 20B, 40B, 60B, 80B, 100B, 120B, 140B, 160B, 180B, 200B, 220B, 240B, 260B, 280B, 286B |
* **Legend:** Positioned at the bottom center of the entire figure, below the three charts.
* **Title:** "Region"
* **Series 1:** A green line with a circle marker labeled "Language Network".
* **Series 2:** A purple line with an 'x' marker labeled "V1".
* **Vertical Reference Line:** A solid black vertical line is drawn at the **16B** token mark in each of the three subplots, coinciding with the point where the tick spacing changes from doubling to linear.
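The layout described above can be sketched with matplotlib. Everything in this snippet is illustrative: the data are synthetic placeholders, and the exact scale type, colors, and band widths of the original figure are assumptions, not recovered values.

```python
# Illustrative reconstruction of the three-panel layout.
# All series values are synthetic, NOT the figure's real data.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

tokens = np.array([2e6 * 2**i for i in range(14)])  # 2M .. ~16B, doubling
rng = np.random.default_rng(0)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5), sharey=True)
for ax, title in zip(axes, ["14M", "70M", "160M"]):
    # Synthetic S-curve for the Language Network; flat noise for V1
    lang = 0.05 + 0.075 / (1 + np.exp(-(np.log10(tokens) - 9)))
    v1 = 0.015 + 0.01 * rng.standard_normal(len(tokens))
    for y, color, marker, label in [(lang, "green", "o", "Language Network"),
                                    (v1, "purple", "x", "V1")]:
        ax.plot(tokens, y, color=color, marker=marker, label=label)
        ax.fill_between(tokens, y - 0.01, y + 0.01, color=color, alpha=0.2)
    ax.axvline(16e9, color="black")  # vertical reference line at 16B tokens
    ax.set_xscale("log")
    ax.set_title(title)
    ax.set_xlabel("Number of Tokens")
axes[0].set_ylabel("Brain Alignment (Pearson's r)")
fig.legend(*axes[0].get_legend_handles_labels(),
           title="Region", loc="lower center", ncol=2)
```

A plain `log` x-scale is used here for simplicity; reproducing the original axis exactly (a 0 tick, then doubling, then linear 20B steps) would require a symlog or custom piecewise scale.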
### Detailed Analysis
**1. 14M Parameter Model (Left Panel):**
* **Language Network (Green):** Starts at ~0.060 at 0 tokens. Shows a slight, gradual decline until 512M tokens (~0.050). Experiences a sharp increase starting at 1B tokens, crossing 0.100 by 4B tokens. Peaks at ~0.125 around 60B-80B tokens, then fluctuates slightly between 0.115 and 0.125 for the remainder of the training. The shaded green confidence band is widest in the early training phase (0-512M) and narrows significantly after the sharp rise.
* **V1 (Purple):** Remains consistently low, fluctuating between approximately -0.005 and 0.040 throughout training. Shows no clear upward trend. The highest point is ~0.040 at 256M tokens. The shaded purple band is relatively wide compared to the mean value, indicating high variance or uncertainty.
**2. 70M Parameter Model (Center Panel):**
* **Language Network (Green):** Starts at ~0.050. Remains flat until 128M tokens. Begins a steep ascent at 256M tokens, reaching ~0.100 by 2B tokens. Continues a steadier climb, surpassing 0.125 by 100B tokens and ending near 0.130 at 286B tokens. The confidence band is narrowest during the steep ascent phase.
* **V1 (Purple):** Similar to the 14M model, stays low and flat, mostly between 0.000 and 0.030. Shows minor fluctuations without a sustained increase.
**3. 160M Parameter Model (Right Panel):**
* **Language Network (Green):** Starts at ~0.050. Shows a slight dip around 64M-128M tokens (~0.040). Begins a rapid increase at 256M tokens, reaching ~0.115 by 4B tokens. Plateaus between 0.110 and 0.120 from 16B tokens onward. The confidence band is notably wide during the initial dip and the plateau phase.
* **V1 (Purple):** Again, shows a flat trend, hovering between 0.000 and 0.030. A slight dip to ~0.000 occurs at 512M tokens.
**Cross-Panel Trend Verification:**
* **Language Network Trend:** In all three models, the green line exhibits a characteristic "S-curve" or phase transition: a flat or slightly declining early phase, followed by a steep increase starting between 128M and 1B tokens, and finally a plateau or slower growth phase. The final alignment value is highest for the 70M model (~0.130) and slightly lower for the 14M and 160M models (~0.120-0.125).
* **V1 Trend:** The purple line is consistently flat and near zero across all model sizes and training durations, showing no meaningful alignment with the V1 visual cortex region.
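The S-curve behavior described above can be quantified by fitting a logistic function of log(tokens) and reading off the transition midpoint. This is a generic curve-fitting sketch on synthetic data; the figure's authors may not have used this method, and every value below is a placeholder.

```python
# Sketch: quantify the "phase transition" by fitting a logistic curve
#   r(t) = r0 + A / (1 + exp(-k * (log10(t) - m)))
# to alignment-vs-tokens data. All data here are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_t, r0, A, k, m):
    return r0 + A / (1 + np.exp(-k * (log_t - m)))

# Synthetic series resembling the 14M panel: flat ~0.055, rise near 1B,
# plateau ~0.12 at the end of training.
tokens = np.array([2e6 * 2**i for i in range(17)])  # 2M .. ~131B
log_t = np.log10(tokens)
true_r = logistic(log_t, 0.055, 0.065, 4.0, 9.0)    # midpoint at 10^9 = 1B
rng = np.random.default_rng(1)
r = true_r + 0.003 * rng.standard_normal(len(tokens))

params, _ = curve_fit(logistic, log_t, r, p0=[0.05, 0.07, 2.0, 9.5])
r0, A, k, m = params
print(f"transition midpoint ~ 10^{m:.2f} tokens")
```

The fitted midpoint `m` locates the token count at which alignment is halfway through its rise, giving a single number to compare across model sizes.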
### Key Observations
1. **Divergent Alignment:** There is a stark and consistent divergence between alignment with the Language Network (which grows significantly) and alignment with V1 (which remains negligible).
2. **Critical Token Threshold:** The most rapid improvement in Language Network alignment occurs after a model has been trained on a substantial amount of data (between 128M and 4B tokens, depending on the model). The vertical line at 16B tokens appears to mark a point where alignment has largely stabilized for the 14M and 160M models.
3. **Model Size Effect:** The 70M parameter model achieves the highest final alignment score. The 160M model does not outperform the 70M model, suggesting a non-linear relationship between model size and brain alignment for this metric.
4. **Uncertainty Patterns:** The confidence intervals for the Language Network are widest during periods of rapid change (the steep ascent) and in the very early training stages, suggesting greater variability in model representations during these phases.
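The shaded bands are described only as representing uncertainty or confidence intervals; one generic way such bands are computed is a bootstrap over the Pearson correlation. The helper names and data below are hypothetical, not the figure's actual pipeline.

```python
# Sketch: bootstrap confidence interval for Pearson's r, one generic way
# such shaded bands can be produced (the figure's actual method is unknown).
import numpy as np

def pearson_r(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = [pearson_r(x[idx], y[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical model features vs. brain responses, weakly correlated
rng = np.random.default_rng(42)
model_feat = rng.standard_normal(200)
brain_resp = 0.3 * model_feat + rng.standard_normal(200)
lo, hi = bootstrap_ci(model_feat, brain_resp)
print(f"r = {pearson_r(model_feat, brain_resp):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Wider bands during the steep-ascent phase would then correspond to larger spread in the resampled correlations at those checkpoints.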
### Interpretation
The data strongly suggests that as language models are trained on more data, their internal representations become increasingly similar to those found in the human brain's language network, but show no such similarity to the primary visual cortex (V1). This implies that the models are learning something functionally analogous to human language processing, rather than general visual processing.
The observed "phase transition" in alignment—where performance rapidly improves after a critical amount of training—is a key finding. It indicates that the development of brain-like language representations is not a gradual, linear process but may require a sufficient scale of both model parameters and training data to emerge. The fact that the 70M model outperforms the 160M model at the end of training is an important anomaly; it could indicate that for this specific alignment metric, simply increasing model size beyond a point yields diminishing returns, or that the 160M model's training trajectory diverged in a way that was less optimal for matching brain data.
The consistently low alignment with V1 acts as a crucial control, demonstrating that the high alignment with the language network is specific and meaningful, not an artifact of the measurement technique. Overall, the charts provide evidence that the computational principles learned by scaled language models during training spontaneously converge, to a measurable degree, with the representational patterns of the human language system.