## Scatter Plot Charts: Scaling Laws for Neural Network Parameters and Training Steps
### Overview
The image contains two side-by-side scatter plot charts on a white background. Both charts analyze relationships between computational resources (measured in PF-days) and model characteristics (parameters or training steps) on logarithmic scales. The charts appear to illustrate scaling laws, likely in the context of machine learning model training.
### Components/Axes
**Left Chart:**
* **Chart Type:** Scatter plot with logarithmic axes.
* **X-axis:** Label: `Compute (PF-days), non-embedding`. Scale: Logarithmic, ranging from approximately `10^-8` to `10^-1`.
* **Y-axis:** Label: `Parameters (non-embedding)`. Scale: Logarithmic, ranging from `10^2` to `10^8`.
* **Legend (Top-Left):**
* Blue dashed line: `N = (1.3 * 10^9) * C_min^0.73`
* Orange dashed line: `N = (1.6 * 10^9) * C_min^0.88`
* **Data Series:**
* Blue dots: Represent data points corresponding to the blue dashed line model.
* Orange dots: Represent data points corresponding to the orange dashed line model.
**Right Chart:**
* **Chart Type:** Line chart with circle markers on a semi-log scale (logarithmic x-axis, linear y-axis).
* **X-axis:** Label: `Compute (PF-days), excluding embeddings`. Scale: Logarithmic, ranging from approximately `10^-8` to `10^-1`.
* **Y-axis:** Label: `Steps`. Scale: Linear, ranging from `0` to `15000`.
* **Legend (Top-Left):**
* Blue line with circle markers: `S_min (adjusted)`
* Blue dashed line: `S_min = (5.4 * 10^3) * C_min^0.03`
* Orange line with circle markers: `S (fixed-batch)`
* **Data Series:**
* Blue line with markers (`S_min (adjusted)`): Shows a generally increasing trend with significant local variability and spikes.
* Orange line with markers (`S (fixed-batch)`): Shows a clear, steeply increasing trend, especially at higher compute values.
### Detailed Analysis
**Left Chart (Parameters vs. Compute):**
* **Trend Verification:** Both data series show a strong, positive, linear correlation on the log-log plot, indicating a power-law relationship. The blue series has a slightly shallower slope than the orange series.
* **Data Points & Equations:**
* The blue dots closely follow the trend line defined by the equation `N = (1.3 * 10^9) * C_min^0.73`. The exponent `0.73` indicates that model parameter count scales with compute to the power of 0.73 for this series.
* The orange dots closely follow the trend line defined by the equation `N = (1.6 * 10^9) * C_min^0.88`. The exponent `0.88` indicates a stronger scaling relationship for this series.
* At the lowest compute (`~10^-8` PF-days), the orange equation predicts roughly an order of magnitude fewer parameters than the blue equation (`~10^2` vs. `~2 * 10^3`). Because of its steeper slope, the orange line closes this gap steadily, and the two fits nearly converge at the highest plotted compute (`~10^-1` PF-days); extrapolating the equations, they would cross at roughly `0.25` PF-days, just beyond the plotted range.
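The relationship between the two fits can be checked directly from the legend equations; a minimal sketch (the function names are illustrative, not from the chart):

```python
# Evaluate the two parameter-count fits N(C) = a * C^b from the legend,
# with C in PF-days, over the plotted range 1e-8 to 1e-1.

def n_blue(c):
    """Blue fit: N = 1.3e9 * C^0.73."""
    return 1.3e9 * c ** 0.73

def n_orange(c):
    """Orange fit: N = 1.6e9 * C^0.88."""
    return 1.6e9 * c ** 0.88

# Crossover where a1 * C^b1 = a2 * C^b2  =>  C = (a2/a1) ** (1 / (b1 - b2))
crossover = (1.6e9 / 1.3e9) ** (1 / (0.73 - 0.88))

print(f"N_blue(1e-8)   = {n_blue(1e-8):.0f}")    # ~1900 parameters
print(f"N_orange(1e-8) = {n_orange(1e-8):.0f}")  # ~150 parameters
print(f"crossover at C = {crossover:.2f} PF-days")  # ~0.25, beyond 1e-1
```

Within the plotted range the blue fit therefore stays above the orange fit; the steeper orange slope only overtakes it past the right edge of the chart.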
**Right Chart (Steps vs. Compute):**
* **Trend Verification:** The `S (fixed-batch)` (orange) series shows a clear, accelerating upward trend. The `S_min (adjusted)` (blue) series shows a very shallow upward trend with high variance.
* **Data Points & Equations:**
* The blue dashed trend line for `S_min` is defined by `S_min = (5.4 * 10^3) * C_min^0.03`. The very small exponent (`0.03`) suggests the optimal number of training steps (`S_min`) is nearly independent of the total compute budget (`C_min`), increasing only marginally.
* The `S_min (adjusted)` (blue line) data points scatter around this shallow trend line but exhibit notable spikes (e.g., near `10^-5` and `10^-4` PF-days).
* The `S (fixed-batch)` (orange line) data points start near the `S_min` values at low compute but diverge sharply upward as compute increases. At `~10^-1` PF-days, the fixed-batch steps (approaching the top of the axis at `~15,000`) are roughly three times the adjusted optimal steps (`~5,000`).
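The near-flatness of the `S_min` fit is easy to quantify from its equation; a minimal sketch:

```python
# The S_min fit from the legend: S_min = 5.4e3 * C^0.03. The tiny exponent
# means the optimal step count barely moves across seven decades of compute.

def s_min(c):
    return 5.4e3 * c ** 0.03

lo, hi = s_min(1e-8), s_min(1e-1)
print(f"S_min(1e-8) = {lo:.0f} steps")  # ~3100
print(f"S_min(1e-1) = {hi:.0f} steps")  # ~5000
print(f"ratio across the plotted range: {hi / lo:.2f}x")  # ~1.6x
```

A 10-million-fold increase in compute changes the fitted optimal step count by less than a factor of two, which is what the shallow blue dashed line conveys visually.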
### Key Observations
1. **Power-Law Scaling:** The left chart confirms a power-law relationship between model size (parameters) and compute, a fundamental observation in neural scaling laws.
2. **Divergent Step Scaling:** The right chart reveals a critical insight: while the *optimal* training steps (`S_min`) scale very weakly with compute (exponent `0.03`), using a *fixed batch size* (`S (fixed-batch)`) forces a much stronger increase in steps (visually, an exponent much greater than 0.03).
3. **Efficiency Gap:** The growing vertical gap between the orange and blue lines in the right chart represents an efficiency loss. As more compute is allocated, a fixed-batch strategy requires disproportionately more training steps compared to an optimally adjusted strategy.
4. **Variability in Optimal Steps:** The spikes in the `S_min (adjusted)` data suggest that the optimal number of steps may be sensitive to specific configurations or exhibit non-monotonic behavior at certain compute scales.
### Interpretation
These charts together illustrate a core tension in scaling neural networks. The left chart shows the predictable, favorable scaling of model capacity with increased compute. The right chart exposes a crucial operational constraint: the `S_min` trend suggests the optimal training duration (in steps) stays nearly constant, which means the compute consumed per step (jointly, through larger models and larger batches) must grow almost in proportion to the total budget; holding the batch size fixed instead pushes the extra compute into additional steps.
The `S (fixed-batch)` line demonstrates the consequence of failing to adjust the batch size: training becomes inefficient, requiring many more steps (and thus more wall-clock time) to consume the same compute budget. The data argue for batch-size scaling as a critical component of efficient large-scale model training. The near-zero exponent for `S_min` is the key quantitative finding: for optimal efficiency, the training step count should be held roughly constant as models and compute scale up, with batch size (alongside model size) serving as the primary lever for absorbing additional resources.
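Combining the two fitted exponents gives a rough implied scaling for the optimal batch size, again under the assumed cost relation `C ≈ 6·N·B·S` (not shown in the charts):

```python
# Hedged sketch: with C ~ 6 * N * B * S (an assumed transformer cost
# approximation), B ~ C / (6 * N * S). Substituting the fitted exponents
# N ~ C^0.73 and S_min ~ C^0.03 gives B ~ C^(1 - 0.73 - 0.03).

n_exp, s_exp = 0.73, 0.03  # exponents read off the two legends
b_exp = 1 - n_exp - s_exp
print(f"implied batch-size exponent: {b_exp:.2f}")  # ~0.24

# Sanity check: per-step compute N * B then scales as C^(0.73 + 0.24),
# i.e. nearly all extra compute goes into bigger per-step work, not steps.
print(f"per-step compute exponent: {n_exp + b_exp:.2f}")  # ~0.97
```

Under this assumption, model size and batch size together absorb almost the entire compute increase, consistent with the nearly flat `S_min` trend.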