## Scaling Laws for Neural Language Models
Jared Kaplan∗ Johns Hopkins University, OpenAI jaredk@jhu.edu
Sam McCandlish∗ OpenAI sam@openai.com
Tom Henighan OpenAI henighan@openai.com
Tom B. Brown OpenAI tom@openai.com
Benjamin Chess OpenAI bchess@openai.com
Rewon Child OpenAI rewon@openai.com
Scott Gray OpenAI scott@openai.com
Alec Radford OpenAI alec@openai.com
Jeffrey Wu OpenAI jeffwu@openai.com
Dario Amodei OpenAI damodei@openai.com
## Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
∗ Equal contribution.
Contributions: Jared Kaplan and Sam McCandlish led the research. Tom Henighan contributed the LSTM experiments. Tom Brown, Rewon Child, Scott Gray, and Alec Radford developed the optimized Transformer implementation. Jeff Wu, Benjamin Chess, and Alec Radford developed the text datasets. Dario Amodei provided guidance throughout the project.
## Contents
1. Introduction
2. Background and Methods
3. Empirical Results and Basic Power Laws
4. Charting the Infinite Data Limit and Overfitting
5. Scaling Laws with Model Size and Training Time
6. Optimal Allocation of the Compute Budget
7. Related Work
8. Discussion

Appendices:
- A. Summary of Power Laws
- B. Empirical Model of Compute-Efficient Frontier
- C. Caveats
- D. Supplemental Figures
## 1 Introduction
Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world's text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state-of-the-art models [RNSS18, DCLT18, YDY+19, LOG+19, RSR+19] approaching human-level performance on many specific tasks [WPN+19], including the composition of coherent multi-paragraph prompted text samples [RWC+19].
One might expect language modeling performance to depend on model architecture, the size of neural models, the computing power used to train them, and the data available for this training process. In this work we will empirically investigate the dependence of language modeling loss on all of these factors, focusing on the Transformer architecture [VSP + 17, LSP + 18]. The high ceiling and low floor for performance on language tasks allows us to study trends over more than seven orders of magnitude in scale.
Throughout we will observe precise power-law scalings for performance as a function of training time, context length, dataset size, model size, and compute budget.
## 1.1 Summary
Our key findings for Transformer language models are as follows:
² Here we display predicted compute when using a sufficiently small batch size. See Figure 13 for comparison to the purely empirical data.
Figure 1 Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute² used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
*[Figure 1 panels: test loss vs. compute (PF-days, non-embedding), dataset size (tokens), and parameters (non-embedding), each on a logarithmic x-axis, with the power-law fits L = (C_min/2.3·10⁸)^−0.050, L = (D/5.4·10¹³)^−0.095, and L = (N/8.8·10¹³)^−0.076.]*
Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D , and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)
Smooth power laws: Performance has a power-law relationship with each of the three scale factors N,D,C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3)
Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases. The performance penalty depends predictably on the ratio N^0.74/D, meaning that every time we increase the model size 8x, we only need to increase the data by roughly 5x to avoid a penalty. (Section 4)
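As a quick check of the quoted 8x/5x figure (our own arithmetic, not an additional result): holding the ratio N^0.74/D fixed while N grows by a factor of 8 requires the data to grow by

$$8^{0.74} \approx 4.7,$$

i.e. roughly 5x.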
Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)
Transfer improves with test performance: When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss - in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)
Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).
Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence (see Figure 3). Maximally compute-efficient training would therefore be far more sample-efficient than one might expect based on training small models to convergence, with data requirements growing very slowly as D ∼ C^0.27 with training compute. (Section 6)
Optimal batch size: The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly 1-2 million tokens at convergence for the largest models we can train. (Section 5.1)
Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.
Figure 2 We show a series of language model training runs, with models ranging in size from 10^3 to 10^9 parameters (excluding embeddings).
*[Figure 2 panels: test loss vs. tokens processed (left) and vs. compute in PF-days (right), with curves colored by model size from 10³ to 10⁹ parameters; the right panel notes that compute-efficient training stops far short of convergence.]*
Figure 3 As more compute becomes available, we can choose how much to allocate towards training larger models, using larger batches, and training for more steps. We illustrate this for a billion-fold increase in compute. For optimally compute-efficient training, most of the increase should go towards increased model size. A relatively small increase in data is needed to avoid reuse. Of the increase in data, most can be used to increase parallelism through larger batch sizes, with only a very small increase in serial training time required.
*[Figure 3: multiplicative contribution vs. compute (PF-days), both log-scaled, for a billion-fold compute increase: >1,000,000× model size, ~100× batch size, and <10× serial steps; the minimum number of serial steps grows negligibly while the optimal model size grows very quickly and data requirements grow relatively slowly.]*
## 1.2 Summary of Scaling Laws
The test loss of a Transformer trained to autoregressively model language can be predicted using a power-law when performance is limited by only one of the number of non-embedding parameters N, the dataset size D, or the optimally allocated compute budget C_min (see Figure 1):
1. For models with a limited number of parameters, trained to convergence on sufficiently large datasets:
$$L(N) = \left( N_c / N \right)^{\alpha_N}; \quad \alpha_N \sim 0.076, \quad N_c \sim 8.8 \times 10^{13} \text{ (non-embedding parameters)} \quad (1.1)$$
2. For large models trained with a limited dataset with early stopping:
$$L(D) = \left( D_c / D \right)^{\alpha_D}; \quad \alpha_D \sim 0.095, \quad D_c \sim 5.4 \times 10^{13} \text{ (tokens)} \quad (1.2)$$
3. When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized model, and a sufficiently small batch size (making optimal³ use of compute):
$$L(C_{\min}) = \left( C_c^{\min} / C_{\min} \right)^{\alpha_C^{\min}}; \quad \alpha_C^{\min} \sim 0.050, \quad C_c^{\min} \sim 3.1 \times 10^{8} \text{ (PF-days)} \quad (1.3)$$
³ We also observe an empirical power-law trend with the training compute C (Figure 1) while training at fixed batch size, but it is the trend with C_min that should be used to make predictions. They are related by Equation (5.5).
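To make these fits concrete, here is a minimal sketch (ours, not code from the paper) that evaluates Equations (1.1)-(1.3) with the quoted constants; all variable and function names are our own:

```python
# Minimal sketch: evaluate the single-variable scaling fits (1.1)-(1.3).
# Constants are the fitted values quoted above; names are ours, not the paper's.
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # tokens
ALPHA_C, C_C = 0.050, 3.1e8    # PF-days (C_min fit)

def loss_from_params(n):        # L(N), Eq. (1.1)
    return (N_C / n) ** ALPHA_N

def loss_from_data(d):          # L(D), Eq. (1.2)
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_min):   # L(C_min), Eq. (1.3)
    return (C_C / c_min) ** ALPHA_C

# e.g. what the fit predicts for a 1.5e9-parameter model trained on ample data:
print(round(loss_from_params(1.5e9), 2))  # ~2.3 nats/token
```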
*[Figure 4 panels: loss vs. tokens in dataset for model sizes from 393.2K to 708M parameters (left), and loss vs. estimated S_min for model sizes from 10⁶ to 10⁸ parameters (right).]*
Figure 4 Left : The early-stopped test loss L ( N,D ) varies predictably with the dataset size D and model size N according to Equation (1.5). Right : After an initial transient period, learning curves for all model sizes N can be fit with Equation (1.6), which is parameterized in terms of S min , the number of steps when training at large batch size (details in Section 5.1).
These relations hold across eight orders of magnitude in C_min, six orders of magnitude in N, and over two orders of magnitude in D. They depend very weakly on model shape and other Transformer hyperparameters (depth, width, number of self-attention heads), with specific numerical values associated with the WebText2 training set [RWC+19]. The power laws α_N, α_D, α_C^min specify the degree of performance improvement expected as we scale up N, D, or C_min; for example, doubling the number of parameters yields a loss that is smaller by a factor 2^(−α_N) ≈ 0.95. The precise numerical values of N_c, C_c^min, and D_c depend on the vocabulary size and tokenization and hence do not have a fundamental meaning.
The critical batch size, which determines the speed/efficiency tradeoff for data parallelism ([MKAT18]), also roughly obeys a power law in L :
$$B_{\mathrm{crit}}(L) = \frac{B_*}{L^{1/\alpha_B}}, \quad B_* \sim 2 \times 10^{8} \text{ (tokens)}, \quad \alpha_B \sim 0.21 \quad (1.4)$$
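As a one-line helper (our own sketch, using the B_* and α_B values quoted in Equation (1.4)):

```python
def critical_batch_tokens(loss, b_star=2e8, alpha_b=0.21):
    """Critical batch size B_crit(L) ~ B* / L^(1/alpha_B), in tokens (Eq. 1.4 fit)."""
    return b_star / loss ** (1.0 / alpha_b)
```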
Equations (1.1) and (1.2) together suggest that as we increase the model size, we should increase the dataset size sublinearly according to D ∝ N^(α_N/α_D) ∼ N^0.74. In fact, we find that there is a single equation combining (1.1) and (1.2) that governs the simultaneous dependence on N and D and governs the degree of overfitting:
$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D} \right]^{\alpha_D} \quad (1.5)$$
with fits pictured on the left in Figure 4. We conjecture that this functional form may also parameterize the trained log-likelihood for other generative modeling tasks.
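Written out as a small helper (our own sketch; the default constants are the fitted values from Section 1.2), Equation (1.5) is:

```python
def loss_nd(n, d, alpha_n=0.076, alpha_d=0.095, n_c=8.8e13, d_c=5.4e13):
    """Combined fit L(N, D) of Eq. (1.5) for N parameters and D tokens."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d
```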
When training a given model for a finite number of parameter update steps S in the infinite data limit, after an initial transient period, the learning curves can be accurately fit by (see the right of Figure 4)
$$L(N, S) = \left( \frac{N_c}{N} \right)^{\alpha_N} + \left( \frac{S_c}{S_{\min}(S)} \right)^{\alpha_S} \quad (1.6)$$
where S_c ≈ 2.1 × 10^3 and α_S ≈ 0.76, and S_min(S) is the minimum possible number of optimization steps (parameter updates) estimated using Equation (5.4).
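The corresponding helper for Equation (1.6) (again our own sketch; S_min must be supplied as estimated in Section 5.1):

```python
def loss_ns(n, s_min, alpha_n=0.076, alpha_s=0.76, n_c=8.8e13, s_c=2.1e3):
    """Learning-curve fit L(N, S) of Eq. (1.6), with S_min in parameter updates."""
    return (n_c / n) ** alpha_n + (s_c / s_min) ** alpha_s
```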
When training within a fixed compute budget C , but with no other constraints, Equation (1.6) leads to the prediction that the optimal model size N , optimal batch size B , optimal number of steps S , and dataset size D should grow as
$$N \propto C_{\min}^{\alpha_C^{\min}/\alpha_N}, \quad B \propto C_{\min}^{\alpha_C^{\min}/\alpha_B}, \quad S \propto C_{\min}^{\alpha_C^{\min}/\alpha_S}, \quad D = B \cdot S \quad (1.7)$$
with
$$\alpha_C^{\min} = 1 / \left( 1/\alpha_S + 1/\alpha_B + 1/\alpha_N \right) \quad (1.8)$$
which closely matches the empirically optimal results N ∝ C_min^0.73, B ∝ C_min^0.24, and S ∝ C_min^0.03. As the computational budget C increases, it should be spent primarily on larger models, without dramatic increases in training time or dataset size (see Figure 3). This also implies that as models grow larger, they become increasingly sample-efficient. In practice, researchers typically train smaller models for longer than would be maximally compute-efficient because of hardware constraints. Optimal performance depends on total compute as a power law (see Equation (1.3)).
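A rough sketch of what this allocation implies in practice, using the empirically quoted exponents N ∝ C_min^0.73, B ∝ C_min^0.24, S ∝ C_min^0.03 (the function and the unit reference point are our own, not from the paper):

```python
# Sketch: scale a known-good allocation (N0, B0, S0) to a c_ratio-times-larger
# compute budget, using the empirically fitted exponents from Section 1.2.
def scale_allocation(c_ratio, n0=1.0, b0=1.0, s0=1.0):
    return n0 * c_ratio**0.73, b0 * c_ratio**0.24, s0 * c_ratio**0.03

# e.g. with 1000x more compute, the model should grow ~150x, the batch size ~5x,
# and the number of serial steps only ~1.2x:
print(scale_allocation(1e3))
```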
We provide some basic theoretical motivation for Equation (1.5), an analysis of learning curve fits and their implications for training time, and a breakdown of our results per token. We also make some brief comparisons to LSTMs and recurrent Transformers [DGV + 18].
## 1.3 Notation
We use the following notation:
- L - the cross entropy loss in nats. Typically it will be averaged over the tokens in a context, but in some cases we report the loss for specific tokens within the context.
- N - the number of model parameters, excluding all vocabulary and positional embeddings
- C ≈ 6NBS - an estimate of the total non-embedding training compute, where B is the batch size, and S is the number of training steps (i.e., parameter updates). We quote numerical values in PF-days, where one PF-day = 10^15 × 24 × 3600 = 8.64 × 10^19 floating point operations (a short sketch of this bookkeeping follows this list).
- D - the dataset size in tokens
- B crit - the critical batch size [MKAT18], defined and discussed in Section 5.1. Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency.
- C min - an estimate of the minimum amount of non-embedding compute to reach a given value of the loss. This is the training compute that would be used if the model were trained at a batch size much less than the critical batch size.
- S min - an estimate of the minimal number of training steps needed to reach a given value of the loss. This is also the number of training steps that would be used if the model were trained at a batch size much greater than the critical batch size.
- α_X - power-law exponents for the scaling of the loss as L(X) ∝ 1/X^(α_X), where X can be any of N, D, C, S, B, C_min.
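As an illustration of the compute bookkeeping above (our own sketch, using the C ≈ 6NBS estimate and the PF-day conversion; the example numbers are the batch size and step count used for most runs in Sections 2.2 and 3):

```python
# Sketch: training-compute estimate C ~ 6*N*B*S, converted to PF-days.
PF_DAY_FLOPS = 8.64e19  # 10^15 FLOP/s * 24 * 3600 s

def train_compute_pf_days(n_params, batch_tokens, steps):
    """Non-embedding training compute for N parameters, batch B (tokens), S steps."""
    return 6.0 * n_params * batch_tokens * steps / PF_DAY_FLOPS

# e.g. a 1.5e9-parameter model at batch 2**19 tokens for 2.5e5 steps:
print(round(train_compute_pf_days(1.5e9, 2**19, 2.5e5), 1))  # ~13.7 PF-days
```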
## 2 Background and Methods
We train language models on WebText2, an extended version of the WebText [RWC + 19] dataset, tokenized using byte-pair encoding [SHB15] with a vocabulary size n vocab = 50257 . We optimize the autoregressive log-likelihood (i.e. cross-entropy loss) averaged over a 1024-token context, which is also our principal performance metric. We record the loss on the WebText2 test distribution and on a selection of other text distributions. We primarily train decoder-only [LSP + 18, RNSS18] Transformer [VSP + 17] models, though we also train LSTM models and Universal Transformers [DGV + 18] for comparison.
## 2.1 Parameter and Compute Scaling of Transformers
We parameterize the Transformer architecture using hyperparameters n layer (number of layers), d model (dimension of the residual stream), d ff (dimension of the intermediate feed-forward layer), d attn (dimension of the attention output), and n heads (number of attention heads per layer). We include n ctx tokens in the input context, with n ctx = 1024 except where otherwise noted.
We use N to denote the model size, which we define as the number of non-embedding parameters
$$N \approx 2 \, d_{\mathrm{model}} \, n_{\mathrm{layer}} \left( 2 d_{\mathrm{attn}} + d_{\mathrm{ff}} \right) \quad (2.1)$$
where we have excluded biases and other sub-leading terms. Our models also have n vocab d model parameters in an embedding matrix, and use n ctx d model parameters for positional embeddings, but we do not include these when discussing the 'model size' N ; we will see that this produces significantly cleaner scaling laws.
Evaluating a forward pass of the Transformer involves roughly
$$C_{\mathrm{forward}} \approx 2 N + 2 \, n_{\mathrm{layer}} \, n_{\mathrm{ctx}} \, d_{\mathrm{model}} \quad (2.2)$$
add-multiply operations, where the factor of two comes from the multiply-accumulate operation used in matrix multiplication. A more detailed per-operation parameter and compute count is included in Table 1.
Table 1 Parameter counts and compute (forward pass) estimates for a Transformer model. Sub-leading terms such as nonlinearities, biases, and layer normalization are omitted.
| Operation | Parameters | FLOPs per Token |
|---|---|---|
| Embed | (n_vocab + n_ctx) d_model | 4 d_model |
| Attention: QKV | 3 n_layer d_model d_attn | 6 n_layer d_model d_attn |
| Attention: Mask | — | 2 n_layer n_ctx d_attn |
| Attention: Project | n_layer d_attn d_model | 2 n_layer d_attn d_model |
| Feedforward | 2 n_layer d_model d_ff | 4 n_layer d_model d_ff |
| De-embed | — | 2 d_model n_vocab |
| Total (Non-Embedding) | N = 2 d_model n_layer (2 d_attn + d_ff) | C_forward = 2N + 2 n_layer n_ctx d_attn |
For contexts and models with d_model > n_ctx/12, the context-dependent computational cost per token is a relatively small fraction of the total compute. Since we primarily study models where d_model ≫ n_ctx/12, we do not include context-dependent terms in our training compute estimate. Accounting for the backwards pass (approximately twice the compute of the forwards pass), we then define the estimated non-embedding compute as C ≈ 6N floating point operations per training token.
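The Table 1 bookkeeping can be reproduced in a few lines (our own sketch; the helper assumes the common choice d_attn = d_model and d_ff = 4 d_model unless overridden):

```python
# Sketch: non-embedding parameter count and forward-pass compute per token,
# following the totals in Table 1.
def transformer_counts(n_layer, d_model, n_ctx=1024, d_attn=None, d_ff=None):
    d_attn = d_attn if d_attn is not None else d_model
    d_ff = d_ff if d_ff is not None else 4 * d_model
    n = 2 * d_model * n_layer * (2 * d_attn + d_ff)    # ~ 12 * n_layer * d_model**2
    c_forward = 2 * n + 2 * n_layer * n_ctx * d_attn   # add-multiplies per token
    return n, c_forward

# e.g. the (n_layer, d_model) = (48, 1600) shape mentioned in Figure 5:
n, c_fwd = transformer_counts(48, 1600)
print(f"{n / 1e9:.2f}B non-embedding parameters")  # ~1.47B
```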
## 2.2 Training Procedures
Unless otherwise noted, we train models with the Adam optimizer [KB14] for a fixed 2.5 × 10^5 steps with a batch size of 512 sequences of 1024 tokens. Due to memory constraints, our largest models (more than 1B parameters) were trained with Adafactor [SS18]. We experimented with a variety of learning rates and schedules, as discussed in Appendix D.6. We found that results at convergence were largely independent of learning rate schedule. Unless otherwise noted, all training runs included in our data used a learning rate schedule with a 3000 step linear warmup followed by a cosine decay to zero.
## 2.3 Datasets
We train our models on an extended version of the WebText dataset described in [RWC+19]. The original WebText dataset was a web scrape of outbound links from Reddit through December 2017 which received at least 3 karma. In the second version, WebText2, we added outbound Reddit links from the period of January to October 2018, also with a minimum of 3 karma. The karma threshold served as a heuristic for whether people found the link interesting or useful. The text of the new links was extracted with the Newspaper3k python library. In total, the dataset consists of 20.3M documents containing 96 GB of text and 1.62 × 10^10 words (as defined by wc). We then apply the reversible tokenizer described in [RWC+19], which yields 2.29 × 10^10 tokens. We reserve 6.6 × 10^8 of these tokens for use as a test set, and we also test on similarly prepared samples of Books Corpus [ZKZ+15], Common Crawl [Fou], English Wikipedia, and a collection of publicly-available Internet Books.
## 3 Empirical Results and Basic Power Laws
To characterize language model scaling we train a wide variety of models, varying a number of factors including:
- Model size (ranging from 768 to 1.5 billion non-embedding parameters)
- Dataset size (ranging from 22 million to 23 billion tokens)
- Shape (including depth, width, attention heads, and feed-forward dimension)
- Context length (1024 for most runs, though we also experiment with shorter contexts)
- Batch size (2^19 for most runs, but we also vary it to measure the critical batch size)
*[Figure 5 panels: loss increase (%) vs. feed-forward ratio d_ff/d_model, aspect ratio d_model/n_layer, and attention head dimension d_model/n_head, for models from 50M to 1.5B parameters; a wide range of architectures achieve similar performance.]*
Figure 5 Performance depends very mildly on model shape when the total number of non-embedding parameters N is held fixed. The loss varies only a few percent over a wide range of shapes. Small differences in parameter counts are compensated for by using the fit to L(N) as a baseline. Aspect ratio in particular can vary by a factor of 40 while only slightly impacting performance; an (n_layer, d_model) = (6, 4288) model reaches a loss within 3% of the (48, 1600) model used in [RWC+19].
*[Figure 6 panels: test loss vs. parameter count including embeddings (left) and excluding embeddings (right), with curves grouped by number of layers (1, 2, 3, 6, >6).]*
Figure 6 Left: When we include embedding parameters, performance appears to depend strongly on the number of layers in addition to the number of parameters. Right: When we exclude embedding parameters, the performance of models with different depths converge to a single trend. Only models with fewer than 2 layers or with extreme depth-to-width ratios deviate significantly from the trend.
In this section we will display data along with empirically-motivated fits, deferring theoretical analysis to later sections.
## 3.1 Approximate Transformer Shape and Hyperparameter Independence
Transformer performance depends very weakly on the shape parameters n_layer, n_heads, and d_ff when we hold the total non-embedding parameter count N fixed. To establish these results we trained models with fixed size while varying a single hyperparameter. This was simplest for the case of n_heads. When varying n_layer, we simultaneously varied d_model while keeping N ≈ 12 n_layer d_model^2 fixed. Similarly, to vary d_ff at fixed model size we also simultaneously varied the d_model parameter, as required by the parameter counts in Table 1. Independence of n_layer would follow if deeper Transformers effectively behave as ensembles of shallower models, as has been suggested for ResNets [VWB16]. The results are shown in Figure 5.
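One way to realize this protocol (our own sketch, not the authors' code) is to solve N ≈ 12 n_layer d_model^2 for the width at each depth:

```python
# Sketch: co-vary d_model with n_layer so the non-embedding parameter count
# N ~ 12 * n_layer * d_model**2 stays (approximately) fixed.
def d_model_for_fixed_n(n_target, n_layer):
    return round((n_target / (12 * n_layer)) ** 0.5)

n_target = 12 * 48 * 1600**2  # roughly 1.5B non-embedding parameters
for n_layer in (6, 12, 24, 48, 96):
    print(n_layer, d_model_for_fixed_n(n_target, n_layer))
# e.g. 6 layers -> width ~4525, 48 layers -> 1600, 96 layers -> ~1131
```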
## 3.2 Performance with Non-Embedding Parameter Count N
In Figure 6 we display the performance of a wide variety of models, ranging from small models with shape ( n layer , d model ) = (2 , 128) through billion-parameter models, ranging in shape from (6 , 4288) through (207 , 768) . Here we have trained to near convergence on the full WebText2 dataset and observe no overfitting (except possibly for the very largest models).
As shown in Figure 1, we find a steady trend with non-embedding parameter count N , which can be fit to the first term of Equation (1.5), so that
$$L(N) \approx \left( N_c / N \right)^{\alpha_N} \quad (3.1)$$
Figure 7
*[Figure 7 panels: test loss vs. non-embedding parameters for Transformers and 1-, 2-, and 4-layer LSTMs (left), and per-token test loss vs. token index in the context for several model sizes (right); the LSTM plateaus after roughly 100 tokens while the Transformer keeps improving through the whole context.]*
To observe these trends it is crucial to study performance as a function of N ; if we instead use the total parameter count (including the embedding parameters) the trend is somewhat obscured (see Figure 6). This suggests that the embedding matrix can be made smaller without impacting performance, as has been seen in recent work [LCG + 19].
Although these models have been trained on the WebText2 dataset, their test loss on a variety of other datasets is also a power-law in N with nearly identical power, as shown in Figure 8.
## 3.2.1 Comparing to LSTMs and Universal Transformers
In Figure 7 we compare LSTM and Transformer performance as a function of non-embedding parameter count N. The LSTMs were trained with the same dataset and context length. We see from these figures that the LSTMs perform as well as Transformers for tokens appearing early in the context, but cannot match the Transformer performance for later tokens. We present power-law relationships between performance and context position in Appendix D.5, where increasingly large powers for larger models suggest improved ability to quickly recognize patterns.
We also compare the performance of standard Transformers to recurrent Transformers [DGV + 18] in Figure 17 in the appendix. These models re-use parameters, and so perform slightly better as a function of N , at the cost of additional compute per-parameter.
## 3.2.2 Generalization Among Data Distributions
We have also tested our models on a set of additional text data distributions. The test loss on these datasets as a function of model size is shown in Figure 8; in all cases the models were trained only on the WebText2 dataset. We see that the loss on these other data distributions improves smoothly with model size, in direct parallel with the improvement on WebText2. We find that generalization depends almost exclusively on the in-distribution validation loss, and does not depend on the duration of training or proximity to convergence. We also observe no dependence on model depth (see Appendix D.8).
## 3.3 Performance with Dataset Size and Compute
We display empirical trends for the test loss as a function of dataset size D (in tokens) and training compute C in Figure 1.
For the trend with D we trained a model with (n_layer, n_embd) = (36, 1280) on fixed subsets of the WebText2 dataset. We stopped training once the test loss ceased to decrease. We see that the resulting test losses can be fit with a simple power-law
$$L(D) \approx \left( D_c / D \right)^{\alpha_D} \quad (3.2)$$
in the dataset size. The data and fit appear in Figure 1.
The total amount of non-embedding compute used during training can be estimated as C = 6NBS, where B is the batch size, S is the number of parameter updates, and the factor of 6 accounts for the forward and backward passes. Thus for a given value of C we can scan over all models with various N to find the model with the best performance on step S = C/(6NB).
*[Figure 8 panels: test loss vs. non-embedding parameters on WebText2 (test), Internet Books, Books, Wikipedia, and Common Crawl (left), and loss on other distributions vs. test loss on the training distribution, during training and at convergence (right).]*
Figure 8 Left: Generalization performance to other data distributions improves smoothly with model size, with only a small and very slowly growing offset from the WebText2 training distribution. Right: Generalization performance depends only on training distribution performance, and not on the phase of training. We compare generalization of converged models (points) to that of a single large model (dashed curves) as it trains.
Note that in these results the batch size B remains fixed for all models, which means that these empirical results are not truly optimal. We will account for this in later sections using an adjusted C_min to produce cleaner trends.
The result appears as the heavy black line on the left-hand plot in Figure 1. It can be fit with
$$L(C) \approx \left( C_c / C \right)^{\alpha_C} \quad (3.3)$$
The figure also includes images of individual learning curves to clarify when individual models are optimal. We will study the optimal allocation of compute more closely later on. The data strongly suggests that sample efficiency improves with model size, and we also illustrate this directly in Figure 19 in the appendix.
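A rough sketch of such a scan (ours, not the paper's procedure): it reuses the L(N, S) fit of Equation (1.6) as a stand-in for the empirical learning curves and, like the empirical results above, keeps the batch size fixed rather than applying the C_min adjustment:

```python
import numpy as np

ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_S, S_C = 0.76, 2.1e3
PF_DAY_FLOPS = 8.64e19

def best_model_size(c_pf_days, batch_tokens=2**19):
    """Scan model sizes N and return the one with the lowest predicted loss
    after S = C / (6 N B) steps of training at a fixed batch size."""
    c_flops = c_pf_days * PF_DAY_FLOPS
    sizes = np.logspace(6, 10, 200)                  # candidate N
    steps = c_flops / (6.0 * sizes * batch_tokens)   # S = C / (6 N B)
    losses = (N_C / sizes) ** ALPHA_N + (S_C / steps) ** ALPHA_S
    return sizes[np.argmin(losses)]

print(f"{best_model_size(1.0):.2e}")  # predicted best N for a 1 PF-day budget
```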
## 4 Charting the Infinite Data Limit and Overfitting
In Section 3 we found a number of basic scaling laws for language modeling performance. Here we will study the performance of a model of size N trained on a dataset with D tokens while varying N and D simultaneously. We will empirically demonstrate that the optimally trained test loss accords with the scaling law of Equation (1.5). This provides guidance on how much data we would need to train models of increasing size while keeping overfitting under control.
## 4.1 Proposed L ( N,D ) Equation
We have chosen the parameterization (1.5) (repeated here for convenience):
$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D} \right]^{\alpha_D}$$
using three principles:

1. Changes in vocabulary size or tokenization are expected to rescale the loss by an overall factor. The parameterization of L(N, D) (and all models of the loss) must naturally allow for such a rescaling.
2. Fixing D and sending N → ∞, the overall loss should approach L(D). Conversely, fixing N and sending D → ∞, the loss must approach L(N).
3. L(N, D) should be analytic at D = ∞, so that it has a series expansion in 1/D with integer powers. Theoretical support for this principle is significantly weaker than for the first two.

Our choice of L(N, D) satisfies the first requirement because we can rescale N_c, D_c with changes in the vocabulary. This also implies that the values of N_c, D_c have no fundamental meaning.
Figure 9 The early-stopped test loss L(N, D) depends predictably on the dataset size D and model size N according to Equation (1.5). Left: For large D, performance is a straight power law in N. For a smaller fixed D, performance stops improving as N increases and the model begins to overfit. (The reverse is also true, see Figure 4.) Right: The extent of overfitting depends predominantly on the ratio N^(α_N/α_D)/D, as predicted in Equation (4.3). The line is our fit to that equation.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Line Charts: Model Performance vs Data Size and Overfitting
### Overview
The image contains two line charts comparing model performance metrics across different data sizes. The left chart ("Data Size Bottleneck") shows test loss trends as model parameters increase, while the right chart ("Overfitting") illustrates overfitting severity relative to a scaling parameter. Both charts use color-coded data series for data sizes ranging from 21M to 22.0B.
### Components/Axes
**Left Chart ("Data Size Bottleneck"):**
- **X-axis**: "Params (non-embed)" (log scale, 10⁶ to 10⁹)
- **Y-axis**: "Test Loss" (linear scale, 2.5 to 4.5)
- **Legend**:
- Colors: Purple (21M), Dark Blue (43M), Blue (86M), Teal (172M), Green (344M), Lime (688M), Yellow (1.4B), Orange (22.0B)
- Position: Right side of chart
- **Markers**: Circular dots with solid/dotted lines
**Right Chart ("Overfitting"):**
- **X-axis**: "N^(α_N/α_D)/D" (log scale, 10⁻⁴ to 10⁻¹)
- **Y-axis**: "L/L(D=8)-1" (linear scale, 0 to 0.5)
- **Legend**: Same color scheme as left chart
- **Markers**: Circular dots with solid lines
### Detailed Analysis
**Left Chart Trends:**
1. All data series show decreasing test loss with increasing parameters
2. Larger data sizes (yellow/1.4B, orange/22.0B) achieve lower test loss
3. Test loss plateaus at ~3.0 for 22.0B data size at 10⁹ parameters
4. Smaller data sizes (purple/21M) show higher test loss (~4.5 at 10⁶ params)
**Right Chart Trends:**
1. Overfitting (y-axis) increases with larger scaling parameter (x-axis)
2. Larger data sizes show steeper overfitting curves
3. 22.0B data size reaches ~0.5 overfitting at x=10⁻¹
4. Smaller data sizes (purple/21M) show minimal overfitting (<0.1)
### Key Observations
1. **Data Size Impact**: Larger datasets consistently show better generalization (lower test loss) and more pronounced overfitting
2. **Parameter Efficiency**: Models with 10⁸-10⁹ parameters achieve optimal performance for 22.0B data size
3. **Overfitting Threshold**: Data sizes above 1.4B show significant overfitting (>0.3) at moderate parameter scales
4. **Scaling Relationship**: Overfitting severity tracks the combined ratio N^(α_N/α_D)/D
### Interpretation
The charts demonstrate a critical trade-off in model training:
1. **Data Efficiency**: Smaller datasets (21M-86M) require fewer parameters to reach convergence but suffer from higher test loss
2. **Overfitting Paradox**: Larger datasets (1.4B-22.0B) enable better performance but require careful parameter scaling to avoid overfitting
3. **Optimal Zone**: For 22.0B data, parameters between 10⁸-10⁹ achieve best test loss with manageable overfitting (<0.3)
4. **Scaling Law**: The N^(α_N/α_D)/D parameter suggests a power-law relationship between data size and optimal model capacity
The data implies that model architecture should scale sub-linearly with data size to maintain generalization, with particular attention to the N^(α_N/α_D)/D parameter as a critical scaling factor for preventing overfitting in large-scale training.
</details>
Since we stop training early when the test loss ceases to improve and optimize all models in the same way, we expect that larger models should always perform better than smaller models. But with fixed finite D , we also do not expect any model to be capable of approaching the best possible loss (ie the entropy of text). Similarly, a model with fixed size will be capacity-limited. These considerations motivate our second principle. Note that knowledge of L ( N ) at infinite D and L ( D ) at infinite N fully determines all the parameters in L ( N,D ) .
The third principle is more speculative. There is a simple and general reason one might expect overfitting to scale ∝ 1 /D at very large D . Overfitting should be related to the variance or the signal-to-noise ratio of the dataset [AS17], and this scales as 1 /D . This expectation should hold for any smooth loss function, since we expect to be able to expand the loss about the D →∞ limit. However, this argument assumes that 1 /D corrections dominate over other sources of variance, such as the finite batch size and other limits on the efficacy of optimization. Without empirical confirmation, we would not be very confident of its applicability.
Our third principle explains the asymmetry between the roles of N and D in Equation (1.5). Very similar symmetric expressions 4 are possible, but they would not have a 1 /D expansion with integer powers, and would require the introduction of an additional parameter.
In any case, we will see that our equation for L ( N,D ) fits the data well, which is the most important justification for our L ( N,D ) ansatz.
## 4.2 Results
We regularize all our models with 10% dropout, and by tracking test loss and stopping once it is no longer decreasing. The results are displayed in Figure 9, including a fit to the four parameters α N , α D , N c , D c in Equation (1.5):
Table 2 Fits to L ( N,D )
| Parameter | α_N | α_D | N_c | D_c |
|-----------|-------|-------|-------------|-------------|
| Value | 0.076 | 0.103 | 6.4 × 10^13 | 1.8 × 10^13 |
We obtain an excellent fit, with the exception of the runs where the dataset has been reduced by a factor of 1024 , to about 2 × 10 7 tokens. With such a small dataset, an epoch consists of only 40 parameter updates. Perhaps such a tiny dataset represents a different regime for language modeling, as overfitting happens very early in training (see Figure 16). Also note that the parameters differ very slightly from those obtained in Section 3, as here we are fitting the full L ( N,D ) rather than just L ( N, ∞ ) or L ( ∞ , D ) .
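To make the fitted form concrete, here is a minimal Python sketch (ours, not the authors' code) that evaluates the Equation (1.5) parameterization with the Table 2 values; the example N and D values are purely illustrative.

```python
# Sketch: evaluate the L(N, D) fit of Equation (1.5) with the Table 2 parameters.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13          # non-embedding parameters, tokens

def loss_n_d(n_params: float, d_tokens: float) -> float:
    """Early-stopped test loss predicted by Equation (1.5)."""
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / d_tokens) ** ALPHA_D

if __name__ == "__main__":
    for n in (1e6, 1e8, 1e9):                  # illustrative model sizes
        for d in (2e7, 1e9, 2.2e10):           # illustrative dataset sizes
            print(f"N={n:.0e}, D={d:.0e}: L ~ {loss_n_d(n, d):.3f}")
```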
To chart the borderlands of the infinite data limit, we can directly study the extent of overfitting. For all but the largest models, we see no sign of overfitting when training with the full 22B token WebText2 dataset, so we can take it as representative of D = ∞ . Thus we can compare finite D to the infinite data limit by
4 For example, one might have used L ( N,D ) = [( N c N ) α N + ( D c D ) α D ] β , but this does not have a 1 /D expansion.
Figure 10 The critical batch size B crit follows a power law in the loss as performance increases, and does not depend directly on the model size. We find that the critical batch size approximately doubles for every 13% decrease in loss. B crit is measured empirically from the data shown in Figure 18, but it is also roughly predicted by the gradient noise scale, as in [MKAT18].
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Line Chart: Critical Batch Size vs. Performance
### Overview
The chart illustrates the relationship between WebText2 training loss (x-axis) and critical batch size in tokens (y-axis). It compares empirical critical batch sizes for two model sizes (N=3M and N=85M) against a theoretical power-law model, alongside noise scale measurements.
### Components/Axes
- **X-axis (WebText2 Train Loss)**: Logarithmic scale from 10¹ to 3×10⁰ (10 to 3).
- **Y-axis (Critical Batch Size)**: Logarithmic scale from 10³ to 10⁶ tokens.
- **Legend**:
- Blue line: Empirical B_crit for N=3M.
- Orange line: Empirical B_crit for N=85M.
- Dashed line: Theoretical B_crit = 2.1×10⁸ tokens·L⁻⁴.⁸.
- Green dots: Noise Scale Measurement.
### Detailed Analysis
1. **Empirical B_crit Lines**:
- **N=3M (Blue)**: Starts near 10³ tokens at 10¹ loss, rises sharply to ~10⁵ tokens at 3×10⁰ loss. A notable peak (~10⁵ tokens) occurs at ~4×10⁰ loss.
- **N=85M (Orange)**: Begins higher (~10⁴ tokens at 10¹ loss) and follows a steeper upward trend, reaching ~10⁶ tokens at 3×10⁰ loss.
2. **Theoretical Model (Dashed Line)**:
- Follows a power-law decay (B_crit ∝ L⁻⁴.⁸). Empirical lines closely align with this trend, validating the theoretical relationship.
3. **Noise Scale Measurements (Green Dots)**:
- Scattered across the plot, predominantly below the empirical lines. Concentrations near 10³–10⁴ tokens at lower loss values (~10¹–6×10⁰).
### Key Observations
- **Trend Verification**:
- Both empirical lines slope upward as loss decreases, confirming that lower training loss correlates with larger critical batch sizes.
- N=85M consistently requires larger batch sizes than N=3M, with a ~10× difference at 3×10⁰ loss.
- **Outliers/Anomalies**:
- The blue line’s peak at ~4×10⁰ loss (~10⁵ tokens) deviates from the general trend, suggesting potential instability or measurement noise.
- **Noise Distribution**:
- Green dots cluster at lower batch sizes, indicating variability in smaller-scale experiments.
### Interpretation
- **Model Scaling**: The data demonstrates that larger models (N=85M) demand significantly larger batch sizes to maintain performance, aligning with the power-law relationship. This implies diminishing returns in batch size efficiency as model complexity increases.
- **Theoretical Validation**: The empirical lines’ adherence to the dashed theoretical curve (B_crit = 2.1×10⁸ tokens·L⁻⁴.⁸) confirms the validity of the power-law model for predicting critical batch sizes.
- **Practical Implications**: The noise measurements highlight the need for robust experimental design, as smaller batch sizes (green dots) may represent edge cases or suboptimal configurations.
- **Anomaly Investigation**: The blue line’s peak warrants further scrutiny—it could reflect a transient instability or an outlier in the dataset.
This analysis underscores the importance of batch size scaling in training large language models and validates the theoretical framework for optimizing training efficiency.
</details>
defining
$$\delta L ( N , D ) \equiv \frac { L ( N , D ) } { L ( N , \infty ) } - 1 \quad ( 4 . 2 )$$
and studying it as a function of N,D . In fact, we see empirically that δL depends only on a specific combination of N and D , as shown in Figure 16. This follows from the scaling law of Equation (1.5), which implies
$$\delta L \approx \left ( 1 + \left ( \frac { N } { N _ { c } } \right ) ^ { \frac { \alpha _ { N } } { \alpha _ { D } } } \frac { D _ { c } } { D } \right ) ^ { \alpha _ { D } } - 1 \quad ( 4 . 3 )$$
Note that at large D this formula also has a series expansion in powers of 1 /D .
We estimate that the variation in the loss with different random seeds is roughly 0 . 02 , which means that to avoid overfitting when training to within that threshold of convergence we require
$$D \gtrsim \left ( 5 \times 1 0 ^ { 3 } \right ) \, N ^ { 0 . 7 4 } \ \text {tokens} \quad ( 4 . 4 )$$
With this relation, models smaller than 10 9 parameters can be trained with minimal overfitting on the 22B token WebText2 dataset, but our largest models will encounter some mild overfitting. More generally, this relation shows that dataset size may grow sub-linearly in model size while avoiding overfitting. Note however that this does not typically represent maximally compute-efficient training. We should also emphasize that we have not optimized regularization (eg the dropout probability) while varying dataset and model size.
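As a sketch of where such a constraint comes from, one can invert Equation (4.3) with the Table 2 fit to find the smallest dataset that keeps δL below the ≈ 0.02 run-to-run variation quoted above; the threshold and example model sizes below are illustrative.

```python
# Sketch: smallest D such that delta_L(N, D) <= delta, inverting Equation (4.3).
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def min_tokens(n_params: float, delta: float = 0.02) -> float:
    """Dataset size needed so overfitting stays within delta of the D -> infinity loss."""
    return (n_params / N_C) ** (ALPHA_N / ALPHA_D) * D_C / ((1 + delta) ** (1 / ALPHA_D) - 1)

if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10):
        print(f"N={n:.0e} params: D >= {min_tokens(n):.1e} tokens")
```

For N = 10^9 this gives roughly 2 × 10^10 tokens, in line with the statement above that models up to about 10^9 parameters can be trained on the 22B-token WebText2 dataset with minimal overfitting.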
## 5 Scaling Laws with Model Size and Training Time
In this section we will demonstrate that a simple scaling law provides a good description for the loss as a function of model size N and training time. First we will explain how to use the results of [MKAT18] to define a universal training step S min , which accounts for the fact that most of our models have not been trained at an optimal batch size. Then we will demonstrate that we can fit the model size and training time dependence of the loss using Equation (1.6). Later we will use these results to predict the optimal allocation of training compute between model size and training time, and then confirm that prediction.
## 5.1 Adjustment for Training at B crit ( L )
A simple empirical theory for the batch size dependence of training was developed in [MKAT18] (see also [SLA + 18, ZLN + 19]). It was argued that there is a critical batch size B crit for training; for B up to B crit the batch size can be increased with very minimal degradation in compute-efficiency, whereas for B > B crit increases in B result in diminishing returns. It was also argued that the gradient noise scale provides a simple
prediction for B crit , and that neither depends directly on model size except through the value of the loss that has been attained. These results can be used to predict how training time and compute will vary with the batch size. To utilize both training time and compute as effectively as possible, it is best to train with a batch size B ≈ B crit . Training at B ≫ B crit minimizes the number of training steps, while B ≪ B crit minimizes the use of compute.
More specifically, it was demonstrated that for a wide variety of neural network tasks, the number of training steps S and the number of data examples processed E = BS satisfy the simple relation
$$\left ( { \frac { S } { S _ { \min } } } - 1 \right ) \left ( { \frac { E } { E _ { \min } } } - 1 \right ) = 1 \quad ( 5 . 1 )$$
when training to any fixed value of the loss L . Here S min is the minimum number of steps necessary to reach L , while E min is the minimum number of data examples that must be processed.
We demonstrate the relation (5.1) for Transformers in Figure 18 in the appendix. This relation defines the critical batch size
$$B _ { \mathrm { c r i t } } ( L ) \equiv \frac { E _ { \min } } { S _ { \min } } \quad ( 5 . 2 )$$
which is a function of the target value of the loss. Training at the critical batch size makes a roughly optimal time/compute tradeoff, requiring 2 S min training steps and processing E = 2 E min data examples.
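As a small illustration (not code from the paper), Equation (5.1) can be parameterized by the batch size: with B_crit = E_min/S_min, training at batch size B costs S = S_min(1 + B_crit/B) steps and E = E_min(1 + B/B_crit) examples, so B = B_crit costs exactly 2 S_min steps and 2 E_min examples. The S_min and E_min values below are illustrative.

```python
# Sketch of the step/example trade-off implied by Equation (5.1).
def steps_and_examples(s_min: float, e_min: float, batch: float):
    b_crit = e_min / s_min                       # Equation (5.2)
    steps = s_min * (1 + b_crit / batch)         # satisfies (S/S_min - 1)(E/E_min - 1) = 1
    examples = e_min * (1 + batch / b_crit)
    return steps, examples

if __name__ == "__main__":
    s_min, e_min = 1e4, 1e10                     # illustrative values; B_crit here is 1e6 tokens
    for batch in (1e4, 1e6, 1e8):
        s, e = steps_and_examples(s_min, e_min, batch)
        print(f"B={batch:.0e}: S={s:.2e} steps, E={e:.2e} examples")
```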
In Figure 10 we have plotted the critical batch size and gradient noise scale 5 as a function of training loss for two different models. We see that B crit ( L ) is independent of model size, and only depends on the loss L . So the predictions of [MKAT18] continue to hold for Transformer language models. The critical batch size can be fit with a power-law in the loss
$$B _ { \mathrm { c r i t } } ( L ) \approx \frac { B _ { * } } { L ^ { 1 / \alpha _ { B } } } \quad ( 5 . 3 )$$
where $B_* \approx 2 \times 10^8$ and $\alpha_B \approx 0.21$.
We have chosen this parameterization for B crit ( L ) because as the loss approaches its minimum value L min , the gradient noise scale is expected to diverge, and we expect B crit to track this noise scale. We do not know L min , as we see no sign that our models are approaching it, but L min > 0 since the entropy of natural language is non-zero. Since apparently L min is much smaller than the values of L we have achieved, we used a parameterization where B crit diverges as L → 0 .
We will use B crit ( L ) to estimate the relation between the number of training steps S while training at batch size B = 2^19 tokens and the number of training steps while training at B ≫ B crit . This is simply
$$S _ { \min } ( S ) \equiv \frac { S } { 1 + B _ { \mathrm { c r i t } } ( L ) / B } \quad \text {(minimum number of steps)} \quad ( 5 . 4 )$$
for any given target value L for the loss. This also defines a critical value of the compute needed to train to L with a model of size N if we were to train at B ≪ B crit ( L ) . This is
$$C _ { \min } ( C ) \equiv \frac { C } { 1 + B / B _ { \mathrm { c r i t } } ( L ) } \quad \text {(minimum compute)} \quad ( 5 . 5 )$$
where C = 6 NBS estimates the (non-embedding) compute used at batch size B .
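A minimal sketch of these adjustments, assuming the fixed training batch size B = 2^19 tokens mentioned above and the B_crit(L) fit of Equation (5.3); the run statistics in the example are illustrative.

```python
# Sketch: batch-size adjustments of Equations (5.4) and (5.5).
B_STAR, ALPHA_B = 2e8, 0.21        # Equation (5.3) fit
B_FIXED = 2 ** 19                  # tokens per step

def b_crit(loss: float) -> float:
    return B_STAR / loss ** (1.0 / ALPHA_B)

def s_min(steps: float, loss: float, batch: float = B_FIXED) -> float:
    """Equation (5.4): steps had we trained at B >> B_crit(L)."""
    return steps / (1.0 + b_crit(loss) / batch)

def c_min(compute: float, loss: float, batch: float = B_FIXED) -> float:
    """Equation (5.5): compute had we trained at B << B_crit(L)."""
    return compute / (1.0 + batch / b_crit(loss))

if __name__ == "__main__":
    n_params, steps, loss = 1e8, 2.5e5, 3.0      # illustrative run
    compute = 6 * n_params * B_FIXED * steps     # C = 6 N B S, in FLOPs
    print(f"S_min ~ {s_min(steps, loss):.2e} steps")
    print(f"C_min ~ {c_min(compute, loss):.2e} FLOPs")
```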
## 5.2 Results for L ( N,S min ) and Performance with Model Size and Compute
Now we will use S min defined in Equation (5.4) to obtain a simple and universal fit for the dependence of the loss on model size and training time in the infinite data limit. We will fit the stable, Adam-optimized training runs using Equation (1.6), repeated here for convenience:
$$L ( N , S _ { \min } ) = \left ( \frac { N _ { c } } { N } \right ) ^ { \alpha _ { N } } + \left ( \frac { S _ { c } } { S _ { \min } } \right ) ^ { \alpha _ { S } } \quad ( 5 . 6 )$$
for the loss. We include all training steps after the warmup period of the learning rate schedule, and find a fit to the data with the parameters:
5 Although the critical batch size roughly matches the gradient noise scale, we use direct measurements of B crit from Figures 18 and 10 for all our later analyses.
Figure 11 When we hold either total compute or number of training steps fixed, performance follows L ( N,S ) from Equation (5.6). Each value of compute budget has an associated optimal model size that maximizes performance. Mediocre fits at small S are unsurprising, as the power-law equation for the learning curves breaks down very early in training.
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Line Charts: Performance vs Compute Budget and Performance vs Steps
### Overview
The image contains two side-by-side line charts comparing model performance (test loss) against computational resources. The left chart ("Performance vs Compute Budget") plots test loss against model parameters (non-embedding) with a color gradient representing PF-days (petaflop/s-days). The right chart ("Performance vs Steps") shows test loss against parameters with a color gradient for training steps. Both charts reveal trade-offs between model scale, performance, and resource consumption.
### Components/Axes
**Left Chart (Performance vs Compute Budget):**
- **X-axis**: Parameters (non-embedding) [10⁴ to 10⁸]
- **Y-axis**: Test Loss [2 to 8]
- **Legend**: PF-days [10⁻⁵ to 10⁰], color gradient from purple (low) to yellow (high)
- **Lines**: Multiple dotted/solid lines in purple, teal, green, and yellow
**Right Chart (Performance vs Steps):**
- **X-axis**: Parameters (non-embedding) [10⁶ to 10⁹]
- **Y-axis**: Test Loss [2.4 to 5.4]
- **Legend**: Steps [10⁴ to 10⁵], color gradient from purple (low) to yellow (high)
- **Lines**: Multiple dotted/solid lines in purple, teal, green, and yellow
### Detailed Analysis
**Left Chart Trends:**
- All lines show a **U-shaped curve**: Test loss decreases sharply as parameters increase from 10⁴ to ~10⁶, then plateaus or slightly increases.
- Higher PF-days (yellow lines) achieve lower test loss but require exponentially more compute (e.g., 10⁸ parameters at 10⁻³ PF-days vs. 10⁶ parameters at 10⁻⁵ PF-days).
- Dotted lines (likely baseline models) underperform compared to solid lines (optimized models).
**Right Chart Trends:**
- Test loss decreases monotonically with increasing parameters, but the rate of improvement slows after ~10⁷ parameters.
- Higher steps (yellow lines) achieve better performance but require orders of magnitude more training (e.g., 10⁹ parameters at 10⁵ steps vs. 10⁶ parameters at 10³ steps).
- Dotted lines again represent less efficient training trajectories.
### Key Observations
1. **Compute-Budget Trade-off**: Larger models (10⁸+ parameters) achieve ~30% lower test loss than smaller models but require 100–1000× more PF-days.
2. **Step Efficiency**: Models with 10⁷–10⁸ parameters achieve optimal performance with ~10⁴–10⁵ steps, suggesting diminishing returns beyond this scale.
3. **Dotted vs. Solid Lines**: Solid lines (optimized training) outperform dotted lines (baseline training) by 1–2 test loss units across all parameter ranges.
### Interpretation
The data demonstrates a **performance-resource Pareto frontier**:
- **Compute Budget**: Larger models improve performance but become prohibitively expensive (PF-days scale super-linearly with parameters).
- **Training Steps**: While larger models benefit from more steps, the marginal gain per step diminishes after ~10⁷ parameters.
- **Efficiency Gaps**: The ~2-test-loss-unit gap between dotted and solid lines highlights the importance of optimized training strategies (e.g., curriculum learning, regularization).
These charts suggest that for practical deployment, practitioners must balance model size against available compute resources. The optimal operating point appears to be ~10⁷ parameters with 10⁴–10⁵ steps, achieving ~3.0 test loss—a 20% improvement over baseline models at 10⁶ parameters.
</details>
Table 3 Fits to L ( N,S )
| Parameter | α_N | α_S | N_c | S_c |
|-----------|-------|------|-------------|------------|
| Value | 0.077 | 0.76 | 6.5 × 10^13 | 2.1 × 10^3 |
With these parameters, we obtain the learning curve fits in Figure 4. Though the fits are imperfect, we believe they are quite compelling given the simplicity of Equation (5.6).
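For concreteness, a short Python sketch (not the authors' code) evaluating Equation (5.6) with the Table 3 parameters; the example model size and step counts are illustrative.

```python
# Sketch: evaluate the L(N, S_min) fit of Equation (5.6) with the Table 3 parameters.
ALPHA_N, ALPHA_S = 0.077, 0.76
N_C, S_C = 6.5e13, 2.1e3

def loss_n_s(n_params: float, s_min: float) -> float:
    """Loss after S_min batch-adjusted steps, in the infinite-data limit."""
    return (N_C / n_params) ** ALPHA_N + (S_C / s_min) ** ALPHA_S

if __name__ == "__main__":
    for s in (3e3, 1e4, 3e4):
        print(f"N=1e8, S_min={s:.0e}: L ~ {loss_n_s(1e8, s):.3f}")
```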
The data and fits can be visualized in a different and more interesting way, as shown in Figure 11. There we study the test loss as a function of model size while fixing either the total non-embedding compute C used in training, or the number of steps S . For the fits we use Equation (5.5) and (5.4) along with the parameters above and Equation (5.6).
The power-law dependence of the loss on S min reflects the interplay of optimizer dynamics and the loss landscape. Since the fits are best late in training, when the loss may be approximately quadratic, the power-law should provide information about the spectrum of the Hessian of the loss. Its universality suggests that the Hessian eigenvalue density is roughly independent of model size.
## 5.3 Lower Bound on Early Stopping Step
The results for L ( N,S min ) can be used to derive a lower-bound (and rough estimate) of the step at which early stopping should occur when training is data limited. It is motivated by the idea that finite and infinite D learning curves for a given model will be very similar until we reach S min ≈ S stop . Thus overfitting should be proportional to the correction from simply ending training at S stop . This will underestimate S stop , because in reality the test loss will decrease more slowly when we have a finite D , and therefore we will require more training steps to reach the optimal test loss at finite D . This line of reasoning leads to the inequality
$$S _ { s t o p } ( N , D ) \gtrsim \frac { S _ { c } } { [ L ( N , D ) - L ( N , \infty ) ] ^ { 1 / \alpha _ { S } } }$$
where L ( N, ∞ ) is the converged loss, evaluated with infinite available data. This inequality and its comparison to the empirical data is displayed in Figure 16 in the appendix. In that figure, the values of S stop and L ( N,D ) are empirical (though S stop is adjusted to mimic training at B ≫ B crit ), while L ( N, ∞ ) is computed from the fit to L ( N,D ) evaluated at D = ∞ .
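A sketch of this bound, combining the L(N, D) fit of Table 2 with S_c and α_S from Table 3; the example N and D are illustrative, and the answer is in B_crit-adjusted steps.

```python
# Sketch: lower bound on the early-stopping step from Section 5.3.
ALPHA_N, ALPHA_D, N_C, D_C = 0.076, 0.103, 6.4e13, 1.8e13   # Table 2
ALPHA_S, S_C = 0.76, 2.1e3                                   # Table 3

def loss_n_d(n: float, d: float) -> float:
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

def s_stop_lower_bound(n: float, d: float) -> float:
    """S_stop >~ S_c / [L(N, D) - L(N, inf)]^(1/alpha_S)."""
    gap = loss_n_d(n, d) - loss_n_d(n, float("inf"))
    return S_C / gap ** (1.0 / ALPHA_S)

if __name__ == "__main__":
    print(f"N=3e8, D=1e9 tokens: S_stop >~ {s_stop_lower_bound(3e8, 1e9):.1e} steps")
```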
## 6 Optimal Allocation of the Compute Budget
We displayed the empirical trend of performance as a function of the computation used during training in the top-right of Figure 1. However, this result involved training at a fixed batch size B , whereas we know
Figure 12 Left: Given a fixed compute budget, a particular model size is optimal, though somewhat larger or smaller models can be trained with minimal additional compute. Right: Models larger than the compute-efficient size require fewer steps to train, allowing for potentially faster training if sufficient additional parallelism is possible. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve, after initial transient effects.
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Chart/Diagram Type: Dual-Plot Analysis of Model Efficiency
### Overview
The image contains two side-by-side graphs analyzing model efficiency. The left plot examines **excess compute** relative to deviation from an optimal model size, while the right plot analyzes **excess training steps** under the same deviation metric. Both graphs use logarithmic scales on the x-axis and linear/logarithmic scales on the y-axis, with annotations highlighting key insights.
---
### Components/Axes
#### Left Plot (Excess Compute)
- **Y-axis**: "Excess Compute (C/C_efficient)" (linear scale, 1.0–4.0)
- **X-axis**: "Deviation from Optimal Model (N/N_efficient)" (logarithmic scale, 10⁰–10¹)
- **Legend**: No explicit legend; single blue line represents the efficiency curve.
- **Annotations**:
- Text: "Models between 0.6x and 2.2x the optimal size can be trained with a 20% larger compute budget."
- Two arrows point to the minimum of the curve (at ~1.0 on the x-axis).
#### Right Plot (Excess Steps)
- **Y-axis**: "Excess Steps (S/S_efficient)" (logarithmic scale, 10⁰–10¹)
- **X-axis**: "Deviation from Optimal Model (N/N_efficient)" (logarithmic scale, 10⁰–10¹)
- **Legend**: No explicit legend; single blue line represents the efficiency curve.
- **Annotations**:
- Text: "Smaller models require more steps to train, while larger models require fewer."
- Arrow points to a specific deviation value (~1.5 on the x-axis).
- Dashed line with text: "Our framework does not capture early training dynamics."
---
### Detailed Analysis
#### Left Plot Trends
- The curve starts at ~4.0 excess compute when deviation is 10⁰ (optimal model size).
- Excess compute sharply decreases as deviation increases, reaching a minimum of ~1.0 at ~1.0 deviation.
- Beyond ~1.0 deviation, excess compute rises again, plateauing near ~2.5 at 10¹ deviation.
- The annotation indicates that models within 0.6x–2.2x the optimal size require only a 20% larger compute budget, suggesting a "sweet spot" for efficiency.
#### Right Plot Trends
- The curve starts at ~10¹ excess steps at 10⁰ deviation (optimal model size).
- Excess steps decrease logarithmically as deviation increases, dropping to ~10⁰ at ~1.5 deviation.
- Beyond ~1.5 deviation, the curve flattens, indicating diminishing returns in step efficiency.
- The annotation highlights that smaller models (deviation <1.0) require exponentially more steps, while larger models (deviation >1.0) require fewer steps.
---
### Key Observations
1. **Optimal Model Size**: Both plots converge at ~1.0 deviation, where compute and step efficiency are minimized.
2. **Compute vs. Steps Tradeoff**:
- Smaller models (deviation <1.0) require disproportionately more compute and steps.
- Larger models (deviation >1.0) show improved efficiency but face diminishing returns.
3. **Framework Limitation**: The dashed line in the right plot explicitly states that early training dynamics (e.g., initial learning phases) are not modeled.
---
### Interpretation
The data suggests that **model size optimization is critical for balancing compute and training efficiency**. The "sweet spot" (0.6x–2.2x optimal size) minimizes excess compute, but training steps remain highly sensitive to deviations below the optimal size. The framework’s inability to model early training dynamics implies potential gaps in understanding how model size impacts initial learning phases, which could affect real-world deployment strategies.
The logarithmic x-axis emphasizes that even small deviations from the optimal size have nonlinear impacts, particularly for smaller models. This aligns with the observation that smaller models require exponentially more resources, reinforcing the importance of targeting near-optimal sizes in practice.
</details>
Figure 13 When adjusting performance to simulate training far below the critical batch size, we find a somewhat altered power law for L ( C min ) when compared with the fully empirical results. The conspicuous lump at 10 -5 PF-days marks the transition from 1-layer to 2-layer networks; we exclude 1-layer networks in the power-law fits. It is the L ( C min ) trend that we expect to provide a reliable extrapolation for larger compute.
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Line Graph: Test Loss vs Compute (PF-days), non-embedding
### Overview
The image depicts a logarithmic-scale line graph comparing two test loss functions (L) against compute resources (PF-days) for non-embedding tasks. Two distinct loss functions are visualized: one based on `C_min` and another on `C`, with differing exponents and constants. The graph shows how test loss decreases as compute increases, with both lines converging at higher compute values.
### Components/Axes
- **Y-axis (Test Loss)**: Logarithmic scale ranging from 2 to 7.
- **X-axis (Compute, PF-days, non-embedding)**: Logarithmic scale from 10⁻⁸ to 10⁰.
- **Legend**: Located in the top-right corner, containing:
- **Blue dashed line**: `L = (C_min/2.3·10⁸)⁻⁰·⁰⁵⁰`
- **Orange dashed line**: `L = (C/2.0·10⁷)⁻⁰·⁰⁵⁷`
### Detailed Analysis
1. **Blue Dashed Line (`C_min`)**:
- Starts at ~6.5 test loss at 10⁻⁸ PF-days.
- Decreases steeply, reaching ~3.5 at 10⁻² PF-days.
- Continues declining to ~2.5 at 10⁰ PF-days.
- Equation suggests sensitivity to `C_min` with a weaker exponent (-0.050).
2. **Orange Dashed Line (`C`)**:
- Begins at ~7 test loss at 10⁻⁸ PF-days.
- Declines more gradually than the blue line, reaching ~3.0 at 10⁻² PF-days.
- Converges with the blue line near 10⁻⁴ PF-days, then follows a similar trajectory.
- Equation indicates higher sensitivity to `C` with a steeper exponent (-0.057).
3. **Convergence Point**:
- Both lines intersect near 10⁻⁴ PF-days (~0.0001 PF-days).
- Beyond this point, the lines overlap almost perfectly, suggesting diminishing differences in loss function performance at higher compute levels.
### Key Observations
- **Initial Divergence**: The blue line (`C_min`) starts lower but decreases faster initially, while the orange line (`C`) begins higher but declines more slowly.
- **Logarithmic Scaling**: The x-axis compression emphasizes performance differences at low compute levels (10⁻⁸ to 10⁻⁴ PF-days).
- **Exponent Impact**: The steeper exponent (-0.057) for `C` amplifies its sensitivity to compute increases compared to `C_min` (-0.050).
### Interpretation
The graph demonstrates that both loss functions improve with increased compute, but their efficiency profiles differ:
- **`C_min`** (blue) is more effective at low compute levels, achieving lower loss faster.
- **`C`** (orange) requires more compute to match `C_min`’s performance but becomes equally effective at higher compute levels (post-10⁻⁴ PF-days).
- The convergence implies that optimizing for either loss function becomes equally viable beyond a critical compute threshold (~0.0001 PF-days). This suggests trade-offs in resource allocation: `C_min` may be preferable for constrained compute, while `C` could be better for scalable, high-resource scenarios.
</details>
that in fact we could train more efficiently 6 by training at the batch size B crit discussed in Section 5.1. Large and small values of the loss could have been achieved with fewer samples or fewer steps, respectively, and correcting for this inefficiency by standardizing to the critical batch size results in cleaner and more predictable trends.
In this section we will adjust for this oversight. More importantly, we will use the results of Section 5 to determine the optimal allocation of compute between model size N and the quantity of data processed during training, namely 2 B crit S min . We will determine this allocation both empirically and theoretically, by using the equation for L ( N,S min ) , and we will demonstrate that these methods agree.
## 6.1 Optimal Performance and Allocations
Let us first study the loss as a function of the optimally allocated compute from Equation (5.5). The result is plotted in Figure 13, along with a power-law fit. We see that as compared to the compute plot of Figure 1, the new fit with C min is somewhat improved.
Given L ( C min ) , it is natural to ask for the optimal model size N ( C min ) that provides the minimal loss with a given quantity of training compute. The optimal model size is shown in Figure 14. We observe that N ( C min )
6 One might ask why we did not simply train at B crit in the first place. The reason is that it depends not only on the model but also on the target value of the loss we wish to achieve, and so is a moving target.
Figure 14 Left: Each value of the compute budget C min has an associated optimal model size N . Optimal model size grows very rapidly with C min , increasing by 5x for each 10x increase in compute. The number of data examples processed makes up the remainder of the increase, growing relatively modestly by only 2x. Right: The batch-adjusted number of optimization steps also grows very slowly, if at all, meaning that most of the growth in data examples processed can be used for increased batch sizes.
<details>
<summary>Image 14 Details</summary>

### Visual Description
## Scatter Plots: Compute vs. Parameters and Steps
### Overview
The image contains two side-by-side scatter plots comparing computational efficiency metrics. The left plot shows the relationship between compute (PF-days) and parameters (non-embedding), while the right plot compares compute (excluding embeddings) to steps (training iterations). Both plots use logarithmic scales on the x-axis and linear scales on the y-axis.
### Components/Axes
**Left Plot:**
- **X-axis**: Compute (PF-days), non-embedding (log scale: 1e-7 to 1e-1)
- **Y-axis**: Parameters (non-embedding) (log scale: 1e3 to 1e7)
- **Legend**:
- Blue dashed line: N = 1.3e9, C_min^0.73
- Orange dotted line: N = 1.6e9, C_min^0.88
**Right Plot:**
- **X-axis**: Compute (PF-days), excluding embeddings (log scale: 1e-7 to 1e-1)
- **Y-axis**: Steps (linear scale: 0 to 15,000)
- **Legend**:
- Blue solid circles: S_min (adjusted)
- Blue dashed line: S_min = 5.4e3 * C_min^0.03
- Orange squares: S (fixed-batch)
### Detailed Analysis
**Left Plot Trends:**
1. Both data series show a positive correlation between compute and parameters.
2. The orange series (N=1.6e9) has a steeper slope (C_min^0.88) compared to the blue series (C_min^0.73).
3. At 1e-3 PF-days, parameters reach ~1e6 for N=1.3e9 and ~1e6.5 for N=1.6e9.
4. At 1e-1 PF-days, parameters approach ~1e7 for both series.
**Right Plot Trends:**
1. S_min (adjusted) (blue circles) remains relatively stable (~2,000-4,000 steps) across compute ranges.
2. S_min = 5.4e3 * C_min^0.03 (blue dashed) shows minor fluctuations but stays below 5,000 steps.
3. S (fixed-batch) (orange squares) exhibits exponential growth:
- 1e-7 PF-days: ~1,000 steps
- 1e-5 PF-days: ~5,000 steps
- 1e-3 PF-days: ~10,000 steps
- 1e-1 PF-days: ~15,000 steps
### Key Observations
1. **Non-embedding parameters** scale polynomially with compute, with higher N values yielding steeper growth.
2. **Fixed-batch steps** increase dramatically with compute, while adjusted S_min remains stable.
3. The orange series in the left plot (N=1.6e9) consistently outperforms the blue series (N=1.3e9) in parameter efficiency.
4. The right plot reveals a critical threshold: fixed-batch steps surge beyond 1e-3 PF-days, suggesting diminishing returns for larger compute budgets.
### Interpretation
The data demonstrates fundamental trade-offs in model training:
1. **Parameter Efficiency**: Larger N values (1.6e9 vs. 1.3e9) achieve higher parameter counts per compute unit, but with diminishing returns (exponent 0.88 vs. 0.73).
2. **Step Efficiency**: Fixed-batch training becomes prohibitively expensive at scale, while adjusted S_min maintains stability through dynamic batch sizing.
3. **Compute Thresholds**: The right plot's exponential step growth beyond 1e-3 PF-days suggests a practical limit for fixed-batch training, while adjusted methods remain viable.
These findings highlight the importance of adaptive training strategies (like S_min adjustment) for large-scale model training, particularly when compute resources exceed 1e-3 PF-days.
</details>
can be fit very well with a power-law where
$$N ( C _ { \min } ) \propto ( C _ { \min } ) ^ { 0 . 7 3 } .$$
In Figure 12, we show the effect of training models of sub-optimal sizes (see Appendix B.4).
By definition C_min ≡ 6 N B_crit S , and so we can use N ( C_min ) to extract further results. In particular, since prior fits show B ∝ L^{-4.8} and L ∝ C_min^{-0.05} , we can conclude that B_crit ∝ C_min^{0.24} . This leads us to conclude that the optimal number of steps will only grow very slowly with compute, as
$$S _ { \min } \propto \left ( C _ { \min } \right ) ^ { 0 . 0 3 } , \quad ( 6 . 2 )$$
matching the empirical results in Figure 14. In fact the measured exponent is sufficiently small that our results may even be consistent with an exponent of zero.
Thus we conclude that as we scale up language modeling with an optimal allocation of computation, we should predominantly increase the model size N , while simultaneously scaling up the batch size via B ∝ B crit with negligible increase in the number of serial steps. Since compute-efficient training uses relatively few optimization steps, additional work on speeding up early training dynamics may be warranted.
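The exponent bookkeeping behind this conclusion can be checked in a few lines (a sketch using only relations quoted in this section):

```python
# Sketch of the exponent arithmetic in Section 6.1, using only relations
# quoted in the text: B_crit ~ L^-4.8, L ~ C_min^-0.050, N ~ C_min^0.73,
# and C_min = 6 N B_crit S_min.

p_L = 0.050        # L falls as C_min^-p_L
p_B_of_L = 4.8     # B_crit grows as L^-p_B_of_L
p_N = 0.73         # optimal N grows as C_min^p_N

p_B = p_B_of_L * p_L        # B_crit as a power of C_min
p_S = 1.0 - p_N - p_B       # what is left over for S_min

print(f"B_crit grows as C_min^{p_B:.2f}")   # ~0.24
print(f"S_min  grows as C_min^{p_S:.2f}")   # ~0.03
```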
## 6.2 Predictions from L ( N,S min )
The results for L ( C min ) and the allocations can be predicted from the L ( N,S min ) equation obtained in Section 5. Given our equation for L ( N,S min ) , we can substitute S min = C min 6 NB and then find the minimum of the loss as a function of N , while fixing the training compute. We carry out this procedure in detail in Appendix B, where we also provide some additional predictions.
For the loss as a function of training compute, we predict that
$$L ( C _ { \min } ) = \left ( \frac { C _ { c } ^ { \min } } { C _ { \min } } \right ) ^ { \alpha _ { C } ^ { \min } }$$
$$\alpha _ { C } ^ { \min } \equiv \frac { 1 } { 1 / \alpha _ { S } + 1 / \alpha _ { B } + 1 / \alpha _ { N } } \approx 0 . 0 5 4 \quad ( 6 . 4 )$$
in excellent agreement with the exponent of Figure 13. We also predict that
$$N ( C _ { \min } ) \propto ( C _ { \min } ) ^ { \alpha _ { C } ^ { \min } / \alpha _ { N } } \approx ( C _ { \min } ) ^ { 0 . 7 1 } \quad ( 6 . 5 )$$
which also matches the scaling of Figure 14 to within a few percent. Our scaling laws provide a predictive framework for the performance of language modeling.
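These predictions are simple arithmetic on the fitted exponents; the sketch below uses α_N ≈ 0.076, α_B ≈ 0.21, α_S ≈ 0.76, and the exact output depends on how the individual exponents are rounded.

```python
# Sketch: evaluate the Section 6.2 predictions from the fitted exponents.
ALPHA_N, ALPHA_B, ALPHA_S = 0.076, 0.21, 0.76

alpha_c_min = 1.0 / (1.0 / ALPHA_S + 1.0 / ALPHA_B + 1.0 / ALPHA_N)
print(f"alpha_C_min       ~ {alpha_c_min:.3f}")               # ~0.05, cf. Eq. (6.4)
print(f"N(C_min) exponent ~ {alpha_c_min / ALPHA_N:.2f}")      # ~0.7, cf. Eq. (6.5)
```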
Figure 15 Far beyond the model sizes we study empirically, we find a contradiction between our equations for L ( C min ) and L ( D ) due to the slow growth of data needed for compute-efficient training. The intersection marks the point before which we expect our predictions to break down. The location of this point is highly sensitive to the precise exponents from our power-law fits.
<details>
<summary>Image 15 Details</summary>

### Visual Description
## Line Graph: Test Loss vs. Compute (PF-days), non-embedding
### Overview
The image is a line graph comparing two test loss metrics, **L(C_min)** (dashed yellow line) and **L(D(C))** (solid red line), plotted against compute (PF-days) on a logarithmic scale. The graph includes a note about the sensitivity of the intersection point to power-law parameters.
---
### Components/Axes
- **X-axis**: "Compute (PF-days), non-embedding" (logarithmic scale: 10⁻⁸ to 10⁷).
- **Y-axis**: "Test Loss" (linear scale: 1.5 to 7.5).
- **Legend**:
- Dashed yellow line: **L(C_min)**.
- Solid red line: **L(D(C))**.
- **Note**: "The intersection point is sensitive to the precise power-law parameters" (positioned on the right side of the graph).
---
### Detailed Analysis
1. **L(C_min) (Yellow Dashed Line)**:
- Starts at ~6.0 test loss at 10⁻⁸ PF-days.
- Decreases steeply to ~3.0 at 10⁻² PF-days.
- Continues declining to ~1.5 at 10⁴ PF-days.
- Crosses below **L(D(C))** near 10⁴ PF-days.
2. **L(D(C)) (Red Solid Line)**:
- Starts at ~3.0 test loss at 10⁻⁸ PF-days.
- Decreases linearly to ~1.5 at 10⁷ PF-days.
- Intersects **L(C_min)** at ~10⁴ PF-days.
3. **Intersection Point**:
- Occurs at ~10⁴ PF-days.
- Test loss at intersection: ~1.5–1.7 (approximate due to overlapping lines).
---
### Key Observations
- **L(C_min)** initially decreases faster than **L(D(C))** but eventually becomes less efficient at higher compute levels.
- The intersection suggests a threshold where **L(D(C))** outperforms **L(C_min)** beyond ~10⁴ PF-days.
- The logarithmic x-axis emphasizes differences in compute scales (e.g., 10⁻⁸ vs. 10⁷).
- The note about power-law sensitivity implies the intersection’s position is highly dependent on model-specific parameters.
---
### Interpretation
The graph demonstrates a trade-off between compute efficiency and test loss for two methods. **L(C_min)** is more efficient at low compute levels but becomes outperformed by **L(D(C))** as compute increases. The intersection point’s sensitivity to power-law parameters highlights the importance of precise hyperparameter tuning in optimizing compute-resource allocation. This could inform decisions about when to prioritize one method over the other in resource-constrained environments.
</details>
## 6.3 Contradictions and a Conjecture
We observe no signs of deviation from straight power-law trends at large values of compute, data, or model size. Our trends must eventually level off, though, since natural language has non-zero entropy.
Indeed, the trends for compute-efficient training described in this section already contain an apparent contradiction. At scales several orders of magnitude above those documented here, the performance predicted by the L ( C min ) scaling law decreases below what should be possible given the slow growth in training data with compute. This implies that our scaling laws must break down before this point, but we conjecture that the intersection point has a deeper meaning: it provides an estimate of the point at which Transformer language models reach maximal performance.
Since the amount of data used by compute-efficient training grows slowly with the compute budget, the performance predicted by L ( C min ) eventually hits a lower bound set by the L ( D ) power law (see Figure 15). Let us work this out in more detail.
To keep overfitting under control, the results of Section 4 imply that we should scale the dataset size as
$$D \varpropto N ^ { 0 . 7 4 } \varpropto C _ { \min } ^ { 0 . 5 4 } \quad ( 6 . 6 )$$
where we have used the compute-efficient N ( C min ) from Figure 14.
Let us compare this to the data requirements of compute-efficient training. If we train at the critical batch size (i.e. C = 2 C min ) and never re-use data during training, we find that data usage grows with compute as
$$D ( C _ { \min } ) = \frac { 2 C _ { \min } } { 6 N ( C _ { \min } ) } \approx \left ( 4 \times 1 0 ^ { 1 0 } \ \text {tokens} \right ) \left ( C _ { \min } / \text {PF-day} \right ) ^ { 0 . 2 6 } \quad ( 6 . 7 )$$
This is the maximum rate at which the dataset size can productively grow with compute, since it means that we are only training for a single epoch. But it grows the dataset much more slowly than in Equation (6.6). It appears to imply that compute-efficient training will eventually run into a problem with overfitting, even if the training process never re-uses any data!
According to Figure 1, we expect that when we are bottlenecked by the dataset size (ie by overfitting), the loss should scale as L ( D ) ∝ D^{-0.095} . This implies that the loss would scale with compute as L ( D ( C_min )) ∝ C_min^{-0.03} once we are data-limited. Once again, we have a contradiction, as this will eventually intersect with our prediction for L ( C_min ) from Figure 13, where we found a scaling L ( C_min ) ∝ C_min^{-0.050} .
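The exponent composition can be checked directly (a sketch using only exponents quoted in the text):

```python
# Sketch: with D(C_min) ~ C_min^0.26 (Eq. 6.7) and L(D) ~ D^-0.095, the
# data-limited loss falls only as roughly C_min^-0.025 (quoted as ~C_min^-0.03),
# slower than the compute-efficient L(C_min) ~ C_min^-0.050, so the two
# power laws must eventually cross.

p_D_of_C = 0.26
alpha_D = 0.095

print(f"data-limited loss:     L ~ C_min^-{p_D_of_C * alpha_D:.3f}")
print(f"compute-efficient fit: L ~ C_min^-0.050")
```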
The intersection point of L ( D ( C min )) and L ( C min ) occurs at
$$C ^ { * } \sim 1 0 ^ { 4 } \ \text {PF-days} , \quad N ^ { * } \sim 1 0 ^ { 1 2 } \ \text {parameters} , \quad D ^ { * } \sim 1 0 ^ { 1 2 } \ \text {tokens} , \quad L ^ { * } \sim 1 . 7 \ \text {nats/token} \quad ( 6 . 8 )$$
though the numerical values are highly uncertain, varying by an order of magnitude in either direction depending on the precise values of the exponents from the power-law fits. The most obvious interpretation is that our scaling laws break down at or before we reach this point, which is still many orders of magnitude away in both compute and model size.
One might also conjecture that this intersection point has a deeper meaning. If we cannot increase the model size beyond N ∗ without qualitatively different data requirements, perhaps this means that once we reach C ∗ min and N ∗ , we have extracted all of the reliable information available in natural language data. In this interpretation, L ∗ would provide a rough estimate for the entropy-per-token 7 of natural language. In this scenario, we would expect the loss trend to level off at or before L ∗ .
We can guess at the functional form of L ( C min ) as it levels off by considering a version of our training dataset with added noise. For example, we could append a random string of tokens to each context shown to the model to artificially boost the loss by a constant additive factor. Then, the distance from the noise floor L -L noise would be a more meaningful performance metric, with even a small decrease in this distance potentially representing a significant boost in qualitative performance. Since the artificial noise would affect all of our trends equally, the critical point of Equation (6.8) would not change (aside from the absolute value of L ∗ ), and may be meaningful even if it occurs after the leveling off.
## 7 Related Work
Power laws can arise from a wide variety of sources [THK18]. Power-law scalings with model and dataset size in density estimation [Was06] and in random forest models [Bia12] may be connected with our results. These models suggest that power-law exponents may have a very rough interpretation as the inverse of the number of relevant features in the data.
Some early [BB01, Goo01] work found power-law scalings between performance and dataset size. More recent work [HNA + 17, HAD19] also investigated scaling between model size and data size; their work is perhaps the closest to ours in the literature 8 . Note, however, that [HNA + 17] found super-linear scaling of dataset size with model size, whereas we find a sub-linear scaling. There are some parallels between our findings on optimal allocation of compute and [Kom19], including power-law learning curves. EfficientNets [TL19] also appear to obey an approximate power-law relation between accuracy and model size. Very recent work [RRBS19b] studies scaling with both dataset size and model size for a variety of datasets, and fits an ansatz similar to ours.
EfficientNet [TL19] advocates scaling depth and width exponentially (with different coefficients) for optimal performance of image models, resulting in a power-law scaling of width as a function of depth. We find that for language models this power should be roughly one when scaling up (as width/depth should remain fixed). But more importantly, we find that the precise architectural hyperparameters are unimportant compared to the overall scale of the language model. In [VWB16] it was argued that deep models can function as ensembles of shallower models, which could potentially explain this finding. Earlier work [ZK16] has compared width and depth, and found that wide ResNets can outperform deep ResNets on image classification. Some studies fix computation per data example, which tends to scale in proportion to the number of model parameters, whereas we investigate scaling with both model size and the quantity of training computation.
Various works [AS17, BHMM18] have investigated generalization in highly overparameterized models, finding a 'jamming transition' [GJS + 19] when the model size reaches the dataset size (this may require training many orders of magnitude beyond typical practice, and in particular does not use early stopping). We do not observe such a transition, and find that the necessary training data scales sublinearly in the model size. Expansions in the model size, particularly at large width [JGH18, LXS + 19], may provide a useful framework for thinking about some of our scaling relations. Our results on optimization, such as the shape of learning curves, can likely be explained using a noisy quadratic model, which can provide quite accurate predictions [ZLN + 19] in realistic settings. Making this connection quantitative will require a characterization of the Hessian spectrum [Pap18, GKX19, GARD18].
## 8 Discussion
We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N , dataset size D , and optimized training computation C min , as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N,D,C min are power-laws, there are diminishing returns with increasing scale.
7 Defining words using the wc utility, the WebText2 dataset has 1 . 4 tokens per word and 4 . 3 characters per token.
8 After this work was completed, [RRBS19a] also appeared, which makes similar predictions for the dependence of loss on both model and dataset size.
We were able to precisely model the dependence of the loss on N and D , and alternatively on N and S , when these parameters are varied simultaneously. We used these relations to derive the compute scaling, magnitude of overfitting, early stopping step, and data requirements when training large language models. So our scaling relations go beyond mere observation to provide a predictive framework. One might interpret these relations as analogues of the ideal gas law, which relates the macroscopic properties of a gas in a universal way, independent of most of the details of its microscopic constituents.
It is natural to conjecture that the scaling relations will apply to other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings as well. To this purpose, it will be interesting to test these relations on other domains, such as images, audio, and video models, and perhaps also for random network distillation. At this point we do not know which of our results depend on the structure of natural language data, and which are universal. It would also be exciting to find a theoretical framework from which the scaling relations can be derived: a 'statistical mechanics' underlying the 'thermodynamics' we have observed. Such a theory might make it possible to derive other more precise predictions, and provide a systematic understanding of the limitations of the scaling laws.
In the domain of natural language, it will be important to investigate whether continued improvement on the loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major qualitative improvements: 'more is different'. For example, the smooth aggregate growth of the economy provides no indication of the specific technological developments that underwrite it. Similarly, the smooth improvements in language model loss may hide seemingly qualitative changes in capability.
Our results strongly suggest that larger models will continue to perform better, and will also be much more sample efficient than has been previously appreciated. Big models may be more important than big data. In this context, further investigation into model parallelism is warranted. Deep models can be trained using pipelining [HCC + 18], which splits parameters depth-wise between devices, but eventually requires increased batch sizes as more devices are used. Wide networks on the other hand are more amenable to parallelization [SCP + 18], since large layers can be split between multiple workers with less serial dependency. Sparsity [CGRS19, GRK17] or branching (e.g. [KSH12]) may allow for even faster training of large networks through increased model parallelism. And using methods like [WRH17, WYL19], which grow networks as they train, it might be possible to remain on the compute-efficient frontier for an entire training run.
## Acknowledgements
We would like to thank Shan Carter, Paul Christiano, Jack Clark, Ajeya Cotra, Ethan Dyer, Jason Eisner, Danny Hernandez, Jacob Hilton, Brice Menard, Chris Olah, and Ilya Sutskever for discussions and for feedback on drafts of this work.
## Appendices
## A Summary of Power Laws
For easier reference, we provide a summary below of the key trends described throughout the paper.
Table 4
| Parameters | Data | Compute | Batch Size | Equation |
|------------|-------|------------|------------|------------------------------------------------------------|
| N | ∞ | ∞ | Fixed | L(N) = (N_c / N)^{α_N} |
| ∞ | D | Early Stop | Fixed | L(D) = (D_c / D)^{α_D} |
| Optimal | ∞ | C | Fixed | L(C) = (C_c / C)^{α_C} (naive) |
| N_opt | D_opt | C_min | B ≪ B_crit | L(C_min) = (C_c^min / C_min)^{α_C^min} |
| N | D | Early Stop | Fixed | L(N,D) = [ (N_c / N)^{α_N/α_D} + D_c / D ]^{α_D} |
| N | ∞ | S steps | B | L(N,S) = (N_c / N)^{α_N} + (S_c / S_min(S,B))^{α_S} |
The empirical fitted values for these trends are:
Table 5
| Power Law | Scale (tokenization-dependent) |
|------------------|-----------------------------------------|
| α_N = 0.076 | N_c = 8.8 × 10^13 params (non-embed) |
| α_D = 0.095 | D_c = 5.4 × 10^13 tokens |
| α_C = 0.057 | C_c = 1.6 × 10^7 PF-days |
| α_C^min = 0.050 | C_c^min = 3.1 × 10^8 PF-days |
| α_B = 0.21 | B_* = 2.1 × 10^8 tokens |
| α_S = 0.76 | S_c = 2.1 × 10^3 steps |
The optimal parameters for compute efficient training are given by:
Table 6
| Compute-Efficient Value | Power Law | Scale |
|-----------------------------------------------------|--------------|----------------------------|
| N_opt = N_e · C_min^{p_N} | p_N = 0.73 | N_e = 1.3 · 10^9 params |
| B ≪ B_crit = B_* / L^{1/α_B} = B_e · C_min^{p_B} | p_B = 0.24 | B_e = 2.0 · 10^6 tokens |
| S_min = S_e · C_min^{p_S} (lower bound) | p_S = 0.03 | S_e = 5.4 · 10^3 steps |
| D_opt = D_e · C_min^{p_D} (1 epoch) | p_D = 0.27 | D_e = 2 · 10^10 tokens |
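As a usage sketch (ours, not from the paper), the Table 6 frontier can be evaluated for a given budget C_min measured in PF-days; the example budget is illustrative.

```python
# Sketch: evaluate the Table 6 compute-efficient frontier at a budget C_min (PF-days).
N_E, P_N = 1.3e9, 0.73      # params
B_E, P_B = 2.0e6, 0.24      # tokens
S_E, P_S = 5.4e3, 0.03      # steps
D_E, P_D = 2e10, 0.27       # tokens

def frontier(c_min_pf_days: float) -> dict:
    c = c_min_pf_days
    return {
        "N_opt (params)": N_E * c ** P_N,
        "B (tokens)":     B_E * c ** P_B,
        "S_min (steps)":  S_E * c ** P_S,
        "D_opt (tokens)": D_E * c ** P_D,
    }

if __name__ == "__main__":
    for name, value in frontier(1.0).items():    # an illustrative 1 PF-day budget
        print(f"{name}: {value:.2e}")
```

At C_min = 1 PF-day this returns the N_e, B_e, S_e, D_e scales directly, and 6 · N_opt · B · S_min ≈ 8 × 10^19 FLOPs, close to the 8.64 × 10^19 FLOPs in one PF-day, consistent with the definition C_min = 6 N B_crit S_min.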
## B Empirical Model of Compute-Efficient Frontier
Throughout this appendix all values of C, S, and α C are adjusted for training at the critical batch size B crit . We have left off the 'adj' label to avoid cluttering the notation.
## B.1 Defining Equations
The power-law fit to the learning curves implies a simple prescription for compute-efficient training. In this appendix, we will derive the optimal performance, model size, and number of training steps as a function of
the compute budget. We start with Equation (1.6), repeated here for convenience:
$$L \left ( N , S \right ) = \left ( \frac { N _ { c } } { N } \right ) ^ { \alpha _ { N } } + \left ( \frac { S _ { c } } { S } \right ) ^ { \alpha _ { S } } .$$
Here, S represents the number of parameter updates when training at the critical batch size [MKAT18], which was defined in Equation (5.2) 9 :
$$B \left ( L \right ) = \frac { B _ { * } } { L ^ { 1 / \alpha _ { B } } } .$$
We would like to determine optimal training parameters for a fixed compute budget, so we replace S = C/ (6 NB ( L )) , where C is the number of FLOPs used in the training run:
$$L \left ( N , C \right ) = \left ( \frac { N _ { c } } { N } \right ) ^ { \alpha _ { N } } + \left ( 6 B _ { * } S _ { c } \frac { N } { L ^ { 1 / \alpha _ { B } } C } \right ) ^ { \alpha _ { S } } .$$
Now, we set ∂ N L ∣ ∣ C = 0 to find the condition for optimality:
$$0 = \frac { \partial L } { \partial N } \Big | _ { C } = - \frac { \alpha _ { N } } { N } \left ( \frac { N _ { c } } { N } \right ) ^ { \alpha _ { N } } + \frac { \alpha _ { S } } { N } \left ( 6 B _ { * } S _ { c } \frac { N } { L ^ { 1 / \alpha _ { B } } C } \right ) ^ { \alpha _ { S } } \left ( 1 - \frac { 1 } { \alpha _ { B } } \frac { N } { L } \frac { \partial L } { \partial N } \right ) \\ \Rightarrow \, \frac { \alpha _ { N } } { \alpha _ { S } } \left ( \frac { N _ { c } } { N } \right ) ^ { \alpha _ { N } } = \left ( 6 B _ { * } S _ { c } \frac { N } { L ^ { 1 / \alpha _ { B } } C } \right ) ^ { \alpha _ { S } }$$
Equations (B.3) and (B.4) together determine the compute-efficient frontier.
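As a numerical illustration (ours, not derived in the paper), the frontier can also be traced directly: for a fixed FLOP budget, solve Equation (B.3) for L self-consistently at each N (L appears on both sides through B(L)) and minimize over N. The constants below are the Table 3 and Table 5 fits; the budgets are illustrative, and the resulting N(C) follows this predicted frontier, which matches the empirical Table 6 fit only up to the differences in exponents discussed in Section 6.2.

```python
# Sketch: numerically locate the compute-efficient model size implied by Eq. (B.3).
ALPHA_N, ALPHA_S, ALPHA_B = 0.077, 0.76, 0.21   # Table 3 / Table 5 fits
N_C, S_C, B_STAR = 6.5e13, 2.1e3, 2.1e8
PF_DAY = 8.64e19                                 # FLOPs in one petaflop/s-day

def loss_at_fixed_compute(n_params: float, c_flops: float) -> float:
    """Solve L = (N_c/N)^a_N + (6 B_* S_c N / (L^(1/a_B) C))^a_S for L by bisection."""
    a = (N_C / n_params) ** ALPHA_N
    k = (6 * B_STAR * S_C * n_params / c_flops) ** ALPHA_S
    p = ALPHA_S / ALPHA_B

    def resid(loss: float) -> float:
        return loss - a - k * loss ** (-p)

    lo, hi = a, a + k * a ** (-p) + 1.0          # resid(lo) < 0 < resid(hi)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if resid(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def compute_efficient_n(c_flops: float):
    """Scan log-spaced model sizes for the one minimizing L at fixed compute."""
    grid = [10 ** (6 + 0.05 * i) for i in range(100)]       # N from 1e6 to ~1e11
    return min(((n, loss_at_fixed_compute(n, c_flops)) for n in grid), key=lambda t: t[1])

if __name__ == "__main__":
    for pf_days in (1e-2, 1e-1, 1e0):
        n_best, l_best = compute_efficient_n(pf_days * PF_DAY)
        print(f"C = {pf_days:.0e} PF-days: N_opt ~ {n_best:.1e}, L ~ {l_best:.2f}")
```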
## B.2 Efficient Training
Now we assemble the implications of (B.3) and (B.4). First, note that inserting (B.4) into (B.3) yields
$$L \left ( N _ { e f f } \left ( C \right ) , C \right ) = \left ( 1 + \frac { \alpha _ { N } } { \alpha _ { S } } \right ) L \left ( N _ { e f f } , \infty \right ) , \quad ( B . 5 )$$
which implies that for compute-efficient training, we should train to a fixed percentage α N /α S ≈ 10% above the converged loss. Next, let's determine how the optimal loss depends on the compute budget. Eliminating N yields a power-law dependence of performance on compute:
$$L \left ( C \right ) = \left ( \frac { C _ { c } } { C } \right ) ^ { \alpha _ { c } }$$
$$\alpha _ { C } = 1 / \left ( 1 / \alpha _ { S } + 1 / \alpha _ { B } + 1 / \alpha _ { N } \right ) \approx 0 . 0 5 2$$
<!-- formula-not-decoded -->
Similarly, we can eliminate L to find N ( C ) :
$$\frac { N \left ( C \right ) } { N _ { c } } = \left ( \frac { C } { C _ { c } } \right ) ^ { \alpha _ { C } / \alpha _ { N } } \left ( 1 + \frac { \alpha _ { N } } { \alpha _ { S } } \right ) ^ { 1 / \alpha _ { N } }$$
$$S \left ( C \right ) = \frac { C _ { c } } { 6 N _ { c } B _ { * } } \left ( 1 + \frac { \alpha _ { N } } { \alpha _ { S } } \right ) ^ { - 1 / \alpha _ { N } } \left ( \frac { C } { C _ { c } } \right ) ^ { \alpha _ { C } / \alpha _ { S } } \quad \text {(B.10)}$$
9 There is a slight ambiguity here: we can imagine training either at a constant batch size B ( L target ) , or we could instead train at a variable batch size ˜ B ( L ) , where ˜ B is the instantaneous critical batch size (as opposed to B , which is the averaged version). These two prescriptions result in the same number of steps, so we can ignore this subtlety (see [MKAT18]).
where we have used the definitions above.
## B.3 Comparison to Inefficient
Typically, researchers train models until they appear to be close to convergence. In this section, we compare the efficient training procedure described above to this more typical setup. We define the convergence factor f as the percent deviation from the converged loss:
$$L \left ( N , C \right ) = \left ( 1 + f \right ) L \left ( N , \infty \right ) .$$
For compute-efficient training we have f = α N /α S ≈ 10% from the previous section, but researchers typically use a much smaller value. Here, we choose f ′ = 2% as an estimate. For a fixed value of the loss, we predict:
$$\frac{S_f}{S_{f'}} = \left(\frac{1 + 1/f}{1 + 1/f'}\right)^{1/\alpha_S} \approx 0.13$$
$$\frac{N_f}{N_{f'}} = \left(\frac{1 + f}{1 + f'}\right)^{1/\alpha_N} \approx 2.7$$
$$\frac{C_f}{C_{f'}} = \frac{N_f S_f}{N_{f'} S_{f'}} \approx 0.35$$
where the subscripts indicate the convergence factor at which training is stopped.
So that compute-efficient training uses 7.7x fewer parameter updates, 2.7x more parameters, and 65% less compute to reach the same loss.
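These factors follow directly from the expressions above; a minimal numerical sketch, assuming the fitted exponents $\alpha_N \approx 0.076$ and $\alpha_S \approx 0.76$:

```python
# Sketch: compute-efficient training (f = alpha_N / alpha_S) versus training
# nearly to convergence (f' = 2%) at a fixed value of the loss.
alpha_N, alpha_S = 0.076, 0.76        # fitted exponents quoted in the paper
f, f_prime = alpha_N / alpha_S, 0.02

steps_ratio = ((1 + 1 / f) / (1 + 1 / f_prime)) ** (1 / alpha_S)    # S_f / S_f'
params_ratio = ((1 + f) / (1 + f_prime)) ** (1 / alpha_N)           # N_f / N_f'
compute_ratio = steps_ratio * params_ratio                          # C_f / C_f'

print(f"{1 / steps_ratio:.1f}x fewer parameter updates")   # cf. ~7.7x quoted above
print(f"{params_ratio:.1f}x more parameters")              # cf. ~2.7x quoted above
print(f"{1 - compute_ratio:.0%} less compute")             # cf. ~65% quoted above
```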
## B.4 Suboptimal Model Sizes
We can solve A.1 to find an expression for the amount of compute needed to reach a given value of the loss L with a model of size N :
$$C(N, L) = \frac{6 B_* S_c N}{L^{1/\alpha_B}}\left(L - \left(\frac{N_c}{N}\right)^{\alpha_N}\right)^{-1/\alpha_S}$$
Using A.6 and A.9, we can eliminate $L$ in favor of $N_{\rm eff}(L)$, the model size which reaches $L$ most efficiently. From there, we find an expression for the excess compute needed as a consequence of using a suboptimal model size:
$$\frac{C(N, L)}{C(N_{\rm eff}, L)} = \frac{N}{N_{\rm eff}}\left[1 + \frac{\alpha_S}{\alpha_N}\left(1 - \left(\frac{N_{\rm eff}}{N}\right)^{\alpha_N}\right)\right]^{-1/\alpha_S}$$
The result is shown in Figure 12 (left). Models between 0.6x and 2.2x the optimal size can be used with only a 20% increase in compute budget. Using a smaller model is useful when accounting for the cost of inference. A larger model can be trained to the same level of performance in fewer steps, allowing for more parallelism and faster training if sufficient hardware is available (see Figure 12, right):
$$\frac{S(N, L)}{S(N_{\rm eff}, L)} = \left[1 + \frac{\alpha_S}{\alpha_N}\left(1 - \left(\frac{N_{\rm eff}}{N}\right)^{\alpha_N}\right)\right]^{-1/\alpha_S}$$
A 2.2x larger model requires 45% fewer steps at a cost of 20% more training compute. Note that this equation should not be trusted for very large models, as it is only valid in the power-law region of the learning curve, after initial transient effects.
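Both ratios are easy to evaluate numerically. A minimal sketch, again assuming $\alpha_N \approx 0.076$ and $\alpha_S \approx 0.76$, reproduces the 20% and 45% figures quoted above:

```python
# Sketch: excess compute and serial steps when training a model of size
# N = r * N_eff instead of the compute-optimal size N_eff, at fixed loss.
alpha_N, alpha_S = 0.076, 0.76   # fitted exponents quoted in the paper

def bracket(r):
    return 1 + (alpha_S / alpha_N) * (1 - r ** (-alpha_N))

def compute_ratio(r):
    """C(N) / C(N_eff) at fixed loss, for N = r * N_eff."""
    return r * bracket(r) ** (-1 / alpha_S)

def step_ratio(r):
    """S(N) / S(N_eff) at fixed loss, for N = r * N_eff."""
    return bracket(r) ** (-1 / alpha_S)

for r in (0.6, 1.0, 2.2):
    print(f"N = {r}x N_eff: compute x{compute_ratio(r):.2f}, "
          f"steps x{step_ratio(r):.2f}")
# Both the 0.6x and the 2.2x model cost roughly 20% extra compute, and the
# 2.2x model needs roughly 45% fewer serial steps, as quoted above.
```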
## C Caveats
In this section we list some potential caveats to our analysis.
- At present we do not have a solid theoretical understanding for any of our proposed scaling laws. The scaling relations with model size and compute are especially mysterious. It may be possible to understand scaling at very large D holding model size fixed [AS17], and also the shape of learning curves late in training, by modeling the loss with a noisy quadratic. But the scaling with D at very large model size still remains mysterious. Without a theory or a systematic understanding of the corrections to our scaling laws, it's difficult to determine in what circumstances they can be trusted.
Figure 16 Left: We characterize the step on which early stopping occurs, as a function of the extent of overfitting. The red line indicates a lower bound for early stopping that is derived in Section 5.3. Right: We display train and test loss for a series of 300M parameter models trained on different sized dataset subsamples. The test loss typically follows that of a run done with unrestricted data until diverging. Note that the degree of overfitting (as compared to the infinite data limit) is significantly overestimated by L test -L train (denoted by a black bar for each run).
- We are not especially confident in the prediction of B crit ( L ) for values of the loss far outside the range we have explored. Changes in B crit could have a significant impact on trade-offs between data parallelism and the number of serial training steps required, which would have a major impact on training time.
- We did not thoroughly investigate the small data regime, and our fits for L ( N,D ) were poor for the smallest values of D (where an epoch corresponded to only 40 steps). Furthermore, we did not experiment with regularization and data augmentation. Improvements in these could alter our results, quantitatively or qualitatively.
- We used the estimated training compute $C \approx 6NBS$, which did not include contributions proportional to $n_{\rm ctx}$ (see Section 2.1). So our scalings with compute may be confounded in practice in the regime of very large $n_{\rm ctx}$, specifically where $n_{\rm ctx} \gtrsim 12\, d_{\rm model}$.
- We tuned learning rates, and we experimented with learning rate schedules. But we may have neglected to tune some hyperparameter (e.g. initialization scale or momentum) that has an important effect on scaling.
- The optimal choice of learning rate is sensitive to the target loss. When training close to convergence, it may be necessary to use a smaller learning rate to avoid divergences. But when conducting a short training run (e.g. due to compute limitations), it may be possible to use a larger learning rate. We did not experiment with higher learning rates for training runs that did not proceed to convergence.
## D Supplemental Figures
## D.1 Early Stopping and Test vs Train
In Section 5.3 we described the result shown in Figure 16, which provides a prediction for a lower bound on the early stopping step. We also show the train and test loss for a given model size when training on different sized datasets.
## D.2 Universal Transformers
We compare the performance of standard Transformers to recurrent Transformers [DGV + 18] in Figure 17. These models re-use parameters, and so perform slightly better as a function of N , but slightly worse as a function of compute C . We include several different possibilities for parameter re-use.
## D.3 Batch Size
We measure the critical batch size using the data displayed in Figure 18. This made it possible to estimate $B_{\rm crit}(L)$ in Figure 10.
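To make the procedure concrete, the sketch below shows one way such a fit could be carried out: for each fixed value of the loss, the (steps, examples) pairs from runs at different batch sizes are fit to the hyperbola of Equation (5.1), and the critical batch size follows as $B_{\rm crit} = E_{\rm min}/S_{\rm min}$. The function names, synthetic data, and use of scipy's least-squares fitting are illustrative choices, not the exact procedure used for Figure 18.

```python
# Illustrative fit of Equation (5.1), (S/S_min - 1)(E/E_min - 1) = 1, to
# (steps, examples) pairs that reach a fixed loss at different batch sizes.
# The critical batch size then follows as B_crit = E_min / S_min.
import numpy as np
from scipy.optimize import curve_fit

def steps_for_examples(E, S_min, E_min):
    # Solve (S/S_min - 1)(E/E_min - 1) = 1 for S as a function of E.
    return S_min * (1 + 1 / (E / E_min - 1))

def fit_critical_batch_size(steps, examples):
    """steps, examples: arrays of (S, E) pairs reaching the same loss."""
    (S_min, E_min), _ = curve_fit(
        steps_for_examples, examples, steps,
        p0=(steps.min(), 0.5 * examples.min()))
    return E_min / S_min   # B_crit at this value of the loss

# Synthetic check: data generated from the same relation is recovered.
S_min_true, E_min_true = 2.0e3, 4.0e8
E = np.array([6.0e8, 1.0e9, 3.0e9, 1.0e10, 1.0e11])
S = steps_for_examples(E, S_min_true, E_min_true)
print(fit_critical_batch_size(S, E))   # ~2e5 = E_min_true / S_min_true
```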
Figure 17 We compare recurrent Transformers [DGV + 18], which re-use parameters, to standard Transformers. Recurrent Transformers perform slightly better when comparing models with equal parameter count, but slightly worse when accounting for reuse and comparing per FLOP.
Figure 18 These figures demonstrate fits to Equation (5.1) for a large number of values of the loss L , and for two different Transformer model sizes. These fits were used to measure B crit ( L ) for Figure 10.
## D.4 Sample Efficiency vs Model Size
It is easy to see from Figure 2 that larger models train faster, and are therefore more sample efficient. We provide another way of looking at this phenomenon in Figure 19, which shows when different models reach various fixed values of the loss.
Figure 19 The minimum number of serial steps needed to reach any fixed value of the test loss decreases precipitously with model size. Sample efficiency (shown here for training far below the critical batch size) improves greatly as well, improving by a factor of almost 100 when comparing the smallest possible model to a very large one.
Figure 20 This figure provides information about the performance per token as a function of model size and training time. Left: Loss per token as a function of its position T in the 1024-token context. Loss scales predictably as a power-law in T . Right: Test loss per token as a function of training step.
Figure 21 In addition to the averaged loss, individual tokens within the 1024-token context also improve smoothly as model size increases. Training runs with shorter context n ctx = 8 (dashed lines) perform better on early tokens, since they can allocate all of their capacity to them.
## D.5 Context Dependence
The trends for loss as a function of model size are displayed for different tokens in the context in Figure 21. We see that models trained on n ctx = 1024 show steady improvement with model size on all but the first token.
Fixing model size, it appears that the loss scales as a power-law as a function of position T in the context, see Figure 20. This may be a consequence of underlying power-law correlations in language [EP94, ACDE12, LT16], or a more general feature of the model architecture and optimization. It provides some suggestion for the potential benefits (or lack thereof) from training on larger contexts. Not only do larger models converge to better performance at T = 1024 , but they also improve more quickly at early tokens, suggesting that larger models are more efficient at detecting patterns with less contextual information. In the right-hand plot we show how per-token performance varies for a fixed model as a function of the training step. The model begins by learning short-range information, and only learns longer-range correlations later in training.
We have also included models trained with a tiny context n ctx = 8 in order to compare with our longer context models. Even modestly sized models trained on n ctx = 8 can dominate our largest n ctx = 1024 models on very early tokens. This also suggests that further improvements should be possible with much larger models trained on large contexts.
## D.6 Learning Rate Schedules and Error Analysis
We experimented with a variety of learning rates and schedules. A host of schedules and resulting test performances for a small language model are plotted in Figure 22. We conclude that the choice of learning rate schedule is mostly irrelevant, as long as the total summed learning rate is sufficiently large, and the schedule includes a warmup period and a final decay to near-vanishing learning rate. Variations among
Figure 22 We test a variety of learning rate schedules including cosine decay, linear decay, as well as other faster/slower decay schedules on a 3 million parameter model, shown on the left. For these experiments we do not decay to zero, since we find that this tends to give a fixed improvement close to the end of training. We find that, as long as the learning rate is not too small and does not decay too quickly, performance does not depend strongly on learning rate. Run-to-run variation is at the level of 0.05 in the loss, so averaging multiple runs is necessary to validate performance changes smaller than this level.
Figure 23 At a qualitative level, the trend for performance as a function of parameter count, L ( N ) , is fit better by a power-law than by other functions such as a logarithm.
schedules appear to be statistical noise, and provide a rough gauge for the scale of variation between different training runs. Experiments on larger models suggest that the variation in the final test loss between different random seeds is roughly constant in magnitude for different model sizes.
We found that larger models require a smaller learning rate to prevent divergence, while smaller models can tolerate a larger learning rate. To implement this, the following rule of thumb was used for most runs:
$$\text{LR}(N) \approx 0.003239 - 0.0001395 \log(N)$$
We expect that this formula could be improved. There may be a dependence on network width, likely set by the initialization scale. The formula also breaks down for N > 10 10 parameters. Nevertheless, we found that it works sufficiently well for the models we considered.
## D.7 Fit Details and Power Law Quality
We experimented with a number of functional forms for the fits to L ( N ) , L ( C ) , and L ( D ) ; the power-law fits were qualitatively much more accurate than other functions such as logarithms (see Figure 23).
For L ( C ) , we do not include small models with only 1 layer in the fit, as the transition from 1 to 2 layers causes a noticeable lump in the data. For L ( N ) we also do not include very small models with only 1 layer in the fit, and we exclude the largest models that have not trained fully to convergence. Fit parameters change marginally if we do include them, and the trend extrapolates well in both directions regardless.
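For illustration only, a minimal sketch of such a functional-form comparison on synthetic data; the constants, noise level, and use of simple linear fits in transformed variables are assumptions for the example, not our actual fitting procedure:

```python
# Illustrative comparison of power-law vs. logarithmic fits to L(N) on
# synthetic data (constants and noise level are made up for the example);
# both candidate forms reduce to straight-line fits after a change of variables.
import numpy as np

rng = np.random.default_rng(0)
N = np.logspace(5, 9, 20)
L = (8.8e13 / N) ** 0.076 + rng.normal(0, 0.01, N.size)

# Power law  L = (N_c / N)^alpha  ->  log L is linear in log N
coef_pl = np.polyfit(np.log(N), np.log(L), 1)
L_pl = np.exp(np.polyval(coef_pl, np.log(N)))

# Logarithm  L = a - b log N      ->  L is linear in log N
coef_log = np.polyfit(np.log(N), L, 1)
L_log = np.polyval(coef_log, np.log(N))

for name, pred in [("power law", L_pl), ("logarithm", L_log)]:
    rms = np.sqrt(np.mean((pred - L) ** 2))
    print(f"{name}: rms residual = {rms:.3f}")
```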
## D.8 Generalization and Architecture
In Figure 24 we show that generalization to other data distributions does not depend on network depth when we hold the total parameter count fixed. It seems to depend only on the performance on the training distribution.
Figure 24 We show evaluations on a series of datasets for models with approximately 1.5 Billion parameters. We observe no effect of depth on generalization; generalization performance depends primarily on training distribution performance. The 12-layer model overfit the Internet Books dataset and we show the early-stopped performance; we have not seen this surprising result in other experiments.
## List of Figures
| 1 | Summary of simple power laws | 3 |
|-----|--------------------------------------------------|-----|
| 2 | Illustration of sample efficiency and compute efficiency | 4 |
| 3 | How to scale up model size, batch size, and serial steps | 4 |
| 4 | Performance when varying model and data size, or model and training steps, simultaneously | 5 |
| 5 | Weak dependence of performance on hyperparameter tuning | 8 |
| 6 | Comparison of performance trend when including or excluding embeddings | 8 |
| 7 | LSTM and Transformer performance comparison | 9 |
| 8 | Generalization to other test datasets | 10 |
| 9 | Universality of overfitting | 11 |
| 10 | Critical batch size | 12 |
| 11 | Performance versus compute budget or number of parameter updates | 14 |
| 12 | Training on suboptimal models | 15 |
| 13 | Comparison between empirical and adjusted compute trends | 15 |
| 14 | Optimal model size and serial number of steps versus compute budget | 16 |
| 15 | Contradiction between compute and data trends | 17 |
| 16 | Early stopping lower bound and training curves for overfit models | 23 |
| 17 | Universal transformers | 24 |
| 18 | Batch size scans | 24 |
| 19 | Another look at sample efficiency | 24 |
| 20 | Power-law dependence of performance on position in context | 25 |
| 21 | Performance at different context positions versus model size | 25 |
| 22 | Learning rate schedule scan | 26 |
| 23 | Comparison of Power-Law and Logarithmic Fits | 26 |
| 24 | Generalization versus depth | 27 |
## List of Tables
| 1 | Parameter and compute counts for Transformer | 7 |
|-----|------------------------------------------------|-----|
| 2 | Fits to L(N, D) | 11 |
| 3 | Fits to L(N, S) | 14 |
| 4 | Key trend equations | 20 |
| 5 | Key parameters to trend fits | 20 |
| 6 | Trends for compute-efficient training | 20 |
## References
- [ACDE12] Eduardo G Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti. On the origin of long-range correlations in texts. Proceedings of the National Academy of Sciences, 109(29):11582-11587, 2012. 25
- [AS17] Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv, 2017, 1710.03667. 11, 18, 22
- [BB01] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual meeting on association for computational linguistics, pages 26-33. Association for Computational Linguistics, 2001. 18
- [BHMM18] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv, 2018, 1812.11118. 18
- [Bia12] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13(Apr):1063-1095, 2012. 18
- [CGRS19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019, 1904.10509. URL http://arxiv.org/abs/1904.10509. 19
- [DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2018, arXiv:1810.04805. 2
- [DGV + 18] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. CoRR, abs/1807.03819, 2018, 1807.03819. URL http://arxiv.org/abs/1807.03819. 6, 9, 23, 24
- [EP94] Werner Ebeling and Thorsten Pöschel. Entropy and long-range correlations in literary english. EPL (Europhysics Letters), 26(4):241, 1994. 25
- [Fou] The Common Crawl Foundation. Common crawl. URL http://commoncrawl.org. 7
- [GARD18] Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. 2018, arXiv:1812.04754. 18
- [GJS + 19] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. arXiv, 2019, 1901.01608. 18
- [GKX19] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. CoRR, abs/1901.10159, 2019, 1901.10159. URL http://arxiv.org/abs/1901.10159. 18
- [Goo01] Joshua Goodman. A bit of progress in language modeling. CoRR, cs.CL/0108005, 2001. URL http://arxiv.org/abs/cs.CL/0108005. 18
- [GRK17] Scott Gray, Alec Radford, and Diederik P Kingma. Gpu kernels for block-sparse weights. openai.com, 2017. 19
- [HAD19] Joel Hestness, Newsha Ardalani, and Gregory Diamos. Beyond human-level accuracy: Computational challenges in deep learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, PPoPP '19, pages 1-14, New York, NY, USA, 2019. ACM. doi:10.1145/3293883.3295710. 18
- [HCC + 18] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. CoRR , abs/1811.06965, 2018, 1811.06965. URL http://arxiv.org/abs/1811.06965 . 19
- [HNA + 17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017, 1712.00409. 18
- [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems , pages 8571-8580, 2018. 18
- [KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014, 1412.6980. 7
- [Kom19] Aran Komatsuzaki. One epoch is all you need, 2019, arXiv:1906.06669. 18
- [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 , NIPS'12, pages 1097-1105, USA, 2012. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999257 . 19
- [LCG + 19] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations, 2019, 1909.11942. 9
- [LOG + 19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR , abs/1907.11692, 2019, 1907.11692. URL http://arxiv.org/abs/ 1907.11692 . 2
- [LSP + 18] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. arXiv:1801.10198 [cs], 2018, 1801.10198. URL http://arxiv.org/abs/1801.10198. 2, 6
- [LT16] Henry W Lin and Max Tegmark. Criticality in formal languages and statistical physics. arXiv preprint arXiv:1606.06737 , 2016. 25
- [LXS + 19] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, arXiv:1902.06720. 18
- [MKAT18] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training, 2018, arXiv:1812.06162. 3, 5, 6, 12, 13, 21
- [Pap18] Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. CoRR , abs/1811.07062, 2018, 1811.07062. URL http://arxiv.org/abs/1811.07062 . 18
- [RNSS18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018. 2, 6
- [RRBS19a] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019, 1909.12673. 18
- [RRBS19b] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019, arXiv:1909.12673. 18
- [RSR + 19] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019, arXiv:1910.10683. 2
- [RWC + 19] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. openai.com , 2019. 2, 5, 6, 7, 8
- [SCP + 18] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-tensorflow: Deep learning for supercomputers, 2018, 1811.02084. 19
- [SHB15] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. CoRR , 2015, 1508.07909. 6
- [SLA + 18] Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training, 2018, arXiv:1811.03600. 12
- [SS18] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR , abs/1804.04235, 2018, 1804.04235. URL http://arxiv.org/abs/1804.04235 . 7
- [THK18] Stefan Thurner, Rudolf Hanel, and Peter Klimek. Introduction to the theory of complex systems . Oxford University Press, 2018. 18
- [TL19] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. CoRR , abs/1905.11946, 2019, 1905.11946. URL http://arxiv.org/abs/1905. 11946 . 18
- [VSP + 17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. 2, 6
- [VWB16] Andreas Veit, Michael Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks, 2016, arXiv:1605.06431. 8, 18
- [Was06] Larry Wasserman. All of nonparametric statistics . Springer Science & Business Media, 2006. 18
- [WPN + 19] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2019, 1905.00537. 2
- [WRH17] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , Jul 2017. doi:10.1109/cvpr.2017.323. 19
- [WYL19] Wei Wen, Feng Yan, and Hai Li. Autogrow: Automatic layer growing in deep convolutional networks, 2019, 1906.02909. 19
- [YDY + 19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding, 2019, arXiv:1906.08237. 2
- [ZK16] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. Procedings of the British Machine Vision Conference 2016 , 2016. doi:10.5244/c.30.87. 18
- [ZKZ + 15] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV) , Dec 2015. doi:10.1109/iccv.2015.11. 7
- [ZLN + 19] Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger B. Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. CoRR , abs/1907.04164, 2019, 1907.04164. URL http://arxiv.org/abs/1907.04164 . 12, 18